As a data scientist, a huge part of your job involves working with datasets. You need them to build, train, and deploy models. However, you may have noticed that the classes in these datasets are not always equally represented. Consider a dataset with two classes: one is large and over-represented, while the other is small and under-represented. This problem is known as class imbalance, and building a classifier on such data is known as imbalanced classification.
Training a model on an imbalanced dataset may produce biased results. Since you don’t want that to happen, how exactly can you correct it, especially when the fault is not in the data collection process and little can be done to produce data that isn’t there?
In this article, we will discuss ways to handle class imbalance.
Here's A Case Study
For the purposes of this explanation, we will be using the fraud dataset for company XYZ as our sample dataset. It is a tabular dataset with rows and columns.
The rows contain transactions (either fraudulent or legitimate) and the columns describe the different features of each transaction (e.g. location, time, user information, amount, etc.).
As you can imagine, the percentage of fraudulent transactions is very small compared to legitimate transactions.
In this example, we assume that the XYZ dataset is imbalanced such that only 8% of the records represent fraudulent transactions and the remaining 92% are non-fraud records.
As you might expect, any model trained on this dataset without corrective measures will be biased toward the larger class and will represent the fraudulent transactions poorly.
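Before applying any fix, it helps to measure how skewed the labels actually are. The sketch below assumes a hypothetical list of labels (1 for fraud, 0 for non-fraud) built to mirror the 8% / 92% split assumed above. The function name `class_distribution` is illustrative and not part of any library.

```python
from collections import Counter

# Hypothetical labels mirroring the 8% / 92% split:
# 1 = fraudulent, 0 = non-fraudulent.
labels = [1] * 80 + [0] * 920


def class_distribution(labels):
    """Return each class's share of the dataset as a fraction."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: count / total for cls, count in counts.items()}


distribution = class_distribution(labels)

# Ratio of the largest class share to the smallest class share.
imbalance_ratio = max(distribution.values()) / min(distribution.values())

print(distribution)
print(imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced; the further it is above 1, the more the larger class dominates.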
Our goal is to combat this using techniques that will be described below. But first, what could cause imbalance?
Causes of class imbalance
Major causes of data imbalance include:
Faulty data collection
Data collection is an important part of data processing. In fact, it tops the list because every other step depends on the availability of data. If collection is faulty, the error carries through every later step and ultimately produces models with low performance.
Peculiarity of the domain
Some domains are inherently imbalanced; examples include fraud data and churn datasets. As is the case with XYZ, little can be done in the data collection process to improve the balance.
The number of fraudulent transactions is very small compared to non-fraudulent ones, and one cannot simply go out and collect more fraudulent transactions.
Here is another familiar example: if we took all of the more than 7 billion people on Earth and classified them by whether or not they had been diagnosed with COVID, the result would be a highly imbalanced dataset. But as we will soon see, there are a number of techniques that help address this problem.
6 Ways to Handle Class Imbalance
1. Collect More Data
This is probably the most obvious tactic and the first thing to try. The data collection or sampling process may have been flawed, ignoring data points that belong to the smaller class.
Try looking beyond your current data source to find out whether more data exists. This is especially worth considering if you are using publicly available data.
2. Change How You Measure Performance
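People tend to look at accuracy when measuring the performance of a model. On imbalanced data, however, accuracy is not a good measure because it favours the larger class: on the XYZ dataset, a model that labels every transaction as non-fraud would score 92% accuracy without catching a single fraudulent transaction. We advise you to look into other metrics like: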
- F1 score
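To make the difference concrete, here is a minimal sketch that computes accuracy and the F1 score by hand for a naive model that always predicts "non-fraud". The labels are hypothetical and follow the 8% / 92% split from the case study. The helper names are illustrative, and fraud (label 1) is assumed to be the positive class.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical ground truth: 8 fraud records (1) and 92 non-fraud records (0).
y_true = [1] * 8 + [0] * 92

# A naive model that never predicts fraud.
naive_pred = [0] * 100

naive_accuracy = accuracy(y_true, naive_pred)
naive_f1 = f1_score(y_true, naive_pred)

print(naive_accuracy)  # 0.92
print(naive_f1)        # 0.0
```

The accuracy looks respectable, while the F1 score of zero shows that the model never identifies the class we actually care about.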
3. Generate Data
In fields such as Computer Vision, Cyber Security, and Natural Language Processing where Machine Learning can be applied, it is not uncommon to find models built for the sole purpose of generating samples.
These samples can then be used to train prediction models for better performance. Generative Adversarial Networks (GANs) are a common example of deep learning-based generators.
4. Over-Sample / Under-Sample the Dataset
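Resampling is another option. Over-sampling means duplicating records in the smaller, under-represented class, while under-sampling means removing records from the larger class. Either approach can bring the dataset to a point where both classes are equally represented. Keep in mind that under-sampling discards data, and over-sampling by duplication can encourage the model to overfit the repeated records.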
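Both ideas can be sketched with nothing more than Python's `random` module. The records below are hypothetical placeholders, and the two function names are illustrative rather than taken from any library. The sketch assumes two classes, with the first list being the larger one.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority records until both classes match in size."""
    rng = random.Random(seed)
    shortfall = len(majority) - len(minority)
    extra = [rng.choice(minority) for _ in range(shortfall)]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority records equal in size to the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Hypothetical records: 92 non-fraud and 8 fraud transactions.
non_fraud = [("non_fraud", i) for i in range(92)]
fraud = [("fraud", i) for i in range(8)]

over_major, over_minor = random_oversample(non_fraud, fraud)
under_major, under_minor = random_undersample(non_fraud, fraud)

print(len(over_major), len(over_minor))    # 92 92
print(len(under_major), len(under_minor))  # 8 8
```

In practice, resample only the training split. Resampling before splitting can place copies of the same record in both the training and test sets, which inflates the measured performance.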
5. Use Penalized Models
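You could also penalize the model. In this context, that means punishing it more heavily for a wrong prediction on the smaller class than for one on the larger class.
This is usually called cost-sensitive learning or class weighting. It should not be confused with regularization: Lasso regression (L1) and Ridge regression (L2) are also described as penalized models, but they penalize large coefficients rather than errors on a particular class, so on their own they do not address imbalance.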
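One way to punish mistakes on the smaller class more heavily is to weight each record's loss by its class. The sketch below is an illustration under stated assumptions, not a full training loop: it uses the common "balanced" heuristic, n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`, and it applies those weights to a log loss. The labels, probabilities, and function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, y_prob, weights):
    """Weighted mean log loss; y_prob holds the predicted probability of class 1."""
    eps = 1e-12
    total = 0.0
    weight_sum = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # keep log() away from 0
        loss = -math.log(p) if t == 1 else -math.log(1 - p)
        total += weights[t] * loss
        weight_sum += weights[t]
    return total / weight_sum


# Hypothetical labels: 8 fraud records (1) and 92 non-fraud records (0).
labels = [1] * 8 + [0] * 92
weights = balanced_class_weights(labels)

# A model that assigns a low fraud probability to every record,
# so it is wrong on all 8 fraud records.
probs = [0.1] * 100

unweighted = weighted_log_loss(labels, probs, {0: 1.0, 1: 1.0})
weighted = weighted_log_loss(labels, probs, weights)

print(weights)
print(unweighted, weighted)
```

With the class weights applied, the same set of missed fraud records produces a much larger loss, so a model trained against the weighted loss has a stronger incentive to get the smaller class right.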
6. Try Different Algorithms
Sometimes the algorithm you are using is simply not well suited to the problem. Decision trees and their variants often perform well on imbalanced data, so they are worth considering. Coupled with the penalization described above, they give your model a better chance of performing well.
So there you have it: we’ve taken a look at class imbalance in datasets, its causes and effects, and several strategies for handling it.
It’s important to think outside the box when dealing with abnormalities in data.
We hope the next time you build your model, you’re able to improve performance by using some of the methods above.