Working with Imbalanced Datasets

So you have been doing some deep learning, training some models using TensorFlow, PyTorch, or whatever library you are fond of. You feel like you are getting a grip on this thing and think it could turn into a career. Then comes the first professional assignment. It could be a freelance project you take up or something your company assigns to you, and boom, you feel like a person on a raft out at sea with nothing to guide you.

What major problems do people face when working with real datasets?

Well, when you work with a dataset you built yourself, or one that isn’t part of the ready-made dataset pipelines out there, there are several problems you may face.

How do I clean this dataset?

What preprocessing steps should I perform?

Dealing with missing and misformatted data.

Data inconsistency.

And many more.

You deal with them and tackle them to the best of your capabilities. Your model trains well and comes up with really attractive-looking graphs and that 95 percent test accuracy. You feel proud of a job well done. But then you deploy it in the real world and a completely different picture shows up. You get skewed results: the model keeps predicting one particular class however hard you tinker with the inputs. This happened to me twice in the last month.

This is a small article, a memoir of sorts, about how I dealt with this problem.

Well, one thing I was sure of was that the model wasn’t the issue: whatever I tried, the problem never seemed to go away. For some reason I decided to take a look at the class distribution, and there was the problem. I had hoped to find a decently balanced dataset, with classes that have a similar number of samples. Something like a balanced seesaw. But what I saw looked something like this.

The truth is this will be a very common problem if you venture into the unknown and try things out with uncommon datasets. And the bad part is that, if left untreated, it leads to a biased model, one that tends to favor the overrepresented (majority) class. Two datasets that I can say have this issue are the LCC FASD dataset for face anti-spoofing and the LIDC-IDRI dataset for lung cancer. One had a large quantity of fake images, and the other an excess of non-cancer images, with ratios as bad as 1:6, meaning six images of one class for every image of the other. These two problems are very real, and I tackled them very differently.
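Checking the class distribution takes only a few lines, and it is worth doing before any training. The snippet below is a minimal sketch using only the Python standard library; the label names and counts are made up for illustration and are not the real dataset numbers.

```python
from collections import Counter

def class_distribution(labels):
    """Return {label: (count, fraction of the dataset)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

def imbalance_ratio(labels):
    """Ratio of the largest class count to the smallest class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())

# Hypothetical toy labels: six "spoof" samples for every "real" one.
labels = ["spoof"] * 600 + ["real"] * 100

dist = class_distribution(labels)
for label, (n, frac) in dist.items():
    print(f"{label}: {n} samples ({frac:.1%})")
print(f"imbalance ratio: {imbalance_ratio(labels):.1f}")
```

A ratio close to 1 is the balanced seesaw; anything far above it is a hint that plain accuracy may be hiding a model that mostly predicts the majority class.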

Let’s see how.


LIDC-IDRI is a lung cancer dataset of CT scans of patients. Obviously, the number of scan slices that may contain cancerous cells is much smaller than the number that do not. The distribution there looked something like this:

The approach I took here is called random undersampling: randomly pick samples from each class so that every class ends up with the same number of samples, which in practice means discarding samples from the larger class. This is a simple technique, easy to understand and implement, and an easy fix if you don’t mind losing a few samples. Whether you can afford to lose samples is a matter of intuition and changes from dataset to dataset. Here we can afford to lose a few non-cancer samples because what we are looking for is the presence or absence of cancerous cells.
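Random undersampling can be sketched in plain Python as follows. This is an illustration under assumptions, not the exact code used for the project: the function name `random_undersample`, the file names, and the class counts are all hypothetical.

```python
import random
from collections import defaultdict

def random_undersample(samples, labels, seed=0):
    """Randomly drop samples so that every class keeps as many
    samples as the smallest class has."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group the samples by their class label.
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)

    # Every class is cut down to the size of the smallest class.
    n_keep = min(len(items) for items in by_class.values())

    balanced = []
    for label, items in by_class.items():
        for sample in rng.sample(items, n_keep):  # sampling without replacement
            balanced.append((sample, label))

    rng.shuffle(balanced)  # avoid having all samples of one class in a row
    new_samples = [s for s, _ in balanced]
    new_labels = [l for _, l in balanced]
    return new_samples, new_labels

# Hypothetical file names and counts, for illustration only.
samples = [f"slice_{i:04d}.png" for i in range(700)]
labels = ["non_cancer"] * 600 + ["cancer"] * 100

bal_samples, bal_labels = random_undersample(samples, labels)
print(len(bal_samples), bal_labels.count("cancer"), bal_labels.count("non_cancer"))
```

One design choice worth keeping in mind: it is usually safer to undersample only the training split, so that validation and test sets still reflect the distribution the model will meet after deployment.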


LCC FASD is an anti-spoofing dataset for facial recognition. It contains images of people along with samples of presentation attacks, which basically means showing images of those people to different cameras in the hope of fooling the recognition system.

The distribution here was even more bizarre.

And as you can see, the problem is evident: a ratio of 1:6.3 in the training set. The earlier method of random undersampling could be used, yes, but if we understand the nature of this dataset and go through the images themselves, we find that spoofing images are very precious. We cannot afford to lose samples here, as each of them contains artifacts that are created when a presentation attack is performed. Losing samples means losing that important information, because a major way the network learns to differentiate between the two classes is the presence or absence of these artifacts. What we can do instead is use something called class weights.

We calculate the ratio of samples between the classes in the dataset, and when we perform backpropagation and update the model’s weights, we scale the contribution of each sample’s loss by the class weight of its class.
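One common way to turn class counts into class weights is the inverse-frequency heuristic `n_samples / (n_classes * class_count)`. The sketch below assumes that heuristic; it is not necessarily the exact formula used in this project, and the label names and counts are hypothetical numbers chosen to give a 1:6.3 ratio.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights:
    weight[c] = n_samples / (n_classes * count[c]).
    Rare classes get weights above 1, common classes below 1."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * n) for label, n in counts.items()}

# Hypothetical training labels with a 1:6.3 ratio between the classes.
labels = ["real"] * 1000 + ["spoof"] * 6300

weights = balanced_class_weights(labels)
for label, w in weights.items():
    print(f"{label}: weight {w:.3f}")
```

With these weights, the ratio between the two class weights equals the ratio between the class counts, so each class contributes roughly the same total weight to the loss.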

The updates for the class with fewer samples are given more importance, in the hope that the model, in a sense, experiences the same volume of data for all classes. In other words, we give more weight to the minority class in the cost function so that errors on minority-class samples incur a higher penalty and the algorithm focuses on reducing them. This technique is risky: any error that happens during a weight update is multiplied by the class weight itself, which can make training unstable. Hence this technique should be used with caution.
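To make the "higher penalty" concrete, here is a minimal, framework-free sketch of a class-weighted cross-entropy loss. It assumes the model outputs one probability per class for each sample, and it uses a plain mean over samples; the function name and the weight values are illustrative, not taken from the project.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights, eps=1e-12):
    """Mean over samples of -class_weights[y] * log(p[y]),
    where p[y] is the predicted probability of the true class y."""
    total = 0.0
    for p, y in zip(probs, targets):
        p_true = max(p[y], eps)  # clamp to avoid log(0)
        total += -class_weights[y] * math.log(p_true)
    return total / len(targets)

# Class 0 is the minority class and class 1 the majority class.
# The weights are hypothetical values for a roughly 1:6.3 dataset.
class_weights = [3.65, 0.58]

# The same uncertain prediction, once for a minority-class sample
# and once for a majority-class sample.
uncertain = [0.5, 0.5]
loss_minority = weighted_cross_entropy([uncertain], [0], class_weights)
loss_majority = weighted_cross_entropy([uncertain], [1], class_weights)
print(f"minority loss: {loss_minority:.3f}, majority loss: {loss_majority:.3f}")
```

The identical mistake costs several times more on the minority class, which is exactly the lever that can also amplify noisy updates. In practice you would rarely write this by hand: PyTorch's `torch.nn.CrossEntropyLoss` accepts a `weight` tensor and Keras' `Model.fit` accepts a `class_weight` dictionary, though their exact normalization may differ from the plain mean used here.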


And these are the two techniques I used to balance my datasets during my deep learning shenanigans. There are a lot more out there. Maybe I will cover them too, or maybe you, the reader, have used them already. If that’s the case, do let me know.

That’s about it for this one. Have a good one.


