How to Address Data Bias in Machine Learning

Understanding what data bias is, and taking the right steps to prevent it, is an essential skill in data science.

Suppose a company has invested heavily in machine learning to grow its business. If you are the person who cleans and prepares the data and builds the predictive models, there is one more factor to consider before deploying those models to production: the ethical implications of AI, and in particular how biased the models' predictions are. The fundamental question, therefore, is what bias in machine learning actually is. Let us work through this topic so that we can build models that are not only robust but also ethically sound.

What is Data Bias?

Whenever we feed data to an ML model, the quality of that data, and how strongly the features relate to the outcome, largely determines the quality of the predictive model. If none of the features are meaningfully related to the outcome, no model, from the simplest to the most complex, is likely to succeed. The most important thing for a data scientist or machine learning engineer to consider is therefore the quality of the data given to the models.

This brings us close to a definition of data bias. Because a model can only learn from the data it is given, the quality of the training data largely determines how well the model performs on test (unseen) data. If the training data contains far more examples for some classes or groups than for others, the model is likely to perform better on those well-represented groups than on the rest.

To put this in context, imagine we are trying to predict whether a person will default on a loan based on features such as region, gender, race, social status, income, and job. Some categories of these features may be much better represented in the data than others. Take ‘gender’: if a large majority of the borrowers in the data are male, the model learns a great deal about whether male borrowers repay their loans and comparatively little about female borrowers. A similar argument can be made for race. The model may then associate being ‘male’ with a lower (or higher) chance of default simply because of who is in the data. If we judge the model by an aggregate metric such as accuracy, we can get an inflated picture of its performance even though it does poorly on the minority group. This is what we call data bias: certain categories of a feature are overrepresented in the data compared to others.
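A quick way to spot this kind of overrepresentation is to look at how the rows are split across the categories of a feature. The snippet below is a minimal sketch using only the Python standard library. The `loans` records, the helper names, and the 30% threshold are all made up for illustration and are not taken from any real dataset or library.

```python
from collections import Counter


def group_shares(records, feature):
    """Return each category's share of the rows for one feature."""
    counts = Counter(row[feature] for row in records)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def underrepresented(shares, min_share=0.3):
    """List categories whose share falls below a chosen threshold.

    The default threshold is an arbitrary assumption, not a standard value.
    """
    return sorted(c for c, s in shares.items() if s < min_share)


# Hypothetical loan applications: 8 male rows and 2 female rows.
loans = [{"gender": "male"}] * 8 + [{"gender": "female"}] * 2

shares = group_shares(loans, "gender")
print(shares)                    # {'male': 0.8, 'female': 0.2}
print(underrepresented(shares))  # ['female']
```

What counts as "too small" a share depends on the problem, so the threshold is something to agree on with your team.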

Although bias is often unintentional, its consequences can be serious for the groups of people affected. Consider Amazon’s hiring algorithm, which reportedly screened out women candidates systematically. Similarly, Microsoft’s Twitter bot was accused of being racist because of the content it generated. If a model’s results turn out to be skewed against a specific group of people, users are likely to lose trust in it and stop using it.

The conclusion is clear: we should take steps to reduce data bias so that people can place more trust in our models.

How to Overcome Data Bias in Machine Learning

There are many ways to reduce data bias with the right strategies and tools. The first step is to find out where exactly the bias arises. Once we know which areas to focus on, we can decide which actions will reduce it the most. Let us now go over several of these approaches.

Determining the Right Machine Learning Model

Whether the task is supervised or unsupervised, a model can introduce bias of its own through the relationships it learns between the inputs and the outputs. Changing the model, or tuning a few of its hyperparameters, can sometimes lead to a model whose performance is more consistent across all categories.

Imagine that your team asks you to build a model that predicts whether a person is at higher risk of cancer based on characteristics such as age, gender, blood pressure, and many others. Some models may learn to give features such as gender and age far more weight than the other factors. In that case the model will report a higher cancer risk for a specific gender or age group, and it is biased in the sense that it is predicting the outcome largely from gender or age alone. One of the best ways to combat this is to use tools that make the model interpretable. If we know why the model made a decision, we can judge whether that decision is biased. LIME (Local Interpretable Model-Agnostic Explanations) can help explain why the model gave a particular prediction, which in turn helps the doctors who rely on it. SHAP (SHapley Additive exPlanations), which is based on Shapley values, can be used for interpretability as well. It is good practice to share with your team why the models are producing particular decisions or outcomes.
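To make the idea of "finding out what the model relies on" concrete, here is a rough sketch of permutation importance: shuffle one feature at a time and see how much the accuracy drops. Note that this is not LIME or SHAP. It is a simpler stand-in written with the standard library so that it runs without extra packages. The `toy_model` function, the `patients` records, and the labels are hypothetical and exist only to show the mechanics.

```python
import random


def accuracy(model, rows, labels):
    """Fraction of rows where the model's prediction matches the label."""
    hits = sum(model(row) == label for row, label in zip(rows, labels))
    return hits / len(labels)


def permutation_importance(model, rows, labels, feature, repeats=20, seed=0):
    """Average drop in accuracy when one feature's column is shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        shuffled = [{**row, feature: value} for row, value in zip(rows, column)]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / repeats


# Hypothetical stand-in for a trained classifier: it looks only at gender.
def toy_model(row):
    return 1 if row["gender"] == "male" else 0


# Hypothetical patient records; label 1 = high risk, 0 = low risk.
patients = [
    {"gender": "male", "age": 62, "blood_pressure": 150},
    {"gender": "male", "age": 35, "blood_pressure": 118},
    {"gender": "male", "age": 48, "blood_pressure": 135},
    {"gender": "male", "age": 71, "blood_pressure": 160},
    {"gender": "female", "age": 66, "blood_pressure": 155},
    {"gender": "female", "age": 29, "blood_pressure": 110},
    {"gender": "female", "age": 53, "blood_pressure": 140},
    {"gender": "female", "age": 44, "blood_pressure": 125},
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

importances = {
    name: permutation_importance(toy_model, patients, labels, name)
    for name in ("gender", "age", "blood_pressure")
}
print(importances)
```

Here shuffling age or blood pressure changes nothing, while shuffling gender hurts accuracy. A pattern like that is a signal to investigate further with a proper interpretability tool.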

Giving Proper Documentation for Data Used

Documenting the data helps in two ways. First, it helps us understand the features in our model and their impact on the outcome. Second, it lets us identify bias by looking at how the data is distributed. Proper documentation also ensures that others who use the data understand which features are influential in the model’s predictions and which groups are present or overrepresented in the data.

Therefore, when we build a machine learning solution, it is handy to have documentation of the features provided. Consider the example of predicting whether a given text is positive or negative, where the features are derived with natural language processing (NLP). Documentation of the data and of the features used to predict sentiment makes problems easier to spot. If there is an overwhelming number of positive texts and only a few negative ones, models will struggle when they encounter a negative review. Documentation of the features also gives us a solid understanding of the data and of how influential each feature is in the model’s predictions. All of this depends on the documentation being available to the members of your team, so that they can read and understand it fully before using machine learning models for prediction.
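Documentation does not have to be elaborate to be useful. As one possible starting point, the sketch below keeps feature notes and label counts together in a small record that can be printed and shared. The `DatasetCard` class, its fields, and the review counts are hypothetical names and numbers chosen for this example, not an established format.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class DatasetCard:
    """A tiny, hypothetical record describing a dataset."""
    name: str
    feature_notes: dict
    label_counts: dict = field(default_factory=dict)

    def record_labels(self, labels):
        """Store how many examples each label has."""
        self.label_counts = dict(Counter(labels))

    def imbalance_ratio(self):
        """Largest class size divided by smallest class size."""
        counts = self.label_counts.values()
        return max(counts) / min(counts)

    def to_text(self):
        """Render the card as plain text for sharing with the team."""
        lines = [f"Dataset: {self.name}", "Features:"]
        lines += [f"  {n}: {note}" for n, note in self.feature_notes.items()]
        lines.append("Label counts:")
        lines += [f"  {label}: {count}"
                  for label, count in sorted(self.label_counts.items())]
        return "\n".join(lines)


# Hypothetical sentiment dataset: 90 positive reviews, 10 negative ones.
review_labels = ["positive"] * 90 + ["negative"] * 10

card = DatasetCard(
    name="product-reviews (illustrative)",
    feature_notes={"text": "raw review text, tokenized before modeling"},
)
card.record_labels(review_labels)

print(card.to_text())
print("Imbalance ratio:", card.imbalance_ratio())
```

Seeing a ratio of 9 to 1 written down next to the feature notes makes the shortage of negative reviews hard to miss.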

Evaluate Model Performance for Various Categories

Models built for production often perform well for only some of the categories in our data. For protected features such as age, gender, and race, a model can perform well for certain groups and worse for others. To combat this bias, we have to ensure that the model performs well on all categories and not just one. We must therefore measure performance for each subgroup and check whether it is consistent across all groups. This gives us a good chance of detecting and reducing bias in the models.
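Measuring performance per subgroup can be as simple as computing the same metric once for each group. The following is a minimal sketch with made-up labels, predictions, and group memberships. The function name `accuracy_by_group` is also just an illustrative choice.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group value."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        totals[group] += 1
        hits[group] += int(truth == pred)
    return {group: hits[group] / totals[group] for group in totals}


# Hypothetical test set: 8 male rows followed by 2 female rows.
groups = ["male"] * 8 + ["female"] * 2
y_true = [0, 0, 1, 0, 1, 0, 0, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # both female rows are wrong

overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
per_group = accuracy_by_group(y_true, y_pred, groups)

print(overall)    # 0.8
print(per_group)  # {'male': 1.0, 'female': 0.0}
```

An overall accuracy of 80% looks acceptable here, yet every prediction for the smaller group is wrong, which is exactly what an aggregate score can hide.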

Consider a situation where your model performs quite well on test (unseen) data when predicting whether an email is spam or ham (legitimate mail). In real life, most of our emails are ham and only a few are spam. The training data is therefore imbalanced, with far more ham emails than spam emails. In this case it is useful to evaluate performance for each class individually instead of relying on a single score over the whole dataset. Doing so reveals, and helps us reduce, bias towards the majority class.
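For the spam example, per-class evaluation usually means looking at precision and recall for each class rather than accuracy alone. Below is a small sketch with invented labels and predictions. It is written in plain Python so that the arithmetic stays visible, and the helper name `precision_recall` is an assumption for this example.

```python
def precision_recall(y_true, y_pred, positive):
    """Precision and recall when `positive` is treated as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test set: 18 ham emails and 2 spam emails.
y_true = ["ham"] * 18 + ["spam"] * 2
# The model labels almost everything as ham and catches only one spam email.
y_pred = ["ham"] * 19 + ["spam"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
spam_precision, spam_recall = precision_recall(y_true, y_pred, "spam")
ham_precision, ham_recall = precision_recall(y_true, y_pred, "ham")

print(accuracy)                     # 0.95
print(spam_precision, spam_recall)  # 1.0 0.5
print(ham_precision, ham_recall)
```

The model is 95% accurate overall but finds only half of the spam, which the per-class recall makes obvious.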

Spread More Awareness

While many qualified data scientists are tackling some of the most complex problems in companies, some of them do not give much weight to the ethical aspects of artificial intelligence. Spreading awareness of bias in machine learning is useful, especially when it leads to action to combat it. It would also help if online courses included more content on the ethical side of machine learning.

Recent news shows that companies such as Google and Microsoft are taking steps to spread awareness about the ethics of AI. Other organizations can likewise take action and make people more aware of data bias and the implications it can have for protected classes. When organizations take the right steps and are more transparent about model predictions, more people can trust these black-box models and use them in their future work.


All in all, we’ve seen that the data used to train machine learning models can be biased, and that this bias carries over into their predictions. Taking the right steps to reduce it matters, especially when we acknowledge the ethical side of artificial intelligence. The steps covered here are to choose the right ML model, document the data properly, evaluate model performance across categories, and spread more awareness of bias. Thank you for taking the time to read this article. Feel free to share your thoughts as well.

Below are the ways you can contact me or take a look at my work.

GitHub: suhasmaddali (Suhas Maddali)

LinkedIn: Suhas Maddali, Northeastern University, Data Science

Medium: Suhas Maddali
