# SMOTE

In this article, you’ll learn everything that you need to know about SMOTE. SMOTE is a machine learning technique that solves problems that occur when using an imbalanced data set. Imbalanced data sets often occur in practice, and it is crucial to master the tools needed to work with this type of data.

# SMOTE: a powerful solution for imbalanced data

SMOTE stands for Synthetic Minority Oversampling Technique. The method was proposed in a 2002 paper in the Journal of Artificial Intelligence Research. SMOTE improves on plain oversampling as a way of dealing with imbalanced data in classification problems.

# When to use SMOTE?

To get started, let’s review what imbalanced data exactly is and when it occurs.

Imbalanced data is data in which observed frequencies are very different across the different possible values of a categorical variable. Basically, there are many observations of some type and very few of another type.

SMOTE is a solution when you have imbalanced data.

As an example, imagine a data set about sales of a new product for mountain sports. For simplicity, let’s say that the website sells to two types of clients: skiers and climbers.

For each visitor, we also record whether the visitor buys the new mountain product. Imagine that we want to make a classification model that allows us to use customer data to make a prediction of whether the visitor will buy the new product.

Most e-commerce shoppers do not buy: many come just to look at products, and only a small percentage of visitors actually buy something. Our data set will be imbalanced, because we have a huge number of non-buyers and a very small number of buyers.

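Our example situation looks as follows: 30 visitors came to the website, of which 20 were skiers (1 of them bought) and 10 were climbers (1 of them bought).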

# Why is imbalanced data a problem?

In the example, we have 30 website visits: 20 of them are skiers and 10 are climbers. The goal is to build a machine learning model that can predict whether a visitor will buy.

This example has only 1 independent variable: whether the visitor is a skier or a climber. As a thought experiment, let’s consider two very simple models:

• a model that uses the variable “skier vs climber”
• a model that does not use the variable “skier vs climber”

I want to avoid going in-depth into different machine learning algorithms here, but let’s just see from a logical analysis whether it is useful to use the independent variable for predicting buyers.

10% of climbers buy (1 out of 10), whereas only 5% of skiers buy (1 out of 20). Based on this data, we could say that climbers are more likely to buy than skiers. However, this does not help the model in deciding to predict “buy” or “not buy” for a visitor.

Accuracy is a bad machine learning metric when working with imbalanced data.

To split the 30 people into buyers/non-buyers, the only thing that a model could really do here is to predict “not buy” for everyone. Skiers are more likely not to buy than to buy. Climbers are also more likely not to buy. Predicting “not buy” for everyone is the only option here.

The tricky thing here is that the model predicting “not buy” for everyone is correct in 28 out of 30 cases. This converts to an accuracy of 28 out of 30, which is 93%! Using imbalanced data, we have just made a model that appears very accurate, while it actually is useless!
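The arithmetic above can be checked with a few lines of plain Python. This is only a sketch of the thought experiment: the `visits` list and the `predict_majority` function are illustrative names, not part of any library.

```python
# The 30 visits from the example: 20 skiers (1 buyer) and 10 climbers (1 buyer).
visits = (
    [("skier", "buy")] * 1 + [("skier", "not buy")] * 19
    + [("climber", "buy")] * 1 + [("climber", "not buy")] * 9
)


def predict_majority(client_type):
    """A 'model' that ignores its input and always predicts the majority class."""
    return "not buy"


correct = sum(predict_majority(client) == outcome for client, outcome in visits)
accuracy = correct / len(visits)

# Count how many of the real buyers the model manages to flag.
buyers_found = sum(
    predict_majority(client) == "buy"
    for client, outcome in visits
    if outcome == "buy"
)

print(f"correct predictions: {correct} of {len(visits)}")
print(f"accuracy: {accuracy:.1%}")
print(f"buyers found: {buyers_found}")
```

The accuracy looks excellent, yet the model finds none of the buyers, which is exactly the group we care about.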

# Undersampling

Before diving into the details of SMOTE, let’s first look into a few simple and intuitive methods to counteract class imbalance!

The most straightforward method to counteract class imbalance is undersampling. Undersampling means that you discard a number of data points of the class that is present too often.

The disadvantage of undersampling is that you lose a lot of valuable data.

For the mountain website example, we had two options: “buy” and “not buy”. We had 28 non-buyers and 2 buyers. If we undersampled, we would randomly delete a large number of non-buyers from our data set.

The advantage of undersampling is that it is a very straightforward technique to reduce class imbalance. However, it is a huge disadvantage that we need to delete a large amount of data.

In the presented example, undersampling is definitely not a good idea, because we would end up with almost no data. Undersampling might be effective when there is a lot of data, and the class imbalance is not so large. In an example with 40% buyers and 60% non-buyers, undersampling would not delete so much data, and it might therefore be effective.
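As a minimal sketch of random undersampling, the function below keeps a random subset of every class so that all classes end up as small as the smallest one. The name `random_undersample` and the dictionary-based rows are assumptions made for this illustration.

```python
import random


def random_undersample(rows, label_of, seed=0):
    """Randomly drop rows until every class is as small as the smallest class."""
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)

    # Every class is cut down to the size of the rarest class.
    target = min(len(group) for group in by_class.values())

    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)
    return balanced


# The mountain sports example: 28 non-buyers and 2 buyers.
rows = [{"buy": 0} for _ in range(28)] + [{"buy": 1} for _ in range(2)]
balanced = random_undersample(rows, label_of=lambda row: row["buy"])

print(len(balanced))                         # rows that remain
print(sum(row["buy"] for row in balanced))   # buyers among them
```

On the example data, only 4 of the 30 rows survive, which shows why undersampling is a poor fit when the minority class is tiny.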

# Oversampling

Another simple solution to imbalanced data is oversampling. Oversampling is the opposite of undersampling. Oversampling means making duplicates of the data that is the least present in your data set. You then add those duplicates to your data set.

Let’s apply this to the example of the mountain sports website. We had 2 buyers of our product against 28 non-buyers. If we oversampled, we could copy each of the 2 buyers until there are 14 copies of each and obtain a data set with 28 buyers and 28 non-buyers.

The disadvantage of oversampling is that it creates many duplicate data points.

The advantage of this is that you do not have to delete data points, so you do not delete any valuable information. On the other hand, you are creating data that is not real, so you may be introducing false information into your model.

Clearly, in our mountain sports example, we do not have enough data points to even think about oversampling. We would end up with many identical data points, and this would definitely be problematic for any machine learning algorithm.

However, in less extreme cases, applying random oversampling may actually perform well. When doing this, it is important to assess the predictive performance of your machine learning model on a non-oversampled data set. After all, your out-of-sample predictions will be done on non-oversampled data and therefore this is how you should measure your model’s performance.
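Random oversampling can be sketched in plain Python as follows. The function name `random_oversample` and the tuple-based rows are assumptions for this illustration. In a real project you would apply it to the training portion only, so that the evaluation data stays non-oversampled.

```python
import random
from collections import Counter


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    new_rows, new_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [row for row, label in zip(rows, labels) if label == cls]
        # Add exact copies until this class reaches the target size.
        for _ in range(target - count):
            new_rows.append(rng.choice(pool))
            new_labels.append(cls)
    return new_rows, new_labels


# The mountain sports example: 2 buyers (label 1) and 28 non-buyers (label 0).
rows = [("skier",), ("climber",)] + [("skier",)] * 19 + [("climber",)] * 9
labels = [1, 1] + [0] * 28

new_rows, new_labels = random_oversample(rows, labels)
print(Counter(new_labels))
```

The classes are now balanced, but every added buyer is an exact copy of one of the two original buyers.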

# Data Augmentation

Data Augmentation is a method that works much like oversampling. Yet Data Augmentation adds a twist: rather than making exact duplicates of observations in the less present class, you will add small perturbations to the copied data points.

The small perturbations depend on the type of data that you have. The method is often used for image processing models like object detection or image segmentation, in which you can simply twist, turn and stretch the input images to obtain similar yet different images.

In tabular data, you could think about adding small random noise to the values so that they are slightly different from the original. You can also create synthetic data based on the original data.
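For tabular data, the noise idea could look roughly like the sketch below. The feature values and the `jitter` helper are hypothetical, and the noise scale is an arbitrary choice that would need tuning per column in practice.

```python
import random


def jitter(point, scale, rng):
    """Return a copy of a numeric row with small Gaussian noise added to each value."""
    return [value + rng.gauss(0.0, scale) for value in point]


rng = random.Random(0)

# Two hypothetical minority-class rows with two numeric features each.
minority = [[34.0, 2.5], [41.0, 3.1]]

# Create 5 perturbed copies of randomly chosen minority rows.
augmented = [jitter(rng.choice(minority), scale=0.05, rng=rng) for _ in range(5)]

for row in augmented:
    print([round(value, 3) for value in row])
```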

# The SMOTE algorithm

SMOTE is an algorithm that performs data augmentation by creating synthetic data points based on the original data points. SMOTE can be seen as an advanced version of oversampling, or as a specific algorithm for data augmentation. The advantage of SMOTE is that you are not generating duplicates, but rather creating synthetic data points that are slightly different from the original data points.

SMOTE is an improved alternative for oversampling

The SMOTE algorithm works as follows:

• You draw a random sample from the minority class.
• For the observations in this sample, you will identify the k nearest neighbors.
• You will then take one of those neighbors and identify the vector between the current data point and the selected neighbor.
• You multiply the vector by a random number between 0 and 1.
• To obtain the synthetic data point, you add this scaled vector to the current data point.

This operation is actually very much like slightly moving the data point in the direction of its neighbor. This way, you make sure that your synthetic data point is not an exact copy of an existing data point while making sure that it is also not too different from the known observations in your minority class.
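The steps above can be sketched in plain Python. This is a simplified illustration rather than a reference implementation: `smote_sketch` is a hypothetical name, distances are plain Euclidean, and features are assumed to be numeric.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Create synthetic points by interpolating between minority points and their neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority points to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)

    synthetic = []
    for _ in range(n_synthetic):
        # 1. Draw a random minority observation.
        i = rng.randrange(len(minority))
        point = minority[i]

        # 2. Find its k nearest neighbors within the minority class.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(point, minority[j]))

        # 3. Pick one of those neighbors at random.
        neighbor = minority[rng.choice(others[:k])]

        # 4. Scale the vector towards the neighbor by a random number between 0 and 1.
        gap = rng.random()

        # 5. Add the scaled vector to the current point.
        synthetic.append([p + gap * (q - p) for p, q in zip(point, neighbor)])
    return synthetic


# Four minority points on the corners of a unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(minority, n_synthetic=10, k=2)
for p in new_points:
    print([round(c, 2) for c in p])
```

Because each synthetic point lies on the line segment between two existing minority points, it can never fall outside the region spanned by the minority class.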

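For more details on the algorithm, you can check out the paper that introduced SMOTE: Chawla, Bowyer, Hall, and Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique”, Journal of Artificial Intelligence Research, 2002.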

## SMOTE influences precision vs. recall

In the previously presented mountain sports example, we have looked at the overall accuracy of the model. Accuracy measures the percentage of predictions that you got right. In classification problems, we generally want to go a bit further than that and take into account predictive performance for each class.

In binary classification, the confusion matrix is a machine learning metric that shows the number of:

• true positives (the model correctly predicted true)
• false positives (the model incorrectly predicted true)
• true negatives (the model correctly predicted false)
• false negatives (the model incorrectly predicted false)

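In this context, we also talk about precision vs. recall. Precision measures how well a model succeeds in identifying ONLY positive cases: of all the cases predicted as positive, the share that really is positive, or TP / (TP + FP). Recall measures how well a model succeeds in identifying ALL the positive cases within the data: of all the real positives, the share that the model finds, or TP / (TP + FN).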

True positives and true negatives are both correct predictions: having many of those is the ideal situation. False positives and false negatives are both wrong predictions: having few of them is the ideal case as well. Yet in many cases, we may prefer having false positives rather than having false negatives.

When machine learning is used for automating business processes, false negatives (positives that are predicted as negative) will not show up anywhere and will probably never be detected, whereas false positives (negatives that are wrongly predicted as positive) will generally be filtered out quite easily in later manual checks that many businesses have in place.

In many business cases, false positives are less problematic than false negatives.

An obvious example would be testing for the coronavirus. Imagine that sick people take a test and they obtain a false negative: they will go out and infect other people. On the other hand, if they receive a false positive, they will be obliged to stay home: not ideal, but at least they do not form a public health hazard.

When we have a strong class imbalance, we have very few cases in one class, resulting in the model hardly ever predicting that class. Using SMOTE we can tweak the model to reduce false negatives, at the cost of increasing false positives. The result of using SMOTE is generally an increase in recall, at the cost of lower precision. This means that we will add more predictions of the minority class: some of them correct (increasing recall), but some of them wrong (decreasing precision).

SMOTE increases recall at the cost of lower precision

For example, a model that predicts buyers all the time will be good in terms of recall, as it did identify all the positive cases. Yet it will be bad in terms of precision. The overall model accuracy may also decrease, but this is not a problem: accuracy should not be used as a metric in case of imbalanced data.
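To make this concrete, here is a small sketch that computes precision and recall for two extreme models on the 30-visit mountain sports example (2 buyers, 28 non-buyers). The helper `precision_recall` is an illustrative name. Returning 0.0 when a ratio is undefined is a convention chosen for this sketch.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    # By convention in this sketch, an undefined ratio (division by zero) is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Model that predicts "buy" for everyone: both buyers are found,
# but all 28 non-buyers are wrongly flagged as buyers.
p_all, r_all = precision_recall(tp=2, fp=28, fn=0)

# Model that predicts "not buy" for everyone: no buyer is found.
p_none, r_none = precision_recall(tp=0, fp=0, fn=2)

print(f"always 'buy':     precision={p_all:.2f}, recall={r_all:.2f}")
print(f"always 'not buy': precision={p_none:.2f}, recall={r_none:.2f}")
```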

Let’s now move on to the code implementation of SMOTE in Python.

# SMOTE in Python

The data set that we’ll use in this example is a simulated data set that is a bit similar to the example that was used earlier on. The following code will import the data into Python directly from a GitHub repository:

If you are not familiar with GitHub, you can check out this short Github tutorial over here.

Once you have imported this data, it will be a data frame that looks as shown below. The data contains four independent variables and one dependent variable (‘buy’). We want to make a classification model that predicts whether a visitor will buy, based on the information in the other four columns.

Since the goal of this article is to introduce SMOTE as a solution for class imbalance, the first thing that we should do is to check out this imbalance in the data. The following code creates a bar graph that shows the class distribution of buyers vs non-buyers.

Using this code, you will obtain the following graph:

We clearly see that there are a lot of non-buyers against a small number of buyers.

## Stratified sampling

In this article, we’ll be creating a train/test split to benchmark the performance of our machine learning model on a data set that was not included in the model training. If you are not familiar with the train/test approach, I advise you to check out this article on the overall flow of machine learning projects.

When working with imbalanced data, use stratified sampling for your train/test split.

In cases with balanced data, we can generate a train/test set simply by randomly assigning 30% of the data to a test set. However, in imbalanced data, this should be avoided. There is a big risk of ending up with almost no cases of the minority class in the test set.

Stratified sampling is a solution to this. Stratified sampling will force the same class balance on the train and test data sets as in the original data. You can do stratified sampling using scikit-learn’s train_test_split as follows:
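The sketch below shows the `stratify` argument of `train_test_split`. Since the article’s data file is not reproduced here, the example uses a synthetic stand-in, which is an assumption: 1,000 rows with four numeric features and 5% buyers. The 30% test size and the names `X` and `y` are also choices made for this sketch.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 950 non-buyers (0) and 50 buyers (1),
# with four numeric features that are shifted for the buyers.
rng = np.random.default_rng(42)
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 4)) + 1.5 * y[:, None]

# stratify=y keeps the class proportions identical in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print("share of buyers, all data:", y.mean())
print("share of buyers, train:   ", y_train.mean())
print("share of buyers, test:    ", y_test.mean())
```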

Let’s now verify with a graph that the class distribution in train is the same as in the original data. This is an important check that has to be done before going into the modeling.

You will obtain the following graph. It confirms that the class distribution is equal, thanks to the stratified sampling that we used in scikit-learn’s train_test_split function.

To be thorough, let’s also check whether the stratification went well in the test data. You can use the following code to do this:

This code will create the same bar graph as before, except that it uses the test data as input. You will obtain the following graph:

Everything looks good: the training data and the test data both have the same class distribution as the original data. We can now move on to make a machine learning model for predicting which website visitors will end up buying something.

The machine learning model that we will use is Logistic Regression. Logistic Regression is one of the easiest models that we can use for classification. It is often a good first model to try when working on classification problems.

Let’s first build a Logistic Regression model on the original data, so that we have a benchmark for the performance when using SMOTE. You can use the following code to fit a Logistic Regression to the training data and create predictions on the test data.

To evaluate the predictive performance of this model, let’s start by looking at the confusion matrix. The confusion matrix shows predicted outcomes on one side and actual outcomes on the other side.

If you have more than two outcomes, the confusion matrix will give you the exact details for each combination of predicted and actual classes. When you have only two outcome classes, this translates to true positives, true negatives, false positives, and false negatives.

You can obtain the confusion matrix using the following code:
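A minimal sketch of the baseline model and its confusion matrix is given below. It again relies on a synthetic stand-in data set (an assumption, since the article’s data file is not reproduced here), so its counts will not match the ones reported in this article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 5% buyers, four numeric features.
rng = np.random.default_rng(42)
y = np.array([0] * 950 + [1] * 50)
X = rng.normal(size=(1000, 4)) + 1.5 * y[:, None]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Fit the benchmark model on the original (imbalanced) training data.
model = LogisticRegression()
model.fit(X_train, y_train)
preds = model.predict(X_test)

# For binary labels 0/1, ravel() returns the cells in this order.
tn, fp, fn, tp = confusion_matrix(y_test, preds).ravel()
print(f"True negatives:  {tn}")
print(f"False positives: {fp}")
print(f"False negatives: {fn}")
print(f"True positives:  {tp}")
```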

Outputs can differ slightly, as the random train/test split will cause differences in the exact test set. However, in my case, the obtained output is the following:

• True negatives: 281
• False positives: 4
• False negatives: 12
• True positives: 3

A second classification metric that we will use is scikit-learn’s classification report. It is a very useful tool that extracts numerous metrics about our model. You can obtain it as follows:

The obtained report is shown below. Interesting things to look at are precision for each class (0 = nonbuyers, 1 = buyers) and recall for each class.
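Because the report depends only on actual and predicted labels, it can be rebuilt from the four confusion matrix counts listed above. The sketch below does this with placeholder label arrays rather than the original data. The class names passed to `target_names` are labels chosen for this illustration.

```python
from sklearn.metrics import classification_report

# Rebuild label arrays from the reported counts:
# 281 true negatives, 4 false positives, 12 false negatives, 3 true positives.
y_true = [0] * 285 + [1] * 15
y_pred = [0] * 281 + [1] * 4 + [0] * 12 + [1] * 3

report = classification_report(
    y_true, y_pred, target_names=["nonbuyers", "buyers"]
)
print(report)
```

The "buyers" row of this report gives the precision and recall of the minority class, which are the numbers to watch when the data is imbalanced.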

## Redoing the Logistic Regression with SMOTE

As a next step, we are going to use the confusion matrix and the classification report as a benchmark. We will now apply SMOTE to reduce the class imbalance and compare the same metrics before and after the application of SMOTE.

The imblearn package is great for SMOTE in Python

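The first step is to use the SMOTE class in the imblearn package (imbalanced-learn) to create resampled versions of the training X and y. The test data stays untouched, so that the model is still evaluated on non-oversampled data. This can be done as follows: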

Let’s verify what this has done to our class imbalance. The following code will result in the same bar graph that we created earlier:

You will obtain the following graph:

This graph clearly shows that we now have a huge number of buyers that we did not have before. Those are all synthetic data points that have been created thanks to SMOTE.

Let’s now redo the model to investigate the effect of SMOTE on our classification metrics. You can redo the model with the following code:

We now redo the metrics that we also did in the previous model. This will allow us to compare the two and estimate what the impact of SMOTE has been. You can obtain the confusion matrix as follows:

You will obtain the following results (they may be slightly different due to the random creation of the train/test set):

• True negatives: 265 (was 281, so this has decreased with SMOTE)
• False positives: 20 (was 4, so this has increased with SMOTE)
• False negatives: 2 (was 12, so this has decreased with SMOTE)
• True positives: 13 (was 3, so this has increased with SMOTE)

This shows that SMOTE has caused overall accuracy to decrease slightly: 278 of 300 predictions are now correct (92.7%), against 284 of 300 (94.7%) before. However, thanks to SMOTE we did succeed in increasing the number of true positives (correctly identified buyers) substantially.

This confirms that, as explained earlier on, SMOTE is great when you want to shift your errors towards false positives rather than false negatives. In many business cases, false positives are less problematic than false negatives.

Let’s also generate the classification report. This can be done as follows:

The obtained classification report is shown here:

We can observe a number of changes when comparing this classification report to the previous one:

• Recall of nonbuyers went down from 0.99 to 0.93: more nonbuyers are now wrongly classified as buyers
• Recall of buyers went up from 0.2 to 0.87: we succeeded in identifying many more buyers
• The precision of buyers went down from 0.43 to 0.39: the cost of correctly identifying more buyers is that we now also incorrectly identify more buyers (identifying visitors as buyers while they are actually nonbuyers)!

This confirms the conclusion that we are now better able to find buyers, at the cost of also wrongly classifying more nonbuyers as buyers.
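These changes can be recomputed directly from the two confusion matrices reported above. The short sketch below does so in plain Python. The `summarize` helper is an illustrative name.

```python
def summarize(tn, fp, fn, tp):
    """Derive accuracy, recall per class and buyer precision from confusion matrix counts."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall_nonbuyers": tn / (tn + fp),
        "recall_buyers": tp / (tp + fn),
        "precision_buyers": tp / (tp + fp),
    }


# Counts reported in this article, before and after SMOTE.
before = summarize(tn=281, fp=4, fn=12, tp=3)
after = summarize(tn=265, fp=20, fn=2, tp=13)

for metric in before:
    print(f"{metric:18s} {before[metric]:.2f} -> {after[metric]:.2f}")
```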

# Conclusion

Throughout this article, you have discovered the SMOTE algorithm as a solution for imbalanced data in classification problems. SMOTE is an intelligent alternative to oversampling: rather than creating duplicates of the minority class, it creates synthetic data points that are relatively similar to the original ones.

Using SMOTE, your model will start detecting more cases of the minority class, which will result in an increased recall, but a decreased precision. Whether this is the desired behavior will always depend on your business case.

You have also seen how to implement SMOTE in Python. Using the SMOTE class in the imblearn package and a Logistic Regression on website sales data, you have confirmed that SMOTE makes for a higher recall of the minority class at the cost of lower precision.