6 Tips for Dealing With Null Values

https://miro.medium.com/max/1200/0*Q0g81P6WZsIRu3O4

Original Source Here

Photo by Ben Hershey on Unsplash

6 Tips for Dealing With Null Values

Includes Iterative Method, Mean and Median Fill with Groupby, Mean and Median Fill

Null values are a big problem in machine learning and deep learning. If you are using sklearn, TensorFlow, or any other machine learning or deep learning packages, it is required to clean up null values before you pass your data to the machine learning or deep learning framework. Otherwise, it will give you a long and ugly error message.

In this article, we will work on how to handle null values. First, there will be some very simple methods and slowly we will move toward some more complex and more efficient ways.

To demonstrate the handling of null values, We will use the famous titanic dataset.

import pandas as pd
import numpy as np
import seaborn as sns
titanic = sns.load_dataset("titanic")
titanic

The preview is already showing some null values. Let’s check how many null values are there in each column:

titanic.isnull().sum()

Output:

survived         0
pclass 0
sex 0
age 177
sibsp 0
parch 0
fare 0
embarked 2
class 0
who 0
adult_male 0
deck 688
embark_town 2
alive 0
alone 0
dtype: int64

The age column has 177 and the embark_town column has 2 null values. But the deck column has the most null values 688 out of 891 rows of data. I would like to delete this column completely for machine learning or any other type of data analytics purpose.

We will focus on the age and embark_town columns and deal with the null values of those columns.

Let’s start!

I will start with the most simple strategy and slowly move towards the more complicated tricks.

1. Simply Drop

The most simple strategy is to drop the rows with null values if you have a good enough size of data using this simple code:

titanic.dropna()

But the titanic dataset is not too big. In fact in the real world also we do not have the luxury of deleting the data lots of time because we do not have enough data left after deleting all rows with null values.

2. Filling with Zeros

Another very easy and simple way. You can fill up all the null values with zeros to make the process really simple. We can fill up the null values in the age column with zeros like this:

titanic['age'].fillna(0)

Output:

0      22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 0.0
889 26.0
890 32.0
Name: age, Length: 891, dtype: float64

Look at row 888. It was null before and now it is zero. This is also a very naive approach. Especially, in this case, age cannot be zero.

3. Forward and Backward Fill

This is also a common technique to fill up the null values. Forward fill means, the null value is filled up using the previous value in the series and backward fill means the null value is filled up with the next value in the series.

titanic['age'].ffill()

Output:

0      22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 19.0
889 26.0
890 32.0
Name: age, Length: 891, dtype: float64

Please notice, that row 888 is now 19, the same as the 887.

titanic['age'].bfill()

Output:

0      22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 26.0
889 26.0
890 32.0
Name: age, Length: 891, dtype: float64

This row 888 has taken the value from row 889.

4. Mean and Median Fill

I prefer this one in most cases, filling up the null values by mean and median. Here I am using the median:

titanic['age'].fillna(titanic['age'].median(), inplace = True)
titanic['age']

Output:

0      22.0
1 38.0
2 26.0
3 35.0
4 35.0
...
886 27.0
887 19.0
888 28.0
889 26.0
890 32.0
Name: age, Length: 891, dtype: float64

Now the null value in row 888 is 28, which is the median of the column age.

5. Mean and Median Fill with Groupby

Putting the median or mean of the whole column was the simple approach. But I like a bit more specific approach to the median and mean. Instead of taking the median of the whole age column and filling up all the null values, filling up the null values using the mean age of each pclass and ‘alive’ will be more accurate.

Let’s see the mean age of ‘pclass’ and ‘alive’ if they vary at all:

titanic.groupby(['pclass', 'alive'])['age'].mean()

Output:

pclass  alive
1 no 43.695312
yes 35.368197
2 no 33.544444
yes 25.901566
3 no 26.555556
yes 20.646118
Name: age, dtype: float64

Yes, they do vary a lot.

It will be more accurate if we fill up the null values of each of these groups with the corresponding means.

titanic['age'].fillna(titanic.groupby(['pclass', 'sex'])['age'].transform('mean'))

Output:

0      22.00
1 38.00
2 26.00
3 35.00
4 35.00
...
886 27.00
887 19.00
888 21.75
889 26.00
890 32.00
Name: age, Length: 891, dtype: float64

Now, row 888 became 21. Compared to 28

Categorical Null Values Imputation

The embark_town column is a categorical column and we can still apply the same process as the previous example with it.

But before that the values of the ‘embark_town’ column need to be expressed as numeric values:

titanic['embark_town'] = titanic['embark_town'].astype('category')
titanic['embark_town'] = titanic['embark_town'].cat.codes
titanic['embark_town']

Output:

0      2
1 0
2 2
3 2
4 2
..
886 2
887 2
888 2
889 0
890 1
Name: embark_town, Length: 891, dtype: int8

I will group the data by ‘pclass’ and ‘alive’ and fill up the null values by the median.

titanic['embark_town'] = titanic['embark_town'].fillna(titanic.groupby(['pclass', 'alive'])['embark_town'].transform('median'))

6. Iterative Imputation With a Machine Learning Model

This is a very good and efficient way of imputing the null values. In this process, null values in each column get filled up. Each column is used as the label of a specified machine learning model one by one.

Then the rows with non-null values are used to train a machine learning model and the rows with null values are predicted using that trained model.

For this demonstration, I am using only a few columns. Remember in the beginning we checked which columns have null values and we saw age, embark_town, and deck columns have null values.

The column deck had too many null values though and we wanted to avoid that column. But for the demonstration purpose, I am adding the deck column also in the titanic1 dataset.

I am using RandomForestRegressor here. You can use any other regression model.

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
titanic1 = titanic[['survived', 'pclass', 'age', 'sibsp', 'fare', 'embark_town']]

Let’s do the imputation now and I will save the imputed dataset as titanic2.

imptr = IterativeImputer(RandomForestRegressor(), max_iter=10, random_state=0)
titanic2 = pd.DataFrame(imptr.fit_transform(titanic1), columns = titanic1.columns)
titanic2

The age in 888 row is 32 now!

We should check if there is any null values anymore in titanic2 dataset:

titanic2.isnull().sum()

Output:

survived       0
pclass 0
age 0
sibsp 0
fare 0
deck 0
embark_town 0
dtype: int64

There are no null values in any of the columns anymore.

These are all the tips I wanted to share about null values today.

Conclusion

If you want you can choose individual techniques for individual columns as well. Please feel free to share if you find any other techniques to deal with null values more efficient.

Please feel free to follow me on Twitter, and the Facebook page

More Reading

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: