Top 3 Visualization Python Packages to Help Your Data Science Activities


2. Missingno

Data exploration is not limited to the values present in a dataset; it also covers the values that are missing. Sometimes data go missing by accident or pure chance, but often that is not the case. Missing data can reveal insights we would otherwise never see. If you want to understand how missing data is classified, you can read about it in the article below.

Introducing missingno, a package developed specifically to visualize missing data. It provides easy-to-use one-liners for interpreting missing data and for showing how missingness relates across features. Let’s try this package to get a better understanding.

First, we need to install the missingno package.

pip install missingno

For this example, I will use the Missing Migrants Project dataset from Kaggle.

import pandas as pd
import missingno

df = pd.read_csv('MissingMigrants-Global-2019-12-31_correct.csv')
df.info()
Output of df.info() (Image from Author)
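If you prefer numbers to the df.info() summary, a minimal sketch like the one below counts the missing values per column. The toy DataFrame and its column names are made up for illustration; they are not the Missing Migrants columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real dataset (column names are hypothetical).
toy = pd.DataFrame({
    "region": ["A", "B", None, "D"],
    "survivors": [1.0, np.nan, np.nan, 4.0],
    "year": [2019, 2018, 2017, 2016],
})

# Count missing values per column, with the most incomplete column first.
missing_counts = toy.isna().sum().sort_values(ascending=False)
print(missing_counts)
```

Running the same two calls on the real df should give a quick ranking of the columns by how much data they lack.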

The dataset contains a lot of missing data, and the amount differs from column to column. Why is this happening? Are there missing data patterns here? Let’s find out using missingno. To begin, let’s visualize the missing data numbers.

missingno.bar(df, figsize=(10,5), fontsize=12, sort="ascending")
Missing data bar chart (Image by Author)

The function above produces a bar chart of the number of non-missing values in each column, so a shorter bar means more missing data. As we can see, ‘Minimum Estimated Number of Missing’ has the most missing data, followed by ‘Number of Children’, ‘Number of Survivors’, and so on. If you want to plot the counts on a log scale, you can use the following code.

missingno.bar(df, log=True, figsize=(10,5), color="tab:green", fontsize=12, sort="ascending")
Log Missing Data Number (Image by Author)

The log scale makes columns with only a few present values easier to read and compare. We can see that the ‘Minimum Estimated Number of Missing’ column has values in less than 10% of the rows, and the rest is missing.
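To check a percentage like this directly instead of reading it off the chart, here is a small sketch that computes the share of present values per column. The toy data and column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data: one column with a single present value out of 20 rows
# (names are hypothetical, not taken from the dataset).
toy = pd.DataFrame({
    "mostly_missing": [1.0] + [np.nan] * 19,
    "complete": list(range(20)),
})

# notna() gives True for present values; the mean of booleans is the share present.
present_share = toy.notna().mean()
present_percent = (present_share * 100).round(1)
print(present_percent)
```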

Missing data can follow a pattern, whether it depends on the presence of another column, on time, or on pure chance. To see this pattern, let’s visualize where the missing values sit in the dataset using a matrix plot.

missingno.matrix(df, figsize=(10,5), fontsize=12)
Missing Data Matrix (Image by Author)

For context, the Missing Migrants Project dataset is sorted by time (2014–2019), from the most recent to the oldest. Rows at the top are the most recent (2019), and rows at the bottom are the oldest (2014).

If we look at the plot above, the missing values in ‘URL’ are concentrated in the older records, and ‘Number of Males’ shows a similar pattern. This is different from the ‘Migration Routes’ column, where more data is missing in recent records. The plot gives us a better insight into what happened in our dataset.
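A time pattern like this can also be checked numerically. The sketch below computes the share of missing values per year on toy data; the 'year' and 'url' columns are hypothetical stand-ins, so you would need to adapt the names to the real dataset.

```python
import pandas as pd

# Toy data sorted from recent to old, mimicking the layout described above
# (column names and values are made up).
toy = pd.DataFrame({
    "year": [2019, 2019, 2018, 2018, 2014, 2014],
    "url": ["u1", "u2", "u3", None, None, None],
})

# Flag missing values, then average the flags within each year.
missing_share_by_year = toy["url"].isna().groupby(toy["year"]).mean()
print(missing_share_by_year)
```

A share that rises toward the older years would match what the matrix plot suggests visually.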

Using missingno, we can also visualize the nullity correlation (ranging from -1 to 1), which measures how the missingness of one feature relates to that of another. Let’s try it.

missingno.heatmap(df, figsize=(10,5), fontsize=12)
Nullity Correlation Heatmap (Image by Author)

The nullity correlation describes the relationship between the missing data of two columns. A score close to -1 means that where one column’s data is present, the other column’s data tends to be missing. In contrast, a score close to 1 means that where one column’s data is present, the other column’s data tends to be present as well. A score of 0 means there is no correlation between the missingness of the two features.
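To make the scores concrete, here is a minimal sketch that computes a nullity correlation with plain pandas on toy data. It is an approximation of the idea, not missingno's own code, and the column names are invented. Columns whose nullity never changes are dropped first, because their correlation is undefined.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns): "b" is missing where "a" is missing,
# "c" is missing where "a" is present, "d" has no missing values.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "b": [1.0, np.nan, 3.0, np.nan, 5.0, np.nan],
    "c": [np.nan, 2.0, np.nan, 4.0, np.nan, 6.0],
    "d": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
})

nullity = toy.isna()

# Keep only columns that are sometimes missing and sometimes present.
varying = nullity.loc[:, nullity.nunique() > 1]

# Pearson correlation between the 0/1 missingness indicators.
nullity_corr = varying.astype(int).corr()
print(nullity_corr)
```

Here "a" and "b" score 1 because they are always missing together, while "a" and "c" score -1 because one is present exactly when the other is missing.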

To understand the missing data relationship between features even more deeply, we can use missingno to build a dendrogram based on a hierarchical clustering algorithm and the nullity correlation.

missingno.dendrogram(df, figsize=(10,5), fontsize=12)
Missing Data Dendrogram (Image by Author)

To interpret the dendrogram above, we read it from a top-down perspective. Features or clusters that are linked at the smallest distance predict each other’s missing or present data best. For example, ‘Number of Survivors’ and ‘Minimum Estimated Number of Missing’ are clustered together earlier than the others, which means they predict each other better than the remaining features do.

Cluster leaves linked at a distance of zero fully predict one another (one is always missing when the other is present, or both are always missing or present together). Leaves that split at a distance above zero can still predict each other, but imperfectly (the closer to zero, the better they predict each other’s missing or present data).
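As a rough illustration of what "fully predict one another" means, the sketch below checks whether two columns have identical or exactly opposite missingness patterns. This is not the clustering missingno performs; the helper `fully_predicts` and the toy columns are hypothetical and only mirror the description above.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns).
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [5.0, np.nan, 7.0, np.nan],  # missing exactly where "a" is missing
    "c": [np.nan, 2.0, np.nan, 4.0],  # present exactly where "a" is missing
    "d": [1.0, 2.0, np.nan, np.nan],  # only partly aligned with "a"
})


def fully_predicts(frame, left, right):
    """Hypothetical helper: True if the two nullity patterns are
    identical in every row or opposite in every row."""
    left_null = frame[left].isna()
    right_null = frame[right].isna()
    same_everywhere = (left_null == right_null).all()
    opposite_everywhere = (left_null != right_null).all()
    return bool(same_everywhere or opposite_everywhere)


print(fully_predicts(toy, "a", "b"))
print(fully_predicts(toy, "a", "c"))
print(fully_predicts(toy, "a", "d"))
```

Pairs that fail this check, like "a" and "d", correspond to leaves that can only predict each other imperfectly.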
