Imputing Numerical Data: Top 5 Techniques Every Data Scientist Must Know


#4 — KNN Imputation

Let’s explore something a bit more advanced next. KNN stands for K-Nearest Neighbors, a simple algorithm that makes predictions based on a defined number of nearest neighbors. It calculates the distance from the instance you want to classify to every other instance in the dataset. In this context, “classification” means imputation: a missing value is filled in based on the values of its nearest neighbors.

Since KNN is a distance-based algorithm, you should consider scaling your dataset. You’ll see how in a bit.

Advantages: KNN imputation is easy to implement and optimize, and it’s “smarter” than the previous techniques because it takes relationships between features into account.

Disadvantages: It is sensitive to outliers due to the Euclidean distance formula. It can’t be applied to categorical data, and can be computationally expensive on large datasets.

Let’s start with dataset scaling first. We’ll also work with the unscaled dataset, so a fair comparison can be made afterward. The following code snippet uses MinMaxScaler to scale the dataset:
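The original snippet isn’t included here, but a minimal sketch of the scaling step might look like this, assuming scikit-learn’s `MinMaxScaler` (the DataFrame and its column names are made up for illustration; `MinMaxScaler` ignores NaNs when fitting and preserves them in the output):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset with missing values -- substitute your own DataFrame
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Fit the scaler and rebuild a DataFrame so column names survive the transform
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
```

Keeping a reference to the fitted `scaler` matters, because the same object is needed later to undo the scaling.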

Here’s what the scaled dataset looks like:

Image 9 — Scaled version of the original dataset (image by author)

Let’s perform the imputation now. You’ll need to choose a value for the n_neighbors parameter, but that’s something you can optimize later. A value of 3 should suit us fine. After the imputation, we’ll use the inverse_transform() function from MinMaxScaler to bring the scaled dataset back to its original form. Here’s the code:
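The code itself is missing from this copy, but the step described above might be sketched as follows, assuming scikit-learn’s `KNNImputer` (the DataFrame and its column names are hypothetical placeholders):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical dataset with missing values -- substitute your own DataFrame
df = pd.DataFrame({
    'A': [1.0, 2.0, np.nan, 4.0, 5.0],
    'B': [10.0, np.nan, 30.0, 40.0, 50.0],
})

# Scale first, since KNN relies on distances between instances
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Impute each missing value from its 3 nearest neighbors
imputer = KNNImputer(n_neighbors=3)
imputed_scaled = imputer.fit_transform(df_scaled)

# Undo the scaling so the values are back in their original units
df_imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                          columns=df.columns)
```

Note that `fit_transform` returns a NumPy array, which is why the result is wrapped back into a DataFrame with the original column names.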

Finally, let’s explore the results:

Here are the summary statistics:
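A quick way to produce per-column summary statistics is pandas’ `describe()`; the DataFrame below stands in for the imputed dataset from the previous step (its values are invented for illustration):

```python
import pandas as pd

# Hypothetical imputed dataset -- substitute the DataFrame your imputation produced
df_imputed = pd.DataFrame({
    'A': [1.0, 2.0, 3.2, 4.0, 5.0],
    'B': [10.0, 26.5, 30.0, 40.0, 50.0],
})

# describe() reports count, mean, std, min, quartiles, and max per column
print(df_imputed.describe())
```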

Image 10 — Summary statistics after KNN imputation (image by author)

The summary statistics look impressive, but let’s explore the results visually before jumping to conclusions:
