#4 — KNN Imputation
Let’s explore something a bit more advanced next. KNN stands for K-Nearest Neighbors, a simple algorithm that makes predictions based on a defined number of nearest neighbors. It calculates distances from the instance you want to classify to every other instance in the dataset. In this example, instead of predicting a class, the neighbors’ values are used to fill in the missing ones — classification becomes imputation.
Since KNN is a distance-based algorithm, you should consider scaling your dataset. You’ll see how in a bit.
Advantages: KNN imputation is easy to implement and optimize, and it also seems “smarter” than the previous techniques.
Disadvantages: It is sensitive to outliers because it relies on the Euclidean distance formula. It can’t be applied directly to categorical data, and it can be computationally expensive on large datasets.
Let’s start with dataset scaling. We’ll also keep the unscaled dataset, so a fair comparison can be made afterward. The following code snippet uses MinMaxScaler to scale the dataset:
Here’s what the scaled dataset looks like:
Let’s perform the imputation now. You’ll need a value for the n_neighbors parameter, but that’s something you can optimize later — a value of 3 should suit us fine. After the imputation, we’ll use the inverse_transform() function from MinMaxScaler to bring the scaled dataset back to its original form. Here’s the code:
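Again, the original code isn’t shown, so here’s a self-contained sketch using scikit-learn’s KNNImputer with the same toy stand-in dataset:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the article's dataset
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 47.0],
    "income": [40000.0, np.nan, 58000.0, 82000.0, np.nan],
    "score": [3.1, 2.7, 4.5, np.nan, 3.9],
})

# Scale first, since KNN is distance-based
scaler = MinMaxScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)

# Impute on the scaled data with 3 neighbors
imputer = KNNImputer(n_neighbors=3)
imputed_scaled = imputer.fit_transform(df_scaled)

# Undo the scaling so the imputed dataset is back in original units
df_imputed = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled), columns=df.columns
)
print(df_imputed)
```

KNNImputer computes distances with a NaN-aware Euclidean metric, so rows with missing values still participate in the neighbor search.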
Finally, let’s explore the results:
Here are the summary statistics:
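The statistics themselves aren’t reproduced here, but with the toy stand-in dataset they can be compared side by side like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the article's dataset
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 47.0],
    "income": [40000.0, np.nan, 58000.0, 82000.0, np.nan],
    "score": [3.1, 2.7, 4.5, np.nan, 3.9],
})

# Scale -> impute -> inverse-transform, as described above
scaler = MinMaxScaler()
imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(
    scaler.inverse_transform(imputer.fit_transform(scaler.fit_transform(df))),
    columns=df.columns,
)

# A good imputation should leave means and quartiles close to the
# originals (computed on the non-missing values)
print(df.describe())
print(df_imputed.describe())
```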
The summary statistics look impressive, but let’s explore the results visually before jumping to conclusions:
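The visual comparison isn’t included here either; one way to produce it, again with the toy stand-in dataset, is to overlay histograms of the original and imputed columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, safe for scripts
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

# Toy stand-in for the article's dataset
df = pd.DataFrame({
    "age": [25.0, 32.0, np.nan, 51.0, 47.0],
    "income": [40000.0, np.nan, 58000.0, 82000.0, np.nan],
    "score": [3.1, 2.7, 4.5, np.nan, 3.9],
})

scaler = MinMaxScaler()
imputer = KNNImputer(n_neighbors=3)
df_imputed = pd.DataFrame(
    scaler.inverse_transform(imputer.fit_transform(scaler.fit_transform(df))),
    columns=df.columns,
)

# One histogram panel per column: original (non-missing) values
# overlaid with the imputed dataset
fig, axes = plt.subplots(1, len(df.columns), figsize=(12, 3))
for ax, col in zip(axes, df.columns):
    ax.hist(df[col].dropna(), alpha=0.5, label="original")
    ax.hist(df_imputed[col], alpha=0.5, label="imputed")
    ax.set_title(col)
    ax.legend()
fig.tight_layout()
fig.savefig("knn_imputation_comparison.png")
```

If the imputed distributions track the original ones closely, that’s a better sign than matching summary statistics alone.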