Implementing KNN From Scratch


Implementation From Scratch

Before we dive straight into the actual algorithm, we need to lay some groundwork: define the hyperparameter k and "fit" the training dataset, which for KNN simply means storing the complete training data. This will prepare us for the upcoming computations.
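As a rough sketch, assuming a class named KNN with attributes k, X_train and y_train (these names are illustrative assumptions, not the original code), the groundwork could look like this:

```python
import numpy as np

class KNN:
    def __init__(self, k=3):
        # store the hyperparameter k: the number of neighbors to consider
        self.k = k

    def fit(self, X_train, y_train):
        # "fitting" KNN simply means storing the complete training data
        self.X_train = X_train
        self.y_train = y_train
```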

Now, we are ready to move on to the supposedly more interesting stuff.

Calculate the distances

We already stored the complete training dataset, allowing us to compute the distance between a new observation and every row of the training data.

In order to calculate the distance, we define a helper function returning the Euclidean distance for each pair of rows.

The Euclidean distance is simply the square root of the sum of squared differences between two rows (vectors), or more formally stated as the following equation:

d(x, y) = √( Σᵢ (xᵢ − yᵢ)² )

The implementation of the helper function is relatively straightforward since we can rely on NumPy to handle the vector operations.
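A minimal sketch of such a helper, assuming both rows are passed in as NumPy arrays (the name euclidean_distance is an illustrative assumption):

```python
import numpy as np

def euclidean_distance(row1, row2):
    # square root of the sum of squared element-wise differences
    return np.sqrt(np.sum((row1 - row2) ** 2))
```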

Once defined, we can call the helper function and store all distances between the new observation and the complete training data in a simple list.
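Continuing the sketch, and assuming a new observation x_new plus the stored training matrix X_train, this could be as simple as a list comprehension:

```python
# distance between the new observation and every row of the training data
distances = [euclidean_distance(x_new, row) for row in X_train]
```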

Get k-nearest indices

The next steps are pretty straightforward as well. We need to sort the list of distances in ascending order (from smallest to largest), allowing us to select only the k-nearest neighbors.

We can think of the selection radius defined by k with the help of the following illustration:

An example illustration of k-selection-radius [Image by Author]

Note: The value of k is an important parameter. A large value of k widens the radius, allowing for better generalization; a small value of k, on the other hand, gives more flexibility while being more vulnerable to noise. We can use visualization techniques such as the "Elbow Method" to tune the hyperparameter.
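For illustration only, a hypothetical tuning loop could evaluate several values of k on a held-out validation split (X_val and y_val are assumed to exist, and the KNN class with its predict method is the one assembled at the end of this article) and plot the validation error, looking for the "elbow" where the curve flattens:

```python
import numpy as np
import matplotlib.pyplot as plt

k_values = range(1, 21)
errors = []
for k in k_values:
    model = KNN(k=k)
    model.fit(X_train, y_train)
    predictions = model.predict(X_val)
    errors.append(np.mean(predictions != y_val))  # validation error rate

plt.plot(k_values, errors, marker="o")
plt.xlabel("k")
plt.ylabel("validation error")
plt.show()
```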

We can sort and select the indices with the built-in NumPy function numpy.argsort() combined with Python's slicing notation.
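Building on the distances list from above, this step could be a one-liner (k_nearest_indices is an illustrative name):

```python
# indices of the k training rows closest to the new observation
k_nearest_indices = np.argsort(distances)[:k]
```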

Return the most-common label

Now, we only have one thing left to do — we need to retrieve the corresponding class labels and return the most-common label as our prediction.

Based on the indices of the k-nearest neighbors, we can select the associated labels from our training data and store them in a list.

Utilizing two different NumPy functions — numpy.bincount() and numpy.argmax() — we can select the most-common label as our return value.
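Continuing from the earlier snippets, and assuming the class labels are encoded as non-negative integers (a requirement of numpy.bincount), these two steps could look like this:

```python
# class labels of the k nearest neighbors
k_nearest_labels = [y_train[i] for i in k_nearest_indices]

# count the occurrences of each label and pick the most frequent one
most_common_label = np.argmax(np.bincount(k_nearest_labels))
```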

Prediction and putting it all together

We have already completed most of the hard work by now.

The last thing we need to do is to repeat the steps for every row in the test dataset, which finishes our implementation and puts it all together.
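Putting the previous snippets together, one possible assembly of the full class could look like the following sketch (again, all names are illustrative assumptions, and the labels are assumed to be non-negative integers):

```python
import numpy as np

def euclidean_distance(row1, row2):
    # square root of the sum of squared element-wise differences
    return np.sqrt(np.sum((row1 - row2) ** 2))

class KNN:
    def __init__(self, k=3):
        self.k = k  # number of neighbors to consider

    def fit(self, X_train, y_train):
        # "fitting" KNN simply means storing the complete training data
        self.X_train = X_train
        self.y_train = y_train

    def _predict_single(self, x_new):
        # distances between the new observation and every training row
        distances = [euclidean_distance(x_new, row) for row in self.X_train]
        # indices of the k nearest neighbors
        k_nearest_indices = np.argsort(distances)[:self.k]
        # labels of the k nearest neighbors
        k_nearest_labels = [self.y_train[i] for i in k_nearest_indices]
        # most common label among the neighbors
        return np.argmax(np.bincount(k_nearest_labels))

    def predict(self, X_test):
        # repeat the single-sample prediction for every row of the test data
        return np.array([self._predict_single(x) for x in X_test])
```

Usage then mirrors the familiar fit/predict pattern: create the model with a chosen k, call fit with the training data, and call predict on the test data.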
