Supervised Learning — K Nearest Neighbors Algorithm (KNN)

Original Source Here

Supervised Learning — K Nearest Neighbors Algorithm (KNN)

This article explains one of the simplest machine learning algorithm K Nearest Neighbors(KNN). KNN classifier and KNN regression are explained with examples in this article.

k Nearest Neighbors Classification

K nearest neighbors algorithm basically predicts on the principle that the data is in the same class as the nearest data. According to name of the algorithm, “nearest neighbors” represents the closest data and “k” represents how many closest data is chosen. K value is a hyper parameter so it is tuned by the user and each trial usually gives different results. For example, let’s show the data in with 2 inputs(features) in the coordinate plane which x and y axis represent the inputs(features) of the dataset. As seen in Figure 1, blue circles and orange triangles represent our dataset, stars represent the data which is wondered the classes and they are put the model which is created with knn algorithm in order to predict classes of them. The algorithm assigns the data to the class of the closest data. In short Figure 1 illustrates that how does the algorithm predict a class of new value with different k(k=1(above), k=3(below)) values. This kind of problems that there are 2 outputs which are circle and triangle are called as binary classification. But knn also can apply for multiple classification. Now let’s take a closer look at the 3-neighbors classification example.

Figure 1. Prediction for k=1 and k=3

Before I start the example, I should mention about the sklearn library. It has already mentioned that knn algorithm makes prediction based on the closest data. This implementation can also be created manually with codes (google “knn algorithm without sklearn”). Instead, sklearn library ensure us to use knn algorithm with a few line of codes. By the way not only sklearn but also websites of all kind of algorithms guide while implementing the algorithm(google“sklearn.neighbors.KNeighborsClassifier”) . As it turns out, knn algorithm which is tuned the hyper parameter(k) by the user is one of the simplest algorithm in machine learning.

Figure 2. Implement knn with codes

Let’s continue the example, firstly dataset is separated into training set and test set by the train_test_split. After choosing 20% of the dataset is test set and 80% of the dataset is training set, our KNN model with 3 neighbors is defined as “clf” then model is trained with .fit command as seen in figure 2. After that, test set which is separated in the beginning is put to the model and is provided making prediction of the inputs with the command of clf.predict. Finally, prediction of the test set and test set which is known real outputs are compared with the clf.score command and accuracy rate is presented.

Figure 3. Borders of the classifications

The borders of the classification are shown in different k values in figure 3. As it is seen, the complexity is the model increases with the lower k value and the model becomes simpler as the k value increases. In other words, If the model became overfitting, model complexity should be decreased, i.e. increase the k value. In contrast if the model became underfitting, model complexity should be increased, i.e. decrease the k value. Model should be generalized by tuning the k value.

Figure 4. Effect of the k values on the model

So far, dataset with 2 inputs(features) is used to understand idea of the knn algorithm. Now let’s consider cancer dataset which has more than 2 inputs. Dataset consists of 569 rows and 30 columns. While the inputs consist of numerical data such as ‘mean radius’, ‘mean texture’, ‘mean perimeter’, ‘mean area’ containing the technical characteristic of the tissue, the output determines whether the tissue is malignant or benign according to inputs. This problem is also called as binary classification since there are 2 outputs. Figure 4 illustrates that effect of the k values on the model. Algorithm accuracies of different k values are shown in the graph. As seen in the graph, the highest accuracy rate on the test set is obtained when k=6 is chosen.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: