K-Nearest Neighbor (KNN) Algorithm
“Tell me who your friends are and I will tell you who you are”
As the saying goes, “A person is known by the company he keeps,” and it sounds quite intuitive, because people who keep each other’s company share similar interests. The same holds for data points.
Data points that are close to a particular data point tend to share similar properties, and there is a high possibility that they belong to the same class. These close data points are called the “neighbors” of that particular data point. So, today we are going to discuss one famous Machine Learning algorithm called “K-Nearest Neighbors,” or KNN for short.
We are going to discuss the following things in this article –
- Introduction to KNN
- Algorithm of KNN
- Different Distance functions
- Effect of K
- Decision rule and variations
- Data preprocessing for KNN
- Time and Space complexity analysis
- KNN Optimization
- Pros and Cons of KNN
- KNN implementation from scratch
Let me introduce myself!
“KNN is a supervised, non-parametric and lazy learning algorithm.”
Non-parametric means there is no assumption about the underlying data distribution; in other words, the model structure is determined from the dataset. This is very helpful in practice, since most real-world datasets do not follow theoretical mathematical assumptions. In Linear Regression, for example, we assume that the labels depend linearly on the features.
A lazy algorithm means there is no explicit training phase: the model simply stores the training data and defers all computation to the testing phase. This makes training fast but testing slow and costly, in both time and memory. In the worst case, KNN must scan all data points at prediction time, and storing all those training points requires more memory.
In K-NN, K is nothing but the number of nearest neighbors to consider while making decisions on the class of test data points. So, without further ado, let’s dive deep into the algorithm!
Algorithm of KNN
Suppose we have to find the class of an unknown point, called the query point. First, we find the K closest points by calculating their distances from the query point. Then we classify the query point by the majority vote of its K neighbors: each neighbor votes for its own class, and the class with the most votes is taken as the prediction and assigned to the query point. Pictorially,
What sets this algorithm apart from most other ML algorithms is that it requires no learning at all, only simple distance calculations.
To summarise, the algorithm includes the following basic steps:
- Calculating distances
- Finding K closest neighbors
- Taking the majority vote
The performance of the K-NN algorithm is influenced by three main factors –
- The distance function (or distance metric) used to determine the nearest neighbors.
- The number of neighbors (K) used to classify the new example.
- The decision rule used to derive a classification from the K nearest neighbors.
So, let’s dig deep into it and see how these factors affect it.
Euclidean Distance
The simple geometric distance between two points on the Euclidean plane is the Euclidean Distance. Extending it to multiple dimensions, we treat each feature as a dimension and calculate the distance as the square root of the sum of squared differences between corresponding elements of the two vectors. Mathematically,
d(x, y) = √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²)
Here, [x1, x2, …, xn] is the feature vector of point x and [y1, y2, …, yn] is that of point y.
Diagrammatically, for a 2-D plane,
Manhattan Distance
This distance is also known as taxicab distance or city block distance because of the way it is calculated: the distance between two points is the sum of the absolute differences of their Cartesian coordinates. The idea is that in a city, roads are (hypothetically) laid out at right angles, so to go from one place to another you cannot travel along the hypotenuse but only along the sides, leading to the following formula,
d(x, y) = |x1 − y1| + |x2 − y2| + … + |xn − yn|
Here, [x1, x2, …, xn] is the feature vector of point x and [y1, y2, …, yn] is that of point y.
Diagrammatically, for a 2-D plane,
Chebyshev Distance
For two vectors, the greatest of their differences along any coordinate dimension is the Chebyshev distance:
d(x, y) = max over i of |xi − yi|
It is also known as chessboard distance, since in the game of chess the minimum number of moves a king needs to go from one square on a chessboard to another equals the Chebyshev distance between the two squares.
Diagrammatically, for a 2-D plane,
So, for real-valued vectors, it follows from the triangle inequality that, for the same pair of points,
Distance : Manhattan ≥ Euclidean ≥ Chebyshev
Minkowski Distance
This is the generalized form of the above distance measures. Mathematically,
d(x, y) = (Σ over i of |xi − yi|^p)^(1/p)
The p-value in the formula can be manipulated to give us different distances like:
- p = 1, we get Manhattan distance.
- p = 2, we get Euclidean distance.
- p = ∞, we get Chebyshev distance.
The above distances are also termed as L1, L2, and L∞ norms respectively. They all belong to Lebesgue space abbreviated as L(p) space.
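As a quick sketch, all three distances can be computed from the Minkowski formula in pure Python (the function names here are illustrative):

```python
def minkowski(a, b, p):
    # p = 1 gives Manhattan distance, p = 2 gives Euclidean distance.
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def chebyshev(a, b):
    # The p -> infinity limit of Minkowski: the largest coordinate-wise
    # difference dominates the sum.
    return max(abs(x - y) for x, y in zip(a, b))
```

For the same pair of points the three values obey Manhattan ≥ Euclidean ≥ Chebyshev, e.g. for (1, 2) and (4, 6) they are 7, 5, and 4 respectively.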
Cosine Distance
This metric is mainly used to calculate the similarity between two vectors. It is based on the cosine of the angle between them, which determines whether the two vectors are pointing in roughly the same direction. It is used in text analytics to find similarities between two documents based on the number of times particular sets of words appear in them.
Mathematically, the cosine similarity is cos(θ) = (x · y) / (‖x‖ ‖y‖), and the cosine distance is 1 − cos(θ).
Diagrammatically on a 2-D plane,
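A minimal pure-Python sketch of cosine similarity and distance (illustrative names):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_distance(a, b):
    # 0 for vectors pointing the same way, 1 for orthogonal, 2 for opposite.
    return 1 - cosine_similarity(a, b)
```

Note that cosine distance ignores vector length: (1, 2) and (2, 4) point in the same direction and so have a distance of 0.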
Hamming Distance
This distance is used for categorical variables, where it measures the number of positions at which the features differ. For example, take the two strings “golang” and “gopher”: they differ in the last 4 positions, giving a Hamming distance of 4.
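The example above can be checked with a one-liner (a sketch; equal-length inputs assumed):

```python
def hamming_distance(a, b):
    # Count the positions at which two equal-length sequences differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(c1 != c2 for c1, c2 in zip(a, b))
```

Here hamming_distance("golang", "gopher") returns 4, matching the worked example.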
Euclidean is a good distance measure to use if the input variables are similar in type (e.g. all measured widths and heights). Manhattan distance is a good measure to use if the input variables are not similar in type (such as age, gender, height, etc).
As it is a distance-based ML method, it might be a good idea to standardize your data before training. Also, for high-dimensional data, information is lost through equidistant vectors, so dimension reduction is often applied prior to k-NN. We will discuss more in the data preprocessing part.
Effect of K
Generally, we choose a value of K that is not a multiple of the number of classes, to avoid ties where each class gets the same number of neighbors. Choosing different values of K results in different outputs. Let’s see this through an example –
If we take the K value as 3, as shown in the circle above, the red point belongs to class B, since class B data points are in the majority. But if the value of K is 6, the red point belongs to class A, as the majority of the K neighbors then belong to class A. So, we loop through values of K and choose the one that minimizes the error.
Now, you can see that low values of K produce sharp, abrupt decision boundaries and a greater number of small isolated clusters. With a very low K, the query point is labeled after only its few nearest points, so there is a chance of high variance (also known as overfitting). With high variance, training error is low but testing error is comparatively very high.
High values of K, on the other hand, produce relatively smoother boundaries and fewer clusters. With a very high K, the query point tends to be labeled with whichever class is in the majority in the dataset, so there is a chance of high bias (also known as underfitting). With high bias, both training error and testing error are very high.
So, we need to choose a value of K that is neither too high nor too low, and cross-validate to find a good balance without high bias or high variance. Thus, the value of K has a great impact on the bias-variance trade-off.
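One way to pick K, sketched in pure Python (the helper names and the single held-out validation split are illustrative; in practice you would use proper cross-validation):

```python
import math
from collections import Counter

def knn_predict(X, y, query, k):
    # Classify `query` by majority vote among its k nearest neighbors.
    neighbors = sorted(zip((math.dist(query, x) for x in X), y))
    return Counter(label for _, label in neighbors[:k]).most_common(1)[0][0]

def best_k(X_train, y_train, X_val, y_val, candidates):
    # Evaluate each candidate K on a held-out validation set and
    # return the one with the lowest error rate.
    def error(k):
        wrong = sum(knn_predict(X_train, y_train, q, k) != t
                    for q, t in zip(X_val, y_val))
        return wrong / len(y_val)
    return min(candidates, key=error)
```

The loop mirrors the advice above: try several values of K and keep the one that minimizes the validation error.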
Decision Rule and variations
After getting labels of K neighbors, we need to make a decision on what should be the label of this query data point. The output can be calculated as the class with the highest frequency from the K-most similar instances. Each instance in essence votes for their class and the class with the most votes is taken as the prediction. So, this is a simple majority voting rule.
When KNN is used for regression problems, the prediction is based on the mean or the median of the K-most similar instances. Median is less prone to outliers than mean.
In the above rule, we are labeling the query point based on majority voting in k nearest neighbors. Here lies a small problem in some cases.
Let’s assume K = 5, and among the 5 nearest neighbors there are 2 class A points very near the query point, at distances of 0.1 and 0.2 respectively, and 3 class B points at distances of 0.5, 0.9, and 1.2. According to the rule above, the query point belongs to class B, since the majority (3 out of 5) of the points belong to class B. But our intuition disagrees.
So, to resolve this problem, let’s introduce the effect of distance from the query point by giving each neighbor a weight that is the inverse of its distance from the query point. Since we know the distances of all K neighbors, we use them to compute a weight per class. The weight for class A is ((1/0.1) + (1/0.2)) = 15 and the weight for class B is ((1/0.5) + (1/0.9) + (1/1.2)) = 3.944.
The weights are obtained by taking the sum of inversion of distances from class A and B individually. As a class A weight is more than class B’s, the query point belongs to class A according to the weighted KNN rule. So, this rule also captures the information of distances of neighbors.
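The calculation above can be sketched as follows (an illustrative helper; neighbors are given as (distance, label) pairs):

```python
def weighted_vote(neighbors):
    # Each neighbor votes with weight 1/distance, so closer points
    # count for more than distant ones.
    weights = {}
    for distance, label in neighbors:
        weights[label] = weights.get(label, 0.0) + 1.0 / distance
    winner = max(weights, key=weights.get)
    return winner, weights
```

With the five neighbors from the example, class A gets weight 15 and class B about 3.944, so the weighted winner is A even though B holds the plain majority.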
Instead of reporting only the majority or most probable class, we can also provide the probability of each class.
Let’s take a simple example: K = 5, and among these 5 neighbors, 3 belong to class A and 2 belong to class B. Then we can assign a probability of 0.6 to class A and 0.4 to class B. We can also refine this by assigning weights to the neighbors and deriving the probabilities from the weights.
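As a sketch, converting neighbor votes into class probabilities (unweighted case; the helper name is illustrative):

```python
from collections import Counter

def class_probabilities(neighbor_labels):
    # Fraction of the K neighbors belonging to each class.
    counts = Counter(neighbor_labels)
    total = len(neighbor_labels)
    return {cls: count / total for cls, count in counts.items()}
```

With 3 A’s and 2 B’s among K = 5 neighbors this yields 0.6 and 0.4; a weighted variant would divide each class’s summed weight by the total weight instead.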
Data Preprocessing for KNN
All such distance-based algorithms are affected by the scale of the variables. Consider your data has an age variable which tells about the age of a person in years and an income variable which tells the monthly income of the person in rupees:
Here the age of the person ranges from 25 to 40 whereas the income variable ranges from 50,000 to 110,000. When you calculate the distance you will find that the most deciding factor is Income and not Age because the contribution of Age feature is insignificant compared to that of Income.
KNN performs much better if all of the data has the same scale. So we normalize our data to the range [0, 1]. It may also be a good idea to standardize our data if it has a Gaussian distribution. So after the operation, our data looks like this,
Now we will have equal contributions of both Age and Income in distance calculation.
Below are some of the ways of data rescaling:-
Min-Max Scaling (Normalization):- A scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1: X’ = (X − Xmin) / (Xmax − Xmin).
Standard Scaling (Standardization):- Another scaling technique, where the values are centered around the mean with a unit standard deviation: X’ = (X − µ) / 𝜎, where µ is the mean and 𝜎 is the standard deviation. After this operation, the mean of the attribute becomes zero and the resulting distribution has a unit standard deviation.
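A minimal sketch of both rescalings in pure Python (per-column helpers; the population standard deviation is assumed):

```python
def min_max_scale(values):
    # Shift and rescale so the smallest value maps to 0 and the largest to 1.
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Center on the mean and divide by the standard deviation.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]
```

Applied to the Age column, 25 maps to 0 and 40 to 1; applied to Income, 50,000 and 110,000 land in the same [0, 1] range, so both features now contribute comparably to the distance.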
KNN works well with a small number of input variables but struggles when the number of inputs is very large. Because each input variable can be considered a dimension of p-dimensional input space. For example, if you had two input variables x1 and x2, the input space would be 2-dimensional. As the number of dimensions increases, the volume of the input space increases at an exponential rate. In high dimensions, points that may be similar may have very large distances. All points will be far away from each other and our intuition for distances in simple 2 and 3-dimensional spaces breaks down. This might feel unintuitive at first, but this general problem is called the “Curse of Dimensionality”.
Some commonly used methods for dimensionality reduction are Principal Component Analysis, Feature Selection using Ensemble Trees, High Correlation Filter, etc.
Time and Space Complexity Analysis
The training phase technically does not exist, since all computation is deferred to prediction, so training takes O(1) time. A naive prediction, however, must compute the distance from the query to all n training points in d dimensions, costing O(n·d) time per query, and all n training points must be stored, costing O(n·d) space.
However, we can reduce the prediction time complexity by using appropriate data structures that enable faster searches, such as the K-D Tree.
The computational complexity of KNN increases with the size of the training dataset. For very large training sets, KNN can be made stochastic by taking a sample from the training dataset from which we calculate the K-most similar instances.
K-Dimensional Tree (K-D Tree)
As we have seen, KNN occupies a large space and consumes much time, so we can optimize its performance using a K-Dimensional Tree, also called a K-D Tree. Basically, it is a binary search tree where the data at each node is a K-dimensional point in space. In short, it is a space-partitioning data structure for organizing points in K-dimensional space.
The K-D tree is guaranteed to have a depth of order log2(n) where n is the number of points in the set. Traditionally, k-d trees store points in d-dimensional space which are equivalent to vectors in d-dimensional space.
Pick the X-axis and project each point onto it. Compute the median and split the data at the median. Then pick the Y-axis, project each point onto it, compute the median, and split again. Repeat these steps, alternating axes, to build the tree. A non-leaf node in a K-D Tree divides the space into two parts, called half-spaces: points on one side of its splitting plane are represented by its left subtree, and points on the other side by its right subtree.
If there is just one point, form a leaf with that point. Otherwise, divide the points in half with a line perpendicular to one of the axes, preferring the axis with the widest spread, and recursively construct K-D trees for the two halves.
To search, descend recursively to the cell containing the query point. On the way back up, search each subtree that might contain a point closer than the best one found so far.
Keep a variable holding the closest point found so far. Prune subtrees once their bounding boxes show they cannot contain any point closer to the query, and visit first the subtrees that maximize the chance of pruning.
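The build-and-search procedure above can be sketched in pure Python (a simplified single-nearest-neighbor version; the node layout and names are illustrative):

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_kdtree(points, depth=0):
    # Split on axes cyclically, at the median along the current axis.
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid],
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, depth=0, best=None):
    # Descend toward the query's cell first, then prune the far subtree
    # whenever its splitting plane is farther away than the best point.
    if node is None:
        return best
    axis = depth % len(query)
    if best is None or dist(query, node["point"]) < dist(query, best):
        best = node["point"]
    diff = query[axis] - node["point"][axis]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, depth + 1, best)
    if abs(diff) < dist(query, best):  # the far half-space may still hold a closer point
        best = nearest(far, query, depth + 1, best)
    return best
```

For K-NN proper one would keep the K best candidates in a bounded heap instead of a single best point, but the pruning logic is the same.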
K-D Tree search runs in O(log(n)) average time per query in a reasonable model. Storage for the K-D Tree is O(n). Preprocessing time is O(n·log(n)), assuming d is small.
Limitations of K-D Tree are –
1. When d is not small, time complexity increases drastically.
2. K-D Tree works well only on data that is uniformly distributed, but most real-world data are not uniformly distributed.
3. The K-D Tree was not invented for K-NN; it was mainly designed for information retrieval and computer graphics.
Locality Sensitive Hashing (LSH)
It’s a beautiful technique for finding similar points, i.e., nearest neighbors. It uses the concept of hashing for this. You might be familiar with how a hash table is constructed: it’s a very efficient data structure that lets us perform lookups in O(1) average time.
LSH is a hashing-based algorithm to identify the approximate nearest neighbors. In the normal nearest neighbor problem, there are a bunch of points in space, and given a new point, the objective is to identify the point in the training set closest to the given point.
Locality Sensitive Hashing is a set of techniques that dramatically speed up search-for-neighbors or near-duplicates detection on data. These techniques can be used to filter out duplicates of scraped web pages at an impressive speed, or to perform near-constant-time lookups of nearby points from a geospatial data set.
Hash functions typically have these key properties:
1. They map some type of input, such as strings or floats, to discrete values, such as integers.
2. They’re designed to control collisions: ordinary hash functions try to give different inputs different outputs, whereas locality-sensitive hash functions deliberately map similar inputs to the same output with high probability.
We apply a hash function to each available data point. This gives us the bucket or the key where the point can be placed. We try to maximize collisions so that similar points go to the same bucket. Now if we get an unlabelled query point and apply the hash function then it will go to the same bucket where similar points exist. This makes it easy to find the nearest neighbors for the query point. And subsequently, we can apply k-NN to predict its class.
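A sketch of one popular LSH family, random-hyperplane hashing for cosine similarity (pure Python; the function names are illustrative):

```python
import random

def make_hyperplanes(dim, n_planes, seed=0):
    # Each hyperplane is a random normal vector; each one contributes
    # one bit to the hash signature.
    rng = random.Random(seed)
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, planes):
    # Bit i is 1 if the vector lies on the positive side of plane i.
    # Vectors pointing in similar directions share most of their bits,
    # so they tend to land in the same bucket.
    return "".join("1" if sum(p * v for p, v in zip(plane, vector)) >= 0 else "0"
                   for plane in planes)

def build_buckets(points, planes):
    # Bucket the training points by signature; at query time, hash the
    # query and run plain K-NN only on the points in its bucket.
    buckets = {}
    for point in points:
        buckets.setdefault(signature(point, planes), []).append(point)
    return buckets
```

Note that this signature depends only on direction, not magnitude, which is exactly the invariance cosine similarity has.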
Pros and Cons of KNN
Pros of KNN –
- Very easy algorithm with simple underlying mathematics to implement.
- No training time is required.
- No need to build a model, tune several parameters or make additional assumptions.
- New data can be added seamlessly which will not impact the accuracy of the algorithm.
- The algorithm is versatile i.e. it can be used for both classification and regression problems.
Cons of KNN –
- Does not work well with large datasets, because the cost of calculating the distance between the new point and every existing point is huge, which degrades the performance of the algorithm.
- Does not work well in high dimensions, because distances become less meaningful there (the curse of dimensionality discussed earlier).
- We need to do feature scaling before applying the KNN algorithm.
KNN implementation from scratch
Now, it’s time to code the algorithm from scratch.
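The code gist that accompanied the original article is not reproduced here, so below is a compact from-scratch sketch of the classifier described above, using Euclidean distance and a simple majority vote (class and function names are illustrative):

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Lazy learner: "training" just stores the data.
        self.X, self.y = X, y
        return self

    def predict_one(self, query):
        # 1. Calculate distances, 2. keep the K closest, 3. majority vote.
        neighbors = sorted(zip((euclidean(x, query) for x in self.X), self.y))
        labels = [label for _, label in neighbors[:self.k]]
        return Counter(labels).most_common(1)[0][0]

    def predict(self, queries):
        return [self.predict_one(q) for q in queries]
```

Swapping the majority vote for a distance-weighted vote, or for the mean of the neighbors’ target values, gives the weighted-KNN and KNN-regression variants discussed earlier.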
So, in this article, we comprehensively discussed the KNN algorithm and the underlying mathematics of its distance calculations. Notably, it is a unique algorithm that requires no training. It is also a robust algorithm that makes no assumptions about the underlying data. We can use it for both classification and regression. Although it has a fairly high time complexity, we can optimize it by storing the training points in a K-D Tree or by applying Locality Sensitive Hashing. That was all about KNN! Hope you had a good time reading it.
So, revise, code, and watch out for our next article!