K-Means Clustering: Concepts and Implementation in R for Data Science




An intro to understanding k-means, with an implementation in R and a method for choosing the best K

Clustering algorithms in Machine Learning are unsupervised techniques (they work on input data without labelled responses). Their objective is to find patterns in the data and group observations into different clusters based on their similarities. K-Means Clustering is one way of implementing a clustering algorithm, and it successfully summarizes high-dimensional data.

K-means clustering partitions a group of observations into a fixed, initially specified number of clusters based on their similar characteristics.

Photo by Vino Li on Unsplash

However, two questions arise when grouping observations:

1) What does it mean for things to be similar to each other?

2) How do we determine things are close enough to be grouped together?

The scope of this blogpost is to answer these two questions, explain the K-means concept, decide the best K, and implement the algorithm on a dataset in R.

Once we have defined a) the number of clusters we need, b) an initial guess for the position of our cluster centres (centroids) and c) a distance metric, we can apply K-means to get a final estimate of the cluster centroids and an assignment of each observation to a centroid.

Understanding the algorithm:

For ease of understanding, let’s assume we have a dataset with a total of 10 observations. Looking at the data, we can see that it falls naturally into 3 different clusters, so we will work with that.

First, we select the number of clusters we want to categorize our data into (that’s the K in K-means). Here, let’s set K = 3 since that’s visually deducible; we will go over a more technical method of deciding K in a while.

Sample dataset (image by author)

The next step is to randomly pick three distinct data points to act as our initial cluster centres, or ‘centroids’, denoted by the colored triangles in the graph below. We then measure the distance between one data point and each of the three centroids, and assign it the color of the centroid it is closest to. This is repeated until every data point has been assigned to one of the centroids.

Choosing random K centroids (image by author)

Next, we calculate the mean of each cluster with respect to the data points assigned to its centroid. This mean becomes the new location of each centroid, and we re-position the centroids on the graph as shown below. We then repeat the earlier step, re-calculating the distance of each point to all the centroids and coloring them accordingly, until the positions of the centroids no longer change. The graph below is what we expect to get once there are no more changes.

Re-positioning centroids to their cluster points (image by author)

This is how K-means splits our dataset into a specified number of clusters based on a distance metric. The distance metric used in these two-dimensional plots is the Euclidean distance: for two points (x₁, y₁) and (x₂, y₂), it is the square root of ((x₁ − x₂)² + (y₁ − y₂)²).
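To make the assign-and-update loop described above concrete, here is a minimal sketch of it in R. This is illustrative only: simple_kmeans is a made-up name, it assumes a numeric data frame or matrix x and a chosen k, it does not handle empty clusters, and it is not the implementation behind R’s built-in kmeans(), which we use later.

simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # randomly pick k distinct observations as the initial centroids
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  for (iter in 1:max_iter) {
    # Euclidean distance from every observation to every centroid
    d <- sapply(1:k, function(j)
      sqrt(rowSums((x - matrix(centers[j, ], nrow(x), ncol(x), byrow = TRUE))^2)))
    assignment <- max.col(-d)  # index of the nearest centroid for each observation
    # move each centroid to the mean of the points currently assigned to it
    new_centers <- t(sapply(1:k, function(j)
      colMeans(x[assignment == j, , drop = FALSE])))
    if (all(abs(new_centers - centers) < 1e-8)) break  # centroids stopped moving
    centers <- new_centers
  }
  list(cluster = assignment, centers = centers)
}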

Implementing K-means in R:

Step 1: Installing the relevant packages and loading their libraries

install.packages("dplyr")
install.packages("ggplot2")
install.packages("ggfortify")
library("ggplot2")
library("dplyr")
library("ggfortify")

Step 2: Loading and making sense of the dataset

Iris is a built-in dataset that comes with R, containing 150 observations of flowers from 3 different iris species (Iris setosa, versicolor and virginica). We will be using it to test our algorithm.

summary(iris)
head(iris)
Output of summary(iris) and head(iris) (image by author)

Step 3: Eliminating the target variable

Since the categorization of observations has already been done in this dataset, we need to remove the target variable (Species) from our data, as we want the algorithm to do that grouping itself. For this, I will load the first four columns of iris into my data frame ‘data’.

data <- select(iris, c(1:4))
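If you would rather not depend on dplyr for this step, a base-R equivalent of the line above (an alternative, not part of the original post) is:

data <- iris[, 1:4]  # keep the four measurement columns, dropping Species
head(data)           # quick check that the target variable is gone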

How to determine what value to use for K?

Step 4: The elbow point technique

While there are many approaches to deciding the number of clusters, the Elbow point method (though not very accurate, as we will see) is widely used. The idea is to assess the quality of a clustering by adding up the variation within each cluster, keeping track of this value and starting over with different numbers of clusters; the setting with the least within-cluster variation wins. The Elbow point method plots this variation against the number of clusters (K), and the elbow point, the value of K beyond which the reduction in variation is no longer steep, is our best K.

We don’t have a built-in function that measures and plots this within-group variation for us. However, there is an RPubs write-up that defines a wssplot (Within-group Sum of Squares plot) function we can use to implement the Elbow point method.

wssplot <- function(data, nc = 15, seed = 1234) {
  # WSS for a single cluster: the total variance of the data times (n - 1)
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  # WSS for 2 to nc clusters, re-seeding so the starting points are reproducible
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of Clusters",
       ylab = "Within groups sum of squares")
  wss
}
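The WSS plot shown below is then, presumably, produced by calling this function on our data frame:

wssplot(data)  # plots WSS for K = 1..15 so we can look for the elbow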

The plot shows a sharp bend at K = 2, indicating that the optimal number of clusters for our dataset is 2.

WSS Plot (image by author)

Step 5: Implementing K-means

Just as simple as it looks, kmeans() only requires us to pass in our data frame and specify K.

kmean <- kmeans(data, 2)
kmean$centers

kmean$cluster will return a vector of numbers ranging from 1 to 2, indicating whether each observation belongs to cluster 1 or cluster 2. kmean$centers returns the location of each centroid. For example, cluster 1 has a mean value of Sepal.Length = 5.00, Sepal.Width = 3.36, Petal.Length = 1.56 and Petal.Width = 0.29.

Output of kmean$centers, K = 2 (image by author)
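Because iris retains the true species labels, one optional sanity check (not part of the original walkthrough) is to cross-tabulate the cluster assignments against the species:

table(kmean$cluster, iris$Species)  # how the 2 clusters map onto the 3 species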

Step 6: Plotting our data-points in clusters

Even though this plot looks great and has clearly grouped our observations into 2 clusters, we already know that our dataset has 3 groups in total. Our Elbow point technique wasn’t entirely accurate in giving us the right K. Hence, as a rule of thumb, it is always better to iterate over the K values around the elbow point and decide the best course of action ourselves.

autoplot(kmean, data, frame = TRUE)
Dataplot after clustering, K = 2 (image by author)

Step 7: Kmeans with K = 3

Since we have decided to change K, let’s see how the data patterns and results change.

kmean <- kmeans(data, 3)
kmean$centers
Output of kmean$centers, K = 3 (image by author)

Step 8: Plotting the new clustered graph

We see how kmean$cluster has now divided the observations into three clusters, and kmean$centers has updated the centroid values as well. The plot below shows the grouping based on 3 clusters. Again, the choice of K is up to us; the K-determining techniques only give us a good estimate to work with.
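The K = 3 plot below can presumably be reproduced with the same autoplot() call used earlier, now applied to the new fit:

autoplot(kmean, data, frame = TRUE)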

Dataplot after clustering, K = 3 (image by author)

K-means is an efficient Machine Learning technique that:

  • is easy to implement and apply
  • has great interpretability
  • produces tighter clusters than Hierarchical clustering
  • is computationally fast

However, K-means has its shortcomings: K has to be chosen manually through an iterative approach, the results depend on the initial centroids, and outliers can distort the centroid positions. This blogpost focused on explaining the main concepts of K-means, discussed a technique for deciding the value of K, implemented K-means in R and highlighted some of its pros and cons.
