Which London borough is the safest to live in? KMeans Clustering Analysis



Dr Dilek Celik, PhD — Data Scientist https://www.linkedin.com/in/drdilekcelik/

Ahmet Celik, MA — Data Scientist https://www.linkedin.com/in/ahmetclk/


The aim of this analysis is to cluster the London boroughs by crime rate into four categories: high, upper moderate, lower moderate, and low. K-means, an unsupervised algorithm, is used for the clustering.

Import Libraries


We would like to have only the borough as the index and the major crime categories as the columns. So, let's reshape the data accordingly.
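A minimal sketch of this reshaping step with pandas is given here. The column names `borough`, `major_category`, and `value`, as well as the toy records, are assumptions made for illustration; the original dataset may use different names.

```python
import pandas as pd

# Toy stand-in for the raw crime records (hypothetical values and column names).
raw = pd.DataFrame({
    "borough": ["Camden", "Camden", "Camden", "Sutton", "Sutton"],
    "major_category": ["Burglary", "Robbery", "Burglary", "Burglary", "Robbery"],
    "value": [3, 1, 2, 1, 0],
})

# One row per borough, one column per major category, with summed counts.
borough_crime = raw.pivot_table(
    index="borough",
    columns="major_category",
    values="value",
    aggfunc="sum",
    fill_value=0,
)
print(borough_crime)
```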

Exploratory Data Analysis

Let's explore the data.

By looking at the output of describe and the graph below, we can judge whether there are outliers. The graph shows some 0 values that could be considered outliers. However, since the aim is to identify the safest borough of London, a 0 indicates that no crime of that category was recorded in the borough. Therefore, these values will remain as they are.

The IQR and Z-score methods can be used to remove outliers. We will demonstrate how outliers might be removed with each. The dataset with IQR-based removal is used in the rest of this analysis.
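The two removal rules can be sketched as follows. The helper names, the 1.5 IQR multiplier, the Z-score threshold, and the toy data are all assumptions chosen for illustration, not the values used in the original analysis.

```python
import pandas as pd

def remove_outliers_iqr(df, k=1.5):
    """Keep rows whose every column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    within = (df >= q1 - k * iqr) & (df <= q3 + k * iqr)
    return df[within.all(axis=1)]

def remove_outliers_zscore(df, threshold=3.0):
    """Keep rows whose every column has |z-score| <= threshold."""
    z = (df - df.mean()) / df.std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]

# Hypothetical counts: the last row is an obvious outlier in "Burglary".
toy = pd.DataFrame({
    "Burglary": [10, 12, 11, 13, 12, 100],
    "Robbery": [5, 6, 5, 7, 6, 6],
})

cleaned_iqr = remove_outliers_iqr(toy)
cleaned_z = remove_outliers_zscore(toy, threshold=2.0)
print(len(toy), len(cleaned_iqr), len(cleaned_z))
```

With very few rows, a Z-score threshold of 3 can never be exceeded, which is why the toy call uses a lower threshold.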

Scaling Data

K-means is a distance-based algorithm, so it usually needs scaled features. However, if all features share the same unit of measure, scaling is not strictly necessary, and for such data it may even give worse results in some cases. In this analysis, we did not scale the data.

K-Means Clustering

We assign the data to the variable X.

The Hopkins test tells us whether the data is suitable for clustering. Under the convention where values close to 0 indicate a clustering tendency, a value of 0.15 suggests that our data is suitable for clustering.
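For readers who want to see what such a statistic measures, here is a NumPy-only sketch of one common form of the Hopkins statistic, in which values near 0 suggest clustered data and values near 0.5 suggest uniformly spread data. The function name, the sampling fraction, and the synthetic data are assumptions; the original article may have used a library implementation with slightly different details.

```python
import numpy as np

def hopkins_statistic(X, sample_size=None, seed=0):
    """Hopkins statistic, in the convention where values near 0 mean clustered."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = sample_size or max(1, n // 10)
    rng = np.random.default_rng(seed)

    # w: distance from m sampled real points to their nearest other real point.
    idx = rng.choice(n, size=m, replace=False)
    dist_real = np.linalg.norm(X[idx, None, :] - X[None, :, :], axis=2)
    dist_real[np.arange(m), idx] = np.inf  # ignore each point's distance to itself
    w = dist_real.min(axis=1)

    # u: distance from m uniform random points (inside the data's bounding box)
    # to their nearest real point.
    uniform = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = np.linalg.norm(uniform[:, None, :] - X[None, :, :], axis=2).min(axis=1)

    return w.sum() / (u.sum() + w.sum())

# Synthetic check: two tight blobs versus uniformly scattered points.
rng = np.random.default_rng(1)
blobs = np.vstack([
    rng.normal(0.0, 0.1, size=(100, 2)),
    rng.normal(10.0, 0.1, size=(100, 2)),
])
scattered = rng.uniform(0.0, 10.0, size=(200, 2))

h_clustered = hopkins_statistic(blobs)
h_uniform = hopkins_statistic(scattered)
print(h_clustered, h_uniform)
```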

Let's view this elbow graph as a bar graph.
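The values behind an elbow graph are the within-cluster sums of squares (inertia) for a range of cluster counts. A minimal sketch with scikit-learn follows; the synthetic data, the range of k, and the `random_state` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic stand-in for X: four well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([rng.normal(c, 0.5, size=(25, 2)) for c in centers])

# Fit K-means for each candidate k and record the inertia.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

The "elbow" is the value of k after which the inertia stops dropping sharply; the same dictionary can be passed to any line or bar plotting routine.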

Another way of deciding on the number of clusters is to use the yellowbrick library. According to the graph below, the best number of clusters for this dataset is 4.

Silhouette analysis is another method we can use when deciding on the number of clusters. The silhouette scores for 3 and 4 clusters are very close to each other.
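A silhouette comparison across candidate cluster counts might look like the sketch below, using `silhouette_score` from scikit-learn. The synthetic data and the range of k are assumptions; on the real borough data the scores will differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X: four well-separated groups of points.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([rng.normal(c, 0.5, size=(25, 2)) for c in centers])

# The silhouette score needs at least 2 clusters, so start from k=2.
scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

Scores closer to 1 mean that points sit well inside their own cluster and far from neighbouring clusters.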

After using all the methods (elbow method, yellowbrick, and silhouette analysis), we can say that the best number of clusters for this dataset is 4. Now, let's build the K-means model based on this optimal number of clusters.

Building the model based on the optimal number of clusters
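A minimal sketch of fitting the four-cluster model and naming the clusters is shown here. The synthetic counts, the feature subset, and the rule of ranking clusters by the sum of their center coordinates are assumptions for illustration; the article does not state how the four labels were assigned.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical borough-by-category counts at four distinct crime levels.
rng = np.random.default_rng(0)
features = ["Burglary", "Robbery", "Criminal Damage"]
levels = [100, 400, 700, 1000]
data = np.vstack([rng.normal(level, 20, size=(8, 3)) for level in levels])
df = pd.DataFrame(data, columns=features)

# Fit K-means with the chosen number of clusters.
model = KMeans(n_clusters=4, n_init=10, random_state=42)
df["cluster"] = model.fit_predict(df[features])

# Rank clusters by the total of their center coordinates (assumed rule)
# and map the ranks to the four crime-level names.
order = np.argsort(model.cluster_centers_.sum(axis=1))
names = ["low", "lower moderate", "upper moderate", "high"]
label_of = {int(cluster_id): names[rank] for rank, cluster_id in enumerate(order)}
df["crime_level"] = df["cluster"].map(label_of)

print(df["crime_level"].value_counts())
```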

Visualization of Clusters by Features

Visualisation of the Cluster Centers

Remodeling according to discriminating features

We remodelled using only the discriminating features, which are shown in the graph below. The new model includes Burglary, Criminal Damage, Robbery, Theft and Handling, and Violence Against the Person. We followed all the steps described above to train the model with these features; those steps are not repeated here.

Visualise Clusters


Explore Clusters Against Features

Predicting the Cluster of New Data
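Once a K-means model is fitted, a new observation is assigned to the cluster with the nearest center via `predict`. The sketch below uses hypothetical two-feature counts and a hypothetical new borough; with the real model, the new row must contain the same features, in the same order, as the training data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical training data: two boroughs at each of four crime levels.
X = np.array([
    [100, 50], [110, 55],
    [400, 200], [410, 210],
    [700, 350], [720, 360],
    [1000, 500], [1010, 520],
])
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X)

# A new, unseen borough with counts close to the first pair above.
new_borough = np.array([[105, 52]])
predicted_cluster = model.predict(new_borough)[0]
print("predicted cluster:", predicted_cluster)
```

The returned number is only a cluster id; it has to be mapped back to a crime-level name using whatever labelling was applied to the fitted clusters.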

Congratulations! You have reached the end of this article.

