Which London borough is the safest to live in? KMeans Clustering Analysis
Dr Dilek Celik, PhD — Data Scientist https://www.linkedin.com/in/drdilekcelik/
Ahmet Celik, MA — Data Scientist https://www.linkedin.com/in/ahmetclk/
The aim of this analysis is to cluster the London boroughs by crime rate into four categories: high, upper moderate, lower moderate, and low. K-means, an unsupervised algorithm, is used for the clustering.
We would like a table with the borough as the index and the major crime categories as the columns, so let's reshape the data accordingly.
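The reshaping step can be sketched with a pandas pivot table. This is only an illustration: the column names `borough`, `major_category`, and `value`, and the sample rows, are assumptions and not the original dataset.

```python
import pandas as pd

# Hypothetical long-format crime records; column names and values are assumed.
records = pd.DataFrame({
    "borough": ["Camden", "Camden", "Bexley", "Bexley", "Camden"],
    "major_category": ["Burglary", "Robbery", "Burglary", "Robbery", "Burglary"],
    "value": [120, 45, 60, 10, 30],
})

# One row per borough, one column per major category, with counts summed.
crime_by_borough = records.pivot_table(
    index="borough",
    columns="major_category",
    values="value",
    aggfunc="sum",
    fill_value=0,
)
print(crime_by_borough)
```

Using `fill_value=0` keeps a borough with no recorded crime in a category as 0 rather than a missing value.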
Exploratory Data Analysis
Let's explore the data.
By looking at the output of describe and the graph below, we can check whether there are outliers. The graph shows some 0 values that could be considered outliers. However, since the aim is to identify the safest borough in London, a 0 indicates that no crime in that category was recorded in that borough. Therefore, these values are kept as they are.
Outliers can be removed using the IQR or the Z-score method. We will demonstrate how outliers might be removed with each. The dataset with IQR-based removal is used in the rest of this analysis.
Because K-means is distance based, it generally needs scaled features. However, if all features share the same unit of measure, there is no need to scale the data, and for such data scaling may even give worse results in some cases. In this analysis, we did not scale the data.
Assigning the data to X.
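The two removal approaches can be sketched as follows. The helper names, the 1.5 × IQR fence, the Z-score threshold of 3, and the sample values are all assumptions chosen for illustration, not the authors' exact code.

```python
import pandas as pd

def remove_outliers_iqr(df, k=1.5):
    """Keep rows where every column lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    within = (df >= q1 - k * iqr) & (df <= q3 + k * iqr)
    return df[within.all(axis=1)]

def remove_outliers_zscore(df, threshold=3.0):
    """Keep rows where every column has an absolute z-score <= threshold."""
    z = (df - df.mean()) / df.std(ddof=0)
    return df[(z.abs() <= threshold).all(axis=1)]

# Hypothetical data: the last row has an extreme Burglary value.
data = pd.DataFrame({
    "Burglary": [100, 102, 98, 101, 99, 103, 97, 100, 102, 98, 101, 1000],
    "Robbery": [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
})

iqr_clean = remove_outliers_iqr(data)
z_clean = remove_outliers_zscore(data)
print(len(data), len(iqr_clean), len(z_clean))
```

Note that with very few rows the Z-score method can miss outliers, because a single extreme value inflates the standard deviation it is measured against.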
The Hopkins test tells us whether the data is suitable for clustering. Under the convention in which values close to 0 indicate a clustering tendency, our value of 0.15 suggests that the data is suitable for clustering.
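As a rough illustration of what the statistic measures, here is a minimal sketch of a simplified Hopkins statistic. It assumes the convention in which values near 0 suggest clustered data and values near 0.5 suggest uniformly random data. The function name, the 10% sample size, and the synthetic data are assumptions; the library used in the original analysis may differ in details.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def hopkins_statistic(X, sample_size=None, seed=0):
    """Simplified Hopkins statistic: near 0 = clustered, near 0.5 = random."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    m = sample_size if sample_size is not None else max(1, n // 10)
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=2).fit(X)

    # w: distance from sampled real points to their nearest *other* real point
    # (column 0 is the point itself at distance 0).
    idx = rng.choice(n, size=m, replace=False)
    w = nn.kneighbors(X[idx], n_neighbors=2)[0][:, 1]

    # u: distance from uniform random points in the bounding box
    # to their nearest real point.
    random_points = rng.uniform(X.min(axis=0), X.max(axis=0), size=(m, d))
    u = nn.kneighbors(random_points, n_neighbors=1)[0][:, 0]

    return w.sum() / (u.sum() + w.sum())

# Synthetic check: two tight blobs versus uniform noise.
rng = np.random.default_rng(1)
clustered = np.vstack([
    rng.normal(0, 0.1, (100, 2)),
    rng.normal(10, 0.1, (100, 2)),
])
uniform = rng.uniform(0, 1, (200, 2))

h_clustered = hopkins_statistic(clustered)
h_uniform = hopkins_statistic(uniform)
print(round(h_clustered, 3), round(h_uniform, 3))
```

Some implementations report the complement of this value, so that values near 1 indicate clustering. It is worth checking which convention a given library uses before interpreting a score.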
Let's explore the elbow graph as a bar graph.
Another way of deciding on the number of clusters is to use the yellowbrick library. According to the graph below, the best number of clusters for this dataset is 4.
Silhouette analysis is another method we can use when deciding on the number of clusters. The silhouette scores for 3 and 4 clusters are very close to each other.
After using all three methods (elbow method, yellowbrick, and silhouette analysis), we can say that the best number of clusters for this dataset is 4. Now, let's build the K-means model based on this optimal number of clusters.
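The quantities behind an elbow graph can be sketched with scikit-learn alone. The synthetic four-blob data below is an assumption standing in for the borough table. The drop in inertia between consecutive values of k is what a bar graph of the elbow would display.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data with four well-separated groups (an assumption).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

# Inertia (within-cluster sum of squares) for each candidate k.
k_values = list(range(1, 9))
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)

# Improvement gained by moving from k-1 to k clusters; these are the bar heights.
drops = -np.diff(inertias)
for k, drop in zip(k_values[1:], drops):
    print(f"k={k}: inertia drops by {drop:.1f}")
```

The elbow is the last k with a large drop. The `drops` values can be passed to a plotting function such as matplotlib's `plt.bar` to reproduce the bar graph view.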
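A minimal sketch of the silhouette comparison, again on synthetic four-blob data that is assumed for illustration only:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with four well-separated groups (an assumption).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X, _ = make_blobs(n_samples=200, centers=centers, cluster_std=0.5, random_state=42)

# Mean silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k:", best_k)
```

When two candidates score almost the same, as reported here for 3 and 4 clusters, the silhouette score alone does not settle the choice and the other methods act as tie-breakers.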
Building the model based on the optimal number of clusters
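A hedged sketch of this step: fit K-means with four clusters, then name the clusters from low to high by ranking their centers. The synthetic table, the column choice, and the ranking by summed center values are assumptions, not necessarily how the authors labeled their clusters.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical borough-level table with four synthetic crime levels.
rng = np.random.default_rng(0)
levels = np.repeat([50, 150, 300, 600], 8)
data = pd.DataFrame(
    {
        "Burglary": levels + rng.normal(0, 5, 32),
        "Robbery": levels * 0.5 + rng.normal(0, 5, 32),
    },
    index=[f"borough_{i}" for i in range(32)],
)

# Fit K-means with the chosen number of clusters.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(data)
data_labeled = data.assign(cluster=kmeans.labels_)

# Cluster ids are arbitrary, so rank clusters by the sum of their center values.
order = np.argsort(kmeans.cluster_centers_.sum(axis=1))
names = ["low", "lower moderate", "upper moderate", "high"]
cluster_to_name = {int(cluster): name for cluster, name in zip(order, names)}
data_labeled["crime_level"] = data_labeled["cluster"].map(cluster_to_name)

print(data_labeled["crime_level"].value_counts())
```

Because K-means assigns cluster ids arbitrarily, mapping them to ordered names through the centers keeps the labels stable across runs.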
Visualization of Clusters by Features
Visualization of the Cluster Centers
Remodeling according to discriminating features
We remodelled using only the discriminating features, which are shown in the graph below. The new model includes Burglary, Criminal Damage, Robbery, Theft and Handling, and Violence Against the Person. We followed all the steps described above to train the model with these discriminating features, so these steps are not repeated here.
Explore Clusters Against Features
Predicting the cluster of new data
Congratulations! You have reached the end of this article.
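Assigning a new observation to a cluster can be sketched with `KMeans.predict`. The training values and the new borough's figures below are hypothetical.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical training table: four pairs of boroughs at different crime levels.
train = pd.DataFrame({
    "Burglary": [50, 55, 300, 310, 600, 590, 150, 160],
    "Robbery": [20, 25, 150, 140, 310, 300, 70, 80],
})
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(train)

# A new, unseen observation with the same columns as the training data.
new_borough = pd.DataFrame({"Burglary": [58], "Robbery": [22]})
predicted = kmeans.predict(new_borough)[0]

print("predicted cluster:", predicted)
print("cluster of the most similar training rows:", kmeans.labels_[0])
```

The new data must have the same columns, in the same order, as the data the model was trained on. Any preprocessing applied before training would also have to be applied to the new data.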