Clustering Geolocation Data in Python using DBSCAN and K-Means



Pic credits: Springer

Clustering is a technique for dividing a population or set of data points into groups on the basis of the similarity and dissimilarity between them. It helps in determining the intrinsic groups among unlabeled data points.

Pic credits: Springer

Applications of Clustering —

  1. Geolocation Data Clustering
  2. Market Segmentation — grouping people with similar purchasing behaviour, discovering new customer segments for marketing, etc.
  3. News — grouping related news stories together
  4. Search Engines — grouping similar results together
  5. Social Network Analysis
  6. Image Segmentation
  7. Anomaly Detection
  8. Insurance fraud detection, etc.

K-Means clustering is an unsupervised ML technique that groups an unlabeled dataset into K clusters. It can be summarized as follows (a code sketch follows the steps) —

i. Choose the number of clusters, K

ii. Find the centroids of the current partition

iii. Calculate the distance from each point to the centroids

iv. Assign each point to the cluster with the nearest centroid

v. After re-allotting the points, recompute the centroid of each new cluster, and repeat until the assignments stop changing
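
To make these steps concrete, here is a minimal NumPy sketch of the K-Means loop (illustrative only; the project below uses scikit-learn's KMeans, and this toy version does not handle empty clusters):

import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # i. pick K initial centroids at random from the data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # iii. distance from every point to every centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # iv. assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # v. recompute each centroid as the mean of its assigned points
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centroids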

Pic credits: Pinterest

In this project we will use a taxi dataset (which can be downloaded from Kaggle) and cluster its geolocation data with K-Means. We will also demonstrate DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which discovers clusters of different shapes and sizes in data containing noise and outliers, and HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), which in effect performs DBSCAN over varying epsilon values and integrates the results to find the clustering that gives the best stability over epsilon.
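
As a quick illustration of the "different shapes and sizes" point (a toy example, not part of the original project), DBSCAN cleanly separates the two interleaved crescents of scikit-learn's make_moons, a shape K-Means cannot split correctly:

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# two interleaved half-moons: non-spherical clusters
X_moons, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
labels = DBSCAN(eps=0.2, min_samples=5).fit(X_moons).labels_
print(np.unique(labels))  # expect [0 1]; any -1s would mark noise points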

pip install hdbscan

Import libraries and Load the data

from collections import defaultdict
from ipywidgets import interactive
import hdbscan
import folium
import re
import matplotlib
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import pandas as pd
import numpy as np
from tqdm import tqdm
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier

Specify the colour palette (cols) —

cols = ['#e6194b', '#3cb44b', '#ffe119', '#4363d8', '#f58231', '#911eb4','#46f0f0', '#f032e6', '#bcf60c', '#fabebe', '#008080', '#e6beff', '#9a6324', '#fffac8', '#800000', '#aaffc3', '#808000', '#ffd8b1', '#000075', '#808080']*10
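
The palette is repeated ten times so that cluster indices larger than 19, which appear once we fit models with many clusters, still map to a colour.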

Perform Exploratory Data Analysis (EDA)

df = pd.read_csv('Path to data file')
df.head()

Output —

df.duplicated(subset=['LON', 'LAT']).values.any()

Output —

True

To find nulls

df.isna().values.any()

Output —

True

Drop the nulls and duplicates

print(f'Before (Nulls and Duplicates) \t:\tdf.shape = {df.shape}')
df.dropna(inplace=True)
df.drop_duplicates(subset=['LON','LAT'],keep ='first', inplace=True)
print(f'After (Nulls and Duplicates) \t:\tdf.shape = {df.shape}')

Output —

Before (Nulls and Duplicates) : df.shape = (838, 3)
After (Nulls and Duplicates)  : df.shape = (823, 3)

After dropping all the nulls and duplicates

df.head()

Plot the points —

X = np.array(df[['LON', 'LAT']], dtype='float64')
plt.scatter(X[:, 0], X[:, 1], alpha=0.2, s=50)

Visualize Geographical Data Using Folium

Folium

Folium makes it easy to visualize data that’s been manipulated in Python on an interactive leaflet map. It enables both the binding of data to a map for choropleth visualizations as well as passing rich vector/raster/HTML visualizations as markers on the map.

The library has a number of built-in tilesets from OpenStreetMap, Mapbox, and Stamen, and supports custom tilesets with Mapbox or Cloudmade API keys. Folium supports Image, Video, GeoJSON, and TopoJSON overlays.

$ pip install folium

m = folium.Map(location=[df.LAT.mean(), df.LON.mean()], zoom_start=9,
               tiles='Stamen Toner')
for _, row in df.iterrows():
    folium.CircleMarker(
        location=[row.LAT, row.LON],
        radius=5,
        # strip non-letter characters from the name for the popup
        popup=re.sub(r'[^a-zA-Z]+', '', row.NAME),
        color='#1787FE',
        fill=True,
        fill_color='#1787FE').add_to(m)
m

Output —

Clustering Strength

X_blobs, _ = make_blobs(n_samples=1000, centers=10,
                        n_features=2, cluster_std=0.5, random_state=4)
plt.scatter(X_blobs[:, 0], X_blobs[:, 1], alpha=0.2)

Output —

class_predictions = np.load('/Users/priyeshkucchu/Desktop/sample_clusters.npy')
unique_clusters = np.unique(class_predictions)
for unique_cluster in unique_clusters:
    X = X_blobs[class_predictions == unique_cluster]
    plt.scatter(X[:, 0], X[:, 1], alpha=0.2, c=cols[unique_cluster])

Output —

silhouette_score(X_blobs,class_predictions)

Output —

0.6657220862867241

class_predictions = np.load('Data/sample_clusters_improved.npy')
unique_clusters = np.unique(class_predictions)
for unique_cluster in unique_clusters:
    X = X_blobs[class_predictions == unique_cluster]
    plt.scatter(X[:, 0], X[:, 1], alpha=0.2, c=cols[unique_cluster])

Output —

silhouette_score(X_blobs,class_predictions)

Output —

0.7473587799908298
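
A quick note on what this metric means: for each point i, a(i) is the mean distance to the other points in its own cluster and b(i) is the mean distance to the points of the nearest other cluster, giving s(i) = (b(i) - a(i)) / max(a(i), b(i)). The reported score is the mean of s(i) over all points and ranges from -1 to 1, so the improved labels (0.75 vs 0.67) fit their clusters noticeably better.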

K-Means Clustering

X_blobs, _ = make_blobs(n_samples=1000, centers=50,
                        n_features=2, cluster_std=1, random_state=4)
data = defaultdict(dict)
for x in range(1, 21):
    model = KMeans(n_clusters=3, random_state=17,
                   max_iter=x, n_init=1).fit(X_blobs)
    data[x]['class_predictions'] = model.predict(X_blobs)
    data[x]['centroids'] = model.cluster_centers_
    data[x]['unique_classes'] = np.unique(data[x]['class_predictions'])

def f(x):
    class_predictions = data[x]['class_predictions']
    centroids = data[x]['centroids']
    unique_classes = data[x]['unique_classes']
    for unique_class in unique_classes:
        plt.scatter(X_blobs[class_predictions==unique_class][:,0],
                    X_blobs[class_predictions==unique_class][:,1],
                    alpha=0.3, c=cols[unique_class])
    plt.scatter(centroids[:,0], centroids[:,1], s=200, c='#000000', marker='v')
    plt.ylim([-15,15]); plt.xlim([-15,15])
    plt.title('How K-Means Clusters')

interactive_plot = interactive(f, x=(1, 20))
output = interactive_plot.children[-1]
output.layout.height = '350px'
interactive_plot
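
Dragging the slider re-runs K-Means with max_iter from 1 to 20, so you can watch the centroids (the black triangles) move and the partition settle. Now apply K-Means to the actual taxi coordinates with K = 70 —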
X = np.array(df[['LON', 'LAT']], dtype='float64')
k = 70
model = KMeans(n_clusters=k, random_state=17).fit(X)
class_predictions = model.predict(X)
df[f'CLUSTER_kmeans{k}'] = class_predictions
df.head()

Output —

def create_map(df, cluster_column):
    m = folium.Map(location=[df.LAT.mean(), df.LON.mean()],
                   zoom_start=9, tiles='Stamen Toner')
    for _, row in df.iterrows():
        # pick a colour: black for outliers (cluster -1)
        if row[cluster_column] == -1:
            cluster_colour = '#000000'
        else:
            cluster_colour = cols[row[cluster_column]]
        folium.CircleMarker(
            location=[row['LAT'], row['LON']],
            radius=5,
            popup=row[cluster_column],
            color=cluster_colour,
            fill=True,
            fill_color=cluster_colour
        ).add_to(m)
    return m

m = create_map(df, 'CLUSTER_kmeans70')
print(f'K={k}')
print(f'Silhouette Score: {silhouette_score(X, class_predictions)}')

Output —

K=70
Silhouette Score: 0.6367300948961482

Display the map

m
best_silhouette, best_k = -1, 0
for k in tqdm(range(2, 100)):
    model = KMeans(n_clusters=k, random_state=1).fit(X)
    class_predictions = model.predict(X)

    curr_silhouette = silhouette_score(X, class_predictions)
    if curr_silhouette > best_silhouette:
        best_k = k
        best_silhouette = curr_silhouette

print(f'K={best_k}')
print(f'Silhouette Score: {best_silhouette}')

Output —

100%|██████████| 98/98 [00:34<00:00,  2.83it/s]
K=98
Silhouette Score: 0.6971995093340411

DBSCAN

DBSCAN uses two parameters:

  • Epsilon — eps (ε): the distance measure used to locate the points in the neighborhood of any point
  • min_samples: the minimum number of samples/points clustered together for a region to be considered dense
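
DBSCAN labels every outlier -1. Lumping all outliers into a single catch-all cluster would distort the silhouette score, so the toy snippet below relabels each -1 with its own unique negative ID, turning every outlier into a singleton cluster (the same trick is reused on the real predictions further down):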
dummy = np.array([-1, -1, -1, 2, 3, 4, 5])
# each -1 becomes a unique negative ID (-2, -3, -4, ...); other labels are kept
new = np.array([(counter + 2) * x if x == -1 else x
                for counter, x in enumerate(dummy)])

Fit DBSCAN to the taxi coordinates —

model = DBSCAN(eps=0.01, min_samples=5).fit(X)
class_predictions = model.labels_
df['CLUSTER_DBSCAN'] = class_predictions

Create Map

m = create_map(df, 'CLUSTER_DBSCAN')
print(f'Number of clusters found: {len(np.unique(class_predictions))}')
print(f'Number of outliers found: {len(class_predictions[class_predictions==-1])}')
print(f'Silhouette ignoring outliers: {silhouette_score(X[class_predictions!=-1], class_predictions[class_predictions!=-1])}')

# treat each outlier as its own singleton cluster (the trick shown above)
no_outliers = np.array([(counter + 2) * x if x == -1 else x
                        for counter, x in enumerate(class_predictions)])
print(f'Silhouette outliers as singletons: {silhouette_score(X, no_outliers)}')

Output —

Number of clusters found: 51
Number of outliers found: 289
Silhouette ignoring outliers: 0.9232138250288208
Silhouette outliers as singletons: 0.5667489350583482
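
The clusters themselves are very tight once the 289 outliers are ignored (0.92), but treating each outlier as a singleton drags the score down to 0.57: roughly a third of the 823 points are left unclustered.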

Display the map

m

HDBSCAN

Hierarchical Density-Based Spatial Clustering of Applications with Noise comes equipped with visualization tools to help you understand your clustering results.

model = hdbscan.HDBSCAN(min_cluster_size=5, min_samples=2,
                        cluster_selection_epsilon=0.01)
class_predictions = model.fit_predict(X)
df['CLUSTER_HDBSCAN'] = class_predictions
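
For example, one of those visualization tools is the condensed cluster tree (a minimal sketch, assuming the model fit above):

# plot the condensed cluster tree; selected clusters are circled
model.condensed_tree_.plot(select_clusters=True)
plt.show()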

Create the map

m = create_map(df, 'CLUSTER_HDBSCAN')
print(f'Number of clusters found: {len(np.unique(class_predictions))-1}')
print(f'Number of outliers found: {len(class_predictions[class_predictions==-1])}')
print(f'Silhouette ignoring outliers: {silhouette_score(X[class_predictions!=-1], class_predictions[class_predictions!=-1])}')

no_outliers = np.array([(counter + 2) * x if x == -1 else x
                        for counter, x in enumerate(class_predictions)])
print(f'Silhouette outliers as singletons: {silhouette_score(X, no_outliers)}')
m

Output —

Number of clusters found: 66
Number of outliers found: 102
Silhouette ignoring outliers: 0.7670504356844786
Silhouette outliers as singletons: 0.638992483305273
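
HDBSCAN leaves far fewer points unclustered than DBSCAN did (102 outliers vs 289), and its silhouette with outliers as singletons improves accordingly (0.64 vs 0.57).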

Addressing Outliers
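
HDBSCAN still marks 102 points as noise. To place them somewhere sensible, train a 1-nearest-neighbour classifier on the clustered points and let it assign each outlier to the cluster of its closest clustered neighbour —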

classifier = KNeighborsClassifier(n_neighbors=1)
df_train = df[df.CLUSTER_HDBSCAN != -1]
df_predict = df[df.CLUSTER_HDBSCAN == -1]
X_train = np.array(df_train[['LON', 'LAT']], dtype='float64')
y_train = np.array(df_train['CLUSTER_HDBSCAN'])
X_predict = np.array(df_predict[['LON', 'LAT']], dtype='float64')
classifier.fit(X_train, y_train)
predictions = classifier.predict(X_predict)

df['CLUSTER_hybrid'] = df['CLUSTER_HDBSCAN']
df.loc[df.CLUSTER_HDBSCAN == -1, 'CLUSTER_hybrid'] = predictions
m = create_map(df, 'CLUSTER_hybrid')
m

class_predictions = df.CLUSTER_hybrid
print(f'Number of clusters found: {len(np.unique(class_predictions))}')
print(f'Silhouette: {silhouette_score(X, class_predictions)}')
m.save('hybrid.html')

Output —

Number of clusters found: 66
Silhouette: 0.5849126494706486

Plot KMeans vs Hybrid Clusters —

df['CLUSTER_hybrid'].value_counts().plot.hist(bins=70,
                                              alpha=0.4, label='Hybrid')
df['CLUSTER_kmeans70'].value_counts().plot.hist(bins=70,
                                                alpha=0.4, label='K-Means (70)')
plt.legend()
plt.xlabel('Cluster Sizes')
