Top 3 Python Packages for Outlier Detection


1. PyOD

PyOD (Python Outlier Detection) is a Python toolkit for detecting outliers in data. The package boasts 30 outlier detection algorithms, ranging from classic methods to the latest ones, which suggests that it is well maintained. Examples of the available outlier detection models include:

  • Angle-Based Outlier Detection
  • Cluster-Based Local Outlier Factor
  • Principal Component Analysis Outlier Detection
  • Variational Auto Encoder

and many more. If you are interested in seeing all the available methods, see the PyOD documentation.

PyOD makes outlier detection simple and intuitive: it takes only a few lines of code to flag outliers. As with ordinary model training, you fit a detector (PyOD calls it a classifier) on the data and then use the fitted model to predict which points are outliers. Let’s try the package with code examples. First, we need to install the package.

pip install pyod

After installing the package, let’s load a sample dataset. I will use the tips data from the seaborn package.

import seaborn as sns
import pandas as pd
df = sns.load_dataset('tips')
df.head()

Let’s say we want to find multivariate outliers across total_bill and tip. A scatter plot of these two features gives a first sense of how the data is spread.

sns.scatterplot(data = df, x = 'total_bill', y = 'tip')

In the plot above, a few points sit in the top right corner, far from the bulk of the data, which suggests they are outliers. But where is the limit if we want to classify the data into inliers and outliers? We can use PyOD to help us do the job in this case.

For our example, I will use only two methods: Angle-Based Outlier Detection (ABOD) and Cluster-Based Local Outlier Factor (CBLOF).

from pyod.models.abod import ABOD
from pyod.models.cblof import CBLOF

Let’s start with the ABOD model. We need to set the contamination parameter, which is the fraction of the data we expect to be outliers. If I set the contamination to 0.05, the model flags roughly 5% of the data as outliers. Let’s try it in code.
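To make the contamination idea concrete, here is a rough plain-Python sketch of what "flag the top 5%" means once every point has an outlier score. The function name `flag_top_fraction` and the scores are hypothetical, made up for illustration; this is not PyOD's own code, only the general idea of turning a fraction into a cutoff.

```python
def flag_top_fraction(scores, contamination=0.05):
    """Label the highest-scoring fraction of points as outliers (1), the rest as inliers (0)."""
    # Number of points to flag, with at least one so the cutoff is defined.
    n_outliers = max(1, round(contamination * len(scores)))
    # The cutoff is the n_outliers-th largest score.
    cutoff = sorted(scores, reverse=True)[n_outliers - 1]
    return [1 if s >= cutoff else 0 for s in scores]


# 20 hypothetical outlier scores; higher means more anomalous.
scores = [0.1 * i for i in range(19)] + [9.0]
labels = flag_top_fraction(scores, contamination=0.05)
print(sum(labels), "of", len(labels), "points flagged")
```

A larger contamination value lowers the cutoff, so more points are flagged; the value is a choice you make up front rather than something the model discovers.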

abod_clf = ABOD(contamination=0.05)
abod_clf.fit(df[['total_bill', 'tip']])

We fit the model on the data in which we want to detect outliers. As with a scikit-learn style classifier, we can then read the scores and labels from the fitted model and use it to predict.

# Labels for the training data: 0 = inlier, 1 = outlier
abod_clf.labels_

You could also access the decision score or the probability, but let’s move on with the other model and compare the result.

# random_state is fixed to an arbitrary seed so the clustering is reproducible
cblof_clf = CBLOF(contamination=0.05, check_estimator=False, random_state=42)
cblof_clf.fit(df[['total_bill', 'tip']])
df['ABOD_Clf'] = abod_clf.labels_
df['CBLOF_Clf'] = cblof_clf.labels_

We store the results in the data frame so we can compare the two detection algorithms.

sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'ABOD_Clf')
ABOD Outlier (Image by Author)

From the ABOD result, we can see that the points farthest from the center of the data are considered outliers. Let’s look at the CBLOF model.

sns.scatterplot(data = df, x = 'total_bill', y = 'tip', hue = 'CBLOF_Clf')
CBLOF Outlier (Image by Author)

Unlike ABOD, the CBLOF algorithm flags outliers mainly on one side of the data (the right side). You could try other algorithms to detect outliers in the data if you want.
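Beyond comparing the two plots by eye, one way to put a number on how much the detectors agree is a cross-tabulation of the two label columns. The sketch below uses a tiny hand-written set of labels (hypothetical values, not real model output) in columns named like the ones above.

```python
import pandas as pd

# Hypothetical labels for eight rows (1 = outlier, 0 = inlier); not real model output.
labels = pd.DataFrame({
    "ABOD_Clf":  [0, 0, 1, 0, 1, 0, 0, 1],
    "CBLOF_Clf": [0, 0, 1, 0, 0, 0, 1, 1],
})

# Rows are ABOD labels, columns are CBLOF labels, cells are row counts.
agreement = pd.crosstab(labels["ABOD_Clf"], labels["CBLOF_Clf"])
print(agreement)

# Rows that both detectors flag as outliers.
both = int(((labels["ABOD_Clf"] == 1) & (labels["CBLOF_Clf"] == 1)).sum())
print("Flagged by both:", both)
```

On the real data, the same call would be `pd.crosstab(df['ABOD_Clf'], df['CBLOF_Clf'])`; the off-diagonal cells count the rows on which the two algorithms disagree.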
