Tricky Way of Using Dimensionality Reduction For Outlier Detection in Python


Setting a baseline with pure Isolation Forest

Before we move on, let’s establish a baseline performance. First, we will fit a CatBoostClassifier for a benchmark:

We got a ROC AUC score of 0.784. Now, let’s fit an Isolation Forest estimator to the data after imputing the missing values:

Although powerful, Isolation Forest has only a few parameters to tune. The most important one is n_estimators, which controls the number of trees to build. We set it to 3000, considering the dataset size.
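The fitting step described above could look roughly like the sketch below, using scikit-learn's `SimpleImputer` and `IsolationForest`. The median imputation strategy, the helper name `flag_outliers` and the synthetic data with a few planted extreme rows are all assumptions for illustration. The demo call uses 200 trees rather than 3000 so that it finishes quickly.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.impute import SimpleImputer


def flag_outliers(X, n_estimators=3000, seed=0):
    """Return a boolean mask that is True for rows labeled as outliers."""
    # Impute missing values before fitting (median is an assumed strategy).
    X_imputed = SimpleImputer(strategy="median").fit_transform(X)
    forest = IsolationForest(
        n_estimators=n_estimators, n_jobs=-1, random_state=seed
    )
    # fit_predict returns -1 for outliers and 1 for inliers.
    labels = forest.fit_predict(X_imputed)
    return labels == -1


# Synthetic stand-in data: Gaussian noise with five shifted, extreme rows.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
X[:5] += 8.0
# Knock out a few values to mimic missing data.
X[rng.integers(5, 500, size=20), rng.integers(0, 5, size=20)] = np.nan

mask = flag_outliers(X, n_estimators=200)
print(f"Outliers found: {mask.sum()} of {len(mask)} rows")
```

Returning a boolean mask rather than the raw -1/1 labels makes it easy to filter the training rows afterwards.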

After waiting for about 50 minutes, we discover that Isolation Forest found 2270 outliers in the data. Let’s drop those rows from the training data and fit a classifier again:

We got a slight drop in performance, but that is no reason to stop looking for outliers. At this point, we don’t really know whether these ~2200 data points are all the outliers in the data. We also don’t know whether we built Isolation Forest with enough trees. Worst of all, we can’t easily repeat the experiment because it is too time-consuming.

That’s why we will start playing smart.


