Top 3 Ways Your Anomaly Detection Models Can Earn Your Trust


Connecting anomaly detection model results back to the original signals

Photo by Zdeněk Macháček on Unsplash

In a previous article, I presented a few methods for extracting richer insights from anomaly detection models by post-processing their raw results:

After reading that article and putting it into practice, you may have reservations about how trustworthy these insights are:

“I don’t know if I can trust these insights. Why is this model saying this? Show me the data!”

In some cases, the proof will be in the pudding: you will have to deploy the model on live data and investigate the flagged anomalies for a few weeks. This will enable you to perform regular reality checks and to compare the events flagged by the model with the knowledge from your users in the field. This collaboration will gradually build and reinforce trust in your model.

Some time ago I posted a short presentation on LinkedIn (check this post if you want to get a primer of this article), where I explained the process I follow when trying to tie my anomaly detection model results back to the input time series.

I encourage you to follow along with this blog post by heading over to GitHub to grab the companion Jupyter notebooks. You can use your usual Jupyter environment or fire up one with Amazon SageMaker. After you have cloned the repo and run the first four notebooks (from data generation through model evaluation), you can open the last one (synthetic_4_results_deep_dive.ipynb) and follow along with this article.

Dataset overview

In this article, I am still using the artificial dataset I generated with some synthetic anomalies. If you want to know more about this dataset, head over to the dataset overview section of my previous article. In a nutshell, this is a 1-year-long dataset with 20 time series signals and a regular sampling rate of 10 minutes. When visualizing this data, you will identify some failure times (the red dots below) and some recovery periods (in yellow below):

Synthetic data time series overview (image by author)
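To make the dataset's shape concrete, here is a minimal sketch of a dataset with the same dimensions, mocked up with pandas. This is purely illustrative and not the repo's actual data generator: the column names, value ranges, and random noise are all assumptions.

```python
import numpy as np
import pandas as pd

# One year of readings sampled every 10 minutes, for 20 signals.
# Illustrative mock-up only; the real repo generates its own synthetic data.
index = pd.date_range("2021-01-01", "2021-12-31 23:50", freq="10min")
rng = np.random.default_rng(42)
df = pd.DataFrame(
    {f"signal_{i:02d}": rng.normal(loc=1385.0, scale=20.0, size=len(index))
     for i in range(20)},
    index=index,
)

print(df.shape)  # (52560, 20): 6 samples/hour * 24 hours * 365 days
```

With 20 columns and roughly 52,000 rows, you can already see why plotting everything at once quickly becomes unreadable.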

In the previous article, we also trained an anomaly detection model using Amazon Lookout for Equipment (a managed service running in the AWS cloud). If you want to dive deeper into this service (even if you’re not a developer), I dedicated 6 chapters of my Time series on AWS book to it:

Going back to the time series

I will assume that you have run the first 4 notebooks to the end and that you have a trained model and your first visualizations to understand its results. To better understand what is happening, you may want to go back to the original time series. Even in a production environment where such a model is already deployed, regular error analysis may be an activity you need to conduct with your favorite subject matter experts. After all, these AI models are not replacing your experts and operators: they are merely augmenting them by delivering faster insights.

Time series visualization

Visualizing all your time series (even if there are only 20 of them) may already be challenging. Here is an example of a dataset I encountered recently, with nearly 100 time series plotted together:

Multivariate dataset: time series overview (image by author)

Quite the haystack, isn’t it? If we assemble the same plot for our simpler 20-sensor synthetic dataset, it’s easier to see what is happening:

Synthetic multivariate dataset overview (image by author)

However, counting on such simplicity in a real-life situation may not be enough… The first thing you could do is highlight the time ranges where your model detected events:

Highlighting detected anomalous ranges (image by author)
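Highlighting these ranges is straightforward with matplotlib's `axvspan`. The sketch below assumes you already have a signal and a list of `(start, end)` timestamps for the detected events; the variable names and synthetic data are illustrative, not the notebook's actual code.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs outside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical inputs: one sensor signal and the anomalous ranges your
# model flagged, as (start, end) timestamp pairs.
index = pd.date_range("2021-01-01", periods=1000, freq="10min")
rng = np.random.default_rng(0)
signal = pd.Series(rng.normal(1385.0, 20.0, size=1000), index=index)
anomalous_ranges = [(index[300], index[350]), (index[700], index[760])]

fig, ax = plt.subplots(figsize=(12, 3))
ax.plot(signal.index, signal.values, linewidth=0.7)
for start, end in anomalous_ranges:
    # Shade each detected event in translucent red over the signal
    ax.axvspan(start, end, color="tab:red", alpha=0.3)
fig.savefig("highlighted_events.png")
```

The same loop works whether you overlay the shaded spans on a single signal or on a grid of subplots, one per sensor.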

This is slightly better, especially if you have very few time series and the anomalies are easy to spot. However, if you read my previous article, you will remember that our anomaly detection model actually singled out a few sensors. Let’s plot the first one instead of displaying all of them in a single pane of glass:

Now this looks a bit more obvious (image by author)

The signal_07 input was one of the time series singled out by our anomaly detection model as a key contributor to these detected events. In the plot above, I highlighted the anomalous time ranges in red: now it’s a lot clearer why the first event was detected. The second event is more interesting: the actual failure is clearly visible, but the model detected something wrong almost two weeks before it happened.

“What happens when it’s less obvious and I don’t see anything wrong with my signals?”

This is one of the concerns you may have, especially when looking at the two-week period before the failure highlighted in the previous plot. It looks like the signal is slightly increasing before the failure, but it’s not that obvious. This is where using histograms to visualize the distribution of the values taken by your time series signal may come in handy. Let’s have a look at this…

Visualizing time series values distribution

Let’s focus on the first detected event above. I added a few utility functions in my notebooks to plot two superimposed histograms:

Values distribution for signal_07 around the first anomaly (image by author)

The blue histogram is the distribution of the values taken by signal_07 during the training range, while the red histogram highlights the values taken by the same signal during the anomaly. The change of behavior is much more obvious, and you can enrich these histograms with your domain expertise. For instance, you may know that in normal operating conditions, signal_07 ranges between 1320 and 1450. In this case, you may want to help your users by adding this information to your histogram:

Highlighting normal operation conditions (image by author)
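The superimposed histograms and the normal-range markers can be sketched as follows. The values below are synthetic stand-ins for signal_07's training and anomaly distributions (the means and spreads are assumptions); only the 1320–1450 normal operating range comes from the article.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np

# Illustrative data standing in for signal_07: values from the training
# range versus values captured during the detected anomaly.
rng = np.random.default_rng(1)
train_values = rng.normal(1385.0, 20.0, size=5000)   # normal behavior
anomaly_values = rng.normal(1250.0, 30.0, size=500)  # shifted distribution

fig, ax = plt.subplots(figsize=(8, 4))
bins = np.linspace(1100, 1500, 60)
ax.hist(train_values, bins=bins, alpha=0.5, density=True,
        color="tab:blue", label="Training range")
ax.hist(anomaly_values, bins=bins, alpha=0.5, density=True,
        color="tab:red", label="Detected anomaly")
# Domain knowledge: normal operating conditions sit between 1320 and 1450
ax.axvline(1320, color="green", linestyle="--", label="Normal range")
ax.axvline(1450, color="green", linestyle="--")
ax.legend()
fig.savefig("signal_07_histograms.png")
```

Using `density=True` puts both histograms on the same scale even when the training range contains far more samples than the anomaly, which is usually the case.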

Let’s have a look at the histogram for the second anomaly:

Values distribution for signal_07 around the second anomaly (image by author)

You can see here the part where the failure happened (the bar around 0). Let’s zoom in on the second part to see what’s happening before the failure:

Values distribution for signal_07 before the failure (image by author)

Even though the difference was very slight on the time series plot, the distribution shift is much more obvious here.
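One way to produce that zoomed view is to mask out the failure values (the bar around 0) before histogramming, so the subtle pre-failure drift is no longer dwarfed by the failure itself. The sketch below uses made-up numbers for signal_07; the threshold and distributions are assumptions for illustration.

```python
import numpy as np

# Hypothetical signal_07 values around the second event: a slight upward
# drift before the failure, then the failure itself (readings collapse to ~0).
rng = np.random.default_rng(2)
pre_failure = rng.normal(1395.0, 20.0, size=2000)  # drifted slightly upward
failure = rng.normal(0.0, 1.0, size=200)           # sensor reads near zero
values = np.concatenate([pre_failure, failure])

# Drop the failure readings so the histogram zooms in on the drift alone.
# The 100 threshold is arbitrary: anything well below normal operation works.
drift_only = values[values > 100]
print(len(drift_only), round(drift_only.mean(), 1))
```

Plotting `drift_only` with the same bins as the training histogram makes the shift stand out, even when it was barely visible on the raw time series.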

Conclusion

In this article, you learned how to tie your anomaly detection model results back to the original time series. This is very valuable for building trust in your machine learning system and a great way to collaborate with domain experts and improve your anomaly management process further down the road. Basically, anomaly detection model outputs help you focus your investigation, while appropriate visualizations of your original data help you pinpoint the why!

In future articles I will dive into how you can compute distances between histograms and use this as a proxy for measuring feature importance for anomaly detection models.

I hope you found this article insightful: feel free to leave me a comment here and don’t hesitate to subscribe to my Medium email feed if you don’t want to miss my upcoming posts! Want to support me and future work? Join Medium with my referral link:
