All multivariate time series are not born equal



At each process step, the production line collects many parameters:

  • Some time series come from sensors attached to each piece of equipment: they are used to monitor the equipment itself. Others might be linked to environmental conditions when these can impact the manufacturing process (e.g. atmospheric humidity or external temperature).
  • Other time series come from monitoring the process itself: for instance, the temperature of the paper pulp at the beginning of the process above can be measured and monitored. This is called a process variable.
  • Each process variable is associated with a desired set point: at any given point of the process, the line operators apply a certain recipe to ensure the quality of the finished goods matches the desired requirements (e.g. the desired temperature of the paper pulp).
  • In addition, other data are needed to fully define the process behavior and the characteristics of the finished goods: raw material characteristics or supplier, quality inspection or lab results obtained at different quality gates of the process…

A paper mill can collect several thousand time series at a high sampling rate (sub-second level). To ensure the highest possible throughput, the production lines must have as high a utilization rate as possible.

Data are continuously recorded across several batches of product (hence the “discrete” qualifier used in the section title above) and each batch is more or less independent from the others: a clean-in-place process might be needed between batches, the equipment or consumables have accumulated more operating hours, some parts might have worn down a little more, some maintenance operations might have taken place… The context of a given batch can never be exactly reset to match the context of the previous one.

Other examples of processes that produce similar multivariate time series include:

  • Different aircraft flights with different logged actions, different pilots, varying aircraft make and model…
  • ECGs from different patients, whether or not they belong to the same cohort
  • Signals associated with successive cars assembled in an automotive factory

To perform forecasting or anomaly detection on these multivariate time series, you need a model that will try to find temporal correlations between all these signals. However, compared to the previous case (multiple independent univariate signals), correlation won’t be enough, as the model also has to try and learn the nonlinear physical relationships these different signals may have in common.

This is true both at the modelling phase and at the exploration phase. One of the key challenges is to accurately slice your time series to obtain the right sequences:

  • For a manufacturing production line, you can look for a discrete signal that could be used to record production batch numbers (a discrete signal with a continuously increasing value might be a good hint).
  • A rotating machine can have a signal with a unit of measure in RPM: when this value is 0 or close enough to it, you can consider these periods as the “off-state” of the piece of equipment. You can also cluster the values of such a signal to find different operating conditions.
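
The two slicing heuristics above can be sketched with pandas as follows. This is only a minimal illustration on toy data: the column names `batch_id` and `speed_rpm`, the helper names and the 5 RPM threshold are all assumptions made for the example, not values taken from a real production line.

```python
import pandas as pd


def split_by_batch(df, batch_col):
    """Return one sub-DataFrame per distinct value of a batch counter signal."""
    return {batch: chunk for batch, chunk in df.groupby(batch_col, sort=True)}


def running_mask(speed, off_threshold=5.0):
    """Flag samples where the machine is considered running (|speed| above threshold)."""
    return speed.abs() > off_threshold


# Toy data: a batch counter that only increases, and a rotation speed in RPM.
index = pd.date_range("2021-01-01", periods=8, freq=pd.Timedelta(seconds=1))
data = pd.DataFrame(
    {
        "batch_id": [1, 1, 1, 2, 2, 3, 3, 3],
        "speed_rpm": [0.0, 1200.0, 1210.0, 0.5, 1190.0, 1205.0, 0.0, 0.0],
    },
    index=index,
)

batches = split_by_batch(data, "batch_id")          # one sequence per batch
data["running"] = running_mask(data["speed_rpm"])   # False marks the off-state
```

In practice the threshold has to be tuned per machine, since a stopped sensor rarely reads exactly 0.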

Another challenge is the presence of discrete signals (as opposed to analog ones) that only take a limited set of values: they could be Boolean (0 or 1 given the state of a machine) or integer values (each value representing a given state of an actuator or parameter setpoints).
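
One simple way to spot such signals, sketched below, is to count the distinct values each signal takes. The cut-off of 10 levels and the signal names are arbitrary assumptions for illustration; a slowly varying analog signal sampled over a short window could be misclassified by this heuristic.

```python
import pandas as pd


def split_discrete_analog(df, max_levels=10):
    """Split column names into (discrete, analog) based on distinct value counts."""
    n_levels = df.nunique(dropna=True)
    discrete = n_levels[n_levels <= max_levels].index.tolist()
    analog = n_levels[n_levels > max_levels].index.tolist()
    return discrete, analog


# Toy data: a Boolean signal, an integer state signal and an analog signal.
signals = pd.DataFrame(
    {
        "pump_on": [0, 1, 1, 0, 1, 1] * 5,
        "valve_state": [0, 2, 2, 1, 3, 0] * 5,
        "pulp_temperature": [60.0 + 0.1 * i for i in range(30)],
    }
)
discrete, analog = split_discrete_analog(signals)
```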

You will also come upon other challenges linked to the way the data are stored and exported from local collection systems:

  • Proprietary binary file formats (ibaPDA is a well-known data acquisition system that generates highly compressed binary files when exported: you will then need a specific wrapper around a Windows DLL to read your extracted data).
  • For space-saving purposes, only changing values are stored for each time series. Once uncompressed to align all the signals, and combined with binary formats as just mentioned, you sometimes need to think about how to deal with the massive amount of data generated. I once had 1 TB of data collected over a year that yielded 30 TB of raw data on disk: at this point, Pandas stops being your best friend…
  • Given that only changing values are stored in most industrial data acquisition systems (like ibaPDA or any historian like GE Proficy Historian or OSIsoft PI System), the extraction function usually defaults to automatically resampling your data before the export. If you want to analyze the quality of your PID controllers, or if you know how fast the phenomenon you want to capture is, you’ll need to make sure to extract the actual raw data as they were recorded (otherwise, you risk smoothing out precisely what you’re trying to uncover).
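
To make the "only changing values are stored" point concrete, here is a minimal pandas sketch of the alignment step: each signal is held at its last known value and all signals are projected onto a common regular grid. The function name and the toy signals are hypothetical, and this is not how any particular historian performs its export. It also shows why the data volume grows: every signal gets one value per grid point, whether it changed or not.

```python
import pandas as pd


def align_change_only_signals(signals, step):
    """Align change-only series on a regular grid, holding the last known value."""
    # Outer-join all signals on the union of their timestamps.
    combined = pd.concat(signals, axis=1).sort_index().ffill()
    grid = pd.date_range(combined.index.min(), combined.index.max(), freq=step)
    # Add the grid points, carry the last value forward, keep only the grid.
    return combined.reindex(combined.index.union(grid)).ffill().reindex(grid)


# Toy signals recorded only when their value changed.
temperature = pd.Series(
    [60.0, 62.5],
    index=pd.to_datetime(["2021-01-01 00:00:00", "2021-01-01 00:00:03"]),
)
valve = pd.Series(
    [0, 1],
    index=pd.to_datetime(["2021-01-01 00:00:01", "2021-01-01 00:00:02"]),
)

aligned = align_change_only_signals(
    {"temperature": temperature, "valve": valve}, step=pd.Timedelta(seconds=1)
)
```

The first `valve` value stays missing because nothing was recorded for it before the grid starts; how to fill such leading gaps is a decision to make per signal.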

Multivariate time series exploration challenges

Depending on the family of multivariate time series you wish to analyze, you will have to devise a very different strategy.

Multiple independent signals

Most of the time, the battery of packages, approaches, books and models applicable to univariate time series can be applied in turn to each signal relevant to your problem.

  • With a low number of time series signals, you can perform covariance and cointegration analysis to understand the potential relationships between each signal.
  • If you have a high number of signals at your disposal, you might also be able to extract new features from each signal and perform some unsupervised analysis (e.g. applying time series feature engineering techniques and using clustering to identify families of time series).
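
For a handful of signals, a first pass at the covariance analysis mentioned above could look like the sketch below. It correlates first differences rather than raw levels, because two unrelated trending signals can show a spuriously high correlation. The 0.8 threshold, the function name and the synthetic sensors are assumptions for illustration; a proper cointegration test (for instance the one provided by statsmodels) would be a separate step not shown here.

```python
import numpy as np
import pandas as pd


def correlated_pairs(df, threshold=0.8):
    """List signal pairs whose first differences are strongly correlated."""
    corr = df.diff().dropna().corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    return pairs


# Synthetic data: sensor_b follows sensor_a, sensor_c is an unrelated random walk.
rng = np.random.default_rng(0)
base = rng.normal(size=500).cumsum()
readings = pd.DataFrame(
    {
        "sensor_a": base,
        "sensor_b": 2.0 * base + rng.normal(scale=0.1, size=500),
        "sensor_c": rng.normal(size=500).cumsum(),
    }
)
pairs = correlated_pairs(readings)
```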

From a visualization and mining perspective, you might find some good insights with density line charts or multiscale temporal pattern mining with Pinus views. In addition, leveraging Markov transition fields and network graphs, as I described in a previous article, is a great way to objectively understand and classify time series behavior.
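
For readers unfamiliar with Markov transition fields, the following NumPy sketch shows the usual construction: discretize the signal into quantile bins, estimate the first-order transition probabilities between bins, then build an image where pixel (i, j) holds the probability of moving from the bin of sample i to the bin of sample j. This is a simplified illustration written for this post, with an arbitrary choice of 4 bins; it is not the implementation used by the tsia or pyts packages.

```python
import numpy as np


def markov_transition_field(x, n_bins=4):
    """Compute a Markov transition field for a 1-D signal using quantile bins."""
    x = np.asarray(x, dtype=float)
    # Interior quantile edges split the values into n_bins states (0..n_bins-1).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    states = np.digitize(x, edges)
    # Count transitions between consecutive samples.
    counts = np.zeros((n_bins, n_bins))
    np.add.at(counts, (states[:-1], states[1:]), 1)
    # Row-normalize into transition probabilities (rows with no data stay at 0).
    row_sums = counts.sum(axis=1, keepdims=True)
    transitions = np.divide(
        counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0
    )
    # Pixel (i, j) is the transition probability from state of x[i] to state of x[j].
    return transitions[states[:, None], states[None, :]]


signal = np.sin(np.linspace(0, 4 * np.pi, 64))
mtf = markov_transition_field(signal)
```

The resulting square array can be displayed as an image (for example with matplotlib's `imshow`) to compare the dynamics of different signals visually.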

Multivariate timeseries event

Whether your time series data are used to qualify continuous or discrete processes, some exploration challenges will be quite common:

  • The ability to plot a high number of time series while zooming on different time ranges (for continuous data available over several years), or to highlight different sequences and the links between them (for discrete processes like flight tests or production batches). Using Bokeh, Highcharts or Plotly in a Jupyter Notebook will only get you so far without a real-time ingestion pipeline that lets you explore your data without latency.
  • Build an understanding of how to group different signals together for exploration purposes (analog vs. digital signals, clustering based on extracted features including the ones mentioned above).
  • Labelled industrial data useful for anomaly prediction or predictive maintenance are very rare: some events are very difficult to catch and document in real time (if you have a production incident, you strive to solve it first, before recording the precise timestamp at which it happened…). Being able to label your time series not only for classification, but also to dig right into them and label the actual ranges where you see potentially anomalous behavior, is paramount to building high quality datasets that will later allow moving away from purely unsupervised approaches.
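
As an illustration of range-level labelling, the sketch below turns a list of annotated time ranges into a per-sample label series that can sit next to the signals. The function name, the `anomaly` column name and the example range are hypothetical; a real labelling tool would also need to store who labelled what and why.

```python
import pandas as pd


def ranges_to_labels(index, ranges):
    """Build a 0/1 label series from a list of (start, end) anomalous ranges."""
    labels = pd.Series(0, index=index, name="anomaly")
    for start, end in ranges:
        # Label-based slicing on a DatetimeIndex includes both endpoints.
        labels.loc[start:end] = 1
    return labels


index = pd.date_range("2021-01-01", periods=10, freq="min")
ranges = [(pd.Timestamp("2021-01-01 00:02"), pd.Timestamp("2021-01-01 00:04"))]
labels = ranges_to_labels(index, ranges)
```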

Conclusion and future work

This article focused on a few definitions pertaining to multivariate time series and pointed out some common challenges not encountered in a univariate context.

In the future, I’m interested in investigating how to build a process or a system that will allow comfortable exploration of high dimensional time series data: loading the right time range out of terabytes of data, visualizing hundreds of parallel time series, labelling and classifying them, regrouping them logically to uncover potential correlations… There is a lot to explore, and beginning with a repeatable time series profiling procedure looks like a good starting point.

Once a suitable exploration framework is available, I will move on to building high quality datasets and benchmark datasets that will allow comparing and building time series oriented model zoos for classification, outlier detection, anomaly prediction, pattern learning…

Bonus: a few Python packages of interest!

Beyond Pandas, you will find below some Python packages I like to work with when dealing with univariate or multivariate time series data:

  • tsfresh: this module can be used to extract characteristics from time series. It can compute up to 1,200 features for each time series provided.
  • tsfuse: this package automatically constructs features from multiple time series. Instead of extracting univariate time series features, TSFuse generates new series by applying time series fusion operations.
  • pyts: this is a Python package dedicated to time series classification. However, it also provides several preprocessing and utility tools for time series.
  • tsia: a simple package I started to put together to help uncover patterns in time series thanks to imaging techniques like network graphs and Markov transition fields. This package is still in its infancy: feel free to send some suggestions my way to enrich it.
  • fancyimpute: this package implements a variety of matrix completion and imputation algorithms (e.g. using KNN or matrix factorization).
