Stacking Machine Learning Models for Multivariate Time Series

Original Source Here

Wrangling through Dataland

Stacking Machine Learning Models for Multivariate Time Series

A case study on the PM 2.5 air pollution dataset

Beijing Smog by PublicDomainPictures from Pixabay

Time series analysis is all too often seen as an esoteric sub-field of data science. It is not. Other data science sub-fields have their idiosyncrasies (e.g. NLP, recommender systems, graph theory etc.), and it is the same with time series. Time series is idiosyncratic, not distinct.

If your goal is prediction, you likely do not need the classical econometric models — ARIMA, ARDL, VAR — and their assumptions, including stationarity, depending on your requirements, methodology and data. In fact, some eminent econometricians have long argued that stationarising multivariate time series data strips them of useful dynamic trends and relationships, and thus you might be discarding valuable information by doing so.

There is an issue of whether the variables in a VAR (Vector Auto-Regression) need to be stationary. Sims (1980) and Sims, Stock and Watson (1990) recommend against differencing even if the variables contain a unit root. They argued that the goal of a VAR analysis is to determine the inter-relationships among the variables, not to determine the parameter estimates. The main argument against differencing is that it “throws away” information concerning the co-movements in the data (such as the possibility of cointegrating relationships). Similarly, it is argued that the data need not be detrended. — Walter Enders, Applied Econometric Time Series, 3rd edition

Let me demonstrate how machine learning models are well-suited for time series forecasting, and I will make it more interesting by stacking an ensemble of machine learning models. You do have to adjust the cross-validation procedure to respect a time series’ temporal order, but the general methodology is the same. Again, this is if your goal is prediction, and you have zero interest in hypothesis testing and statistical inference.

The analogy from cross-sectional data analysis is obvious. If you wish to undertake statistical inference then you will likely need a linear regression model and (largely) obey the Gauss-Markov assumptions. But if you do not require hypothesis testing, then you may use random forests or SVMs or neural networks, and pay absolutely no heed to residual plots and p-values.

The dataset in question is pollution- and weather-related, with the goal of forecasting hourly 2.5 micron particulate matter (“PM 2.5”) concentrations in the air. It is a continuous variable. “PM 2.5” is the finest category of particulate matter air pollution, and poses significant health risks because the particulates are so fine that they may bypass most of the body’s natural defences, and enter deep into the lungs when inhaled. The data is freely available at the UCI repository here.[1]

[1] Song Xi Chen (2016). UCI Machine Learning Repository []; Irvine, CA: University of California, School of Information and Computer Science.

Stack Ensemble Modelling

I will populate the stack ensemble with some of the greatest hits of machine learning algorithms, but will not include any econometric time series models. The first stage of the stack will comprise the following base models:

  • Lasso Regression (Lasso)
  • Multi-Layer Perceptron (MLP), an artificial neural network
  • Linear Support Vector Regression (SVR)
  • Support Vector Machine (SVM) — restricted to either rbf, sigmoid or poly kernels
  • Random Forest Regressor (RF)
  • XG Boost Regressor (XGB)

The second (and final) stage of the stack is a single meta model represented by the all-time favourite Linear Regression (“OLS”) model. A simple diagram representing my methodology is below.

Stack model diagram, with data split three ways from earliest batch at bottom to latest test set at top (image by author)

There is no “right” way to do ensemble stack modelling. It is a mostly a matter of practical experience leavened with a large dose of trying and testing. A fairly typical approach is to use several of each machine learning algorithm set to different hyper-parameters in the first stage, and their predictions are then fed to a meta model that is trained on these predictions and the target variable.

More complex setups may involve additional layers of models preceding the meta model. A blog post by the data analytics company SAS explains the general idea of stacking here.

If there are any principles to stack ensemble modelling, there are three that come to mind:

  1. Assemble a mix of algorithms that can provide decent predictions of the particular data of interest, but using different methodologies. For example, I utilise a mix of linear models, tree-based models, support vector models and neural networks in this ensemble.
  2. Over-fitting is often an issue, hence the importance of cross-validation in assessing the stack model.
  3. Be meticulous about segregating the various training, validation and test data layers to prevent them from “bleeding” into the next layer.

Target Variable

The dataset contains hourly data over five years, from 1 January 2010 through 31 December 2014, of Beijing’s PM 2.5 particulate readings along with selected weather variables such as temperature, wind direction, and rainfall among others. There are 43,824 rows in the original csv file. The target variable of this exercise will be the one-period-ahead (i.e. one-hour ahead) PM 2.5 reading.

The line chart of the target variable suggests strong seasonalities in the data, but no obvious multi-year trend. The box-plots of the distributions of the PM 2.5 readings by year also indicate the lack of a trend, though the data is bounded at zero and appears to be plagued by significant numbers of “outliers” at higher values. I suppose one cannot have negative levels of airborne particulates.

However, some of these “outliers” might not be outliers at all, in that there could be some pattern to certain elevated incidences of air pollution. This is where domain knowledge and/or plain intellectual curiosity becomes important. Let’s begin by taking a closer look at the data.

Top 15 PM 2.5 readings

The table on the left lists the top fifteen hourly PM 2.5 readings in the dataset. It might not be immediately obvious, but the top three readings are from the first hours of the Lunar New Year (LNY) in 2010 and 2012. The lunar calendar is different from the Gregorian (solar) calendar, in that the LNY may fall on different Gregorian days of the year. The new year is traditionally ushered in by setting off firecrackers. Firecrackers generate a lot of smoke and airborne debris, and generally contribute to air pollution. Therefore, one needs to account for the hours around the midnight of LNY in the model. I wonder if an AutoML would be able to identify this?

Beijing is also prone to occasional sandstorms in the early spring, and there was a major sandstorm on 22 March 2010, which contributed to two of the listings on the top 15 list above. Therefore, the model will also need to account for incidences of major sandstorms. Sandstorms may be forecasted ahead of time, though perhaps not very far ahead, and so could certainly serve as an explanatory variable for hour-ahead PM 2.5 forecasts.

In the course of the explanatory analysis, I also found that PM 2.5 readings over weekend hours as a group were higher than those during weekday hours. The t-test of differences in means returned a t-statistic of 4.6551, implying that the difference is significant at the 99% level.


Moreover, there are complex seasonalities in the data. The plots of PM 2.5 readings by year reveal notable peaks in the beginning and towards the end of the year, basically the colder months. This is because much of Beijing’s pollution is generated from indoor heating purposes, and there will obviously be more heating when it’s cold. This may also explain why PM 2.5 readings tend to be higher in the weekends.

Hourly PM 2.5 readings in 2010–2014

The seasonalities become more obvious when we look at the average PM 2.5 reading by month, along with their confidence intervals. The monthly data confirms the seasonality posited above with lower levels recorded during the warmer months, and higher levels during the colder months with particular peaks in February and October. The Lunar New Year also tends to fall sometime between late January through late February by the way.

Drilling deeper into the data, there is also obvious intraday seasonality. In fact, the daily seasonality is almost as important as the monthly one in this case. The average hourly PM 2.5 readings start at the highest levels of the day at around midnight, then decline gradually to a bottom in the afternoon at 15:00 before rising quickly into the night. This is probably the heating effect as there is less need for heating during daytime when the sun is up.

What does all this data analysis suggest? I need dummy variables for the following:

  • The seven hours from 9pm through 3am of each year’s LNY;
  • The days of a major sandstorm;
  • Weekends;
  • Months; and
  • Hours.

My determination of a “major sandstorm” is very straightforward. Just Google “Beijing sandstorm 201x” for each year in the dataset, and whatever the date of the listings that turn up on the first page of the search results will be recorded as a “major sandstorm”.

But before undertaking the necessary data engineering, however, I first need to deal with the pressing issue of missing values.

Missing Target Values

One major problem with the data is that the missing values are entirely in the target variable. Almost 5% of the target observations are NaNs, and two-thirds of the missing data are in the first two years, namely 2010 and 2011. This is a problem because the training data will be from the earlier period, and distorted target variable values in the training set could result in a model that performs poorly on the later test period.

The seasonalities described above suggest that simply imputing the overall mean, median or mode to replace the NaN values is not a good idea. Neither would random assignment from the available data or carrying-forward the last observation. Some of the (hourly) missing data also flow continuously across days, so linear interpolation that flows from one day through the next would also introduce distortions (remember the intraday seasonality). These typical easy missing data imputation methods in the face of complex seasonalities in the target variable would certainly introduce distortions in the data.

Finally, the fact that the missing values are entirely in the target variable suggests that any attempt to impute those values from the other variables (i.e. the subsequent explanatory variables) means that one is setting up manufactured target values that would be easily predicted later by those same explanatory variables in whatever algorithm one chooses to use.

This exercise is not supposed to be a comprehensive coverage of imputation methods on the target variable, so I decided on a not-too-complicated solution to the problem that respects the hourly and monthly seasonalities. A three-stage process was executed:

  1. Delete the missing first day’s observations (1 January 2010 hourly PM 2.5 readings were all NaNs).
  2. Interpolate “inside” missing values in the observations between 0:00 and 14:00 (inclusive) in the day, and then the observations between 15:00 and 23:00 (inclusive). This two-step procedure respects the two-part intraday seasonality in the data as discussed above.
  3. Impute the remaining missing values by the median of the available values grouped by month and hour of each respective year. I decided on using the median value rather than the mean given the significant occurrences of extreme values in the PM 2.5 readings as noted previously.
Code for two-step interpolation of hourly PM 2.5 across the individual days

Correlation Matrices

Following the missing data imputation, I then explored the correlation matrix of the target variable against the other continuous variables in the dataset. Many of the weather-related variables seem to have rather weak correlations to the target variable.

At the same time, there are also high correlations between several of the weather variables, such as among “dewp”, “temp” and “pres”. These are dew point, temperature and pressure readings. If one of these three variables are to be discarded, “pres” seems the most obvious given its high pairwise correlations to the other two variables (“temp” & “dewp”), and the lowest correlation of the three to the target variable. So I dropped the “pres”, “cr” and “cs” variables from the analysis.

Previous studies on PM 2.5 pollution around the world identified wind strength as a major explanatory factor. Basically, strong persistently windy conditions are effective in dispersing airborne pollution. The “cws” variable in the dataset is a cumulative wind strength variable, and we see from above that it has a moderate +0.25 correlation to the target variable.

In this particular case, there is an interesting interaction between wind speed and wind direction. Northerly winds generate stronger correlations between “cws” and PM 2.5, and also reduce the correlation between temperature and PM 2.5. You can observe this phenomenon in the two correlation matrices below, with the first one when wind direction is mixed/uncertain and the second when there is a north-easterly wind blowing.

The findings suggest that I should craft an interation variable between northerly wind direction and “cws”.

Autocorrelation & Stationarity

As this is a hourly time series, and on air pollution, it is logical that the autocorrelation in the dependent variable will be high. PM 2.5 particulates in the air are unlikely to suddenly appear or disappear from one hour to the next. They instead accumulate or dissipate gradually over time. Thus, there is significant autocorrelation in the lags of the target variable as we see on the left plot below.

However, the partial autocorrelations drop away rapidly after two lags. Time series aficionados will immediately recognise these as indications of at least an AR(2) series. If I were to run an ARIMA model, for example, I would include at least two lags of the target variable in the model. While I do not use any econometric time series model in the analysis, it does suggest that lags of PM 2.5 readings should be incorporated as features in the model.

Furthermore, the lack of a trend in the PM 2.5 particulate readings suggests that the target variable is naturally stationary. In fact, all the continuous variables in the dataset are stationary. Augmented Dickey-Fuller tests confirm that the stationary null hypothesis was rejected for all the variables.

The BIC results from Statsmodels’ auto arma tool recommend just two lags of the target variable, while the AIC results suggest nineteen lags. I would prefer more parsimony in this case. It is interesting to note that the AIC score hit an initial low at three lags before dropping further with greater lags later on. In the end, I decided to go with three lags of the target variable.

List of Explanatory Variables

That was a lot to go through, but entirely necessary. Feature engineering is often more important than running fancy algorithms or hyper-parameter tuning, which I will also do! In the end, I arrived at the following list of variables:


  • One-period ahead PM 2.5 reading


  • Current and two lags of “PM 2.5”
  • “temp” – temperature (current)
  • “dewp” – dew point (current)
  • “cws” – cumulated wind speed (current)
  • three “cbwd” dummies – wind direction (current)
  • interactive variable of northerly wind direction & “cws” (current)
  • Lunar New Year dummy
  • major sandstorm dummy
  • weekend dummy
  • hour dummies
  • month dummies

The last five dummy variables are all set one-period ahead or concurrent with the target variable. This is because the LNY date, major sandstorms, weekends, hours and months may all be correctly forecasted at least one hour ahead. We may not know the wind direction in the next hour or two, but we can certainly agree that 9pm will follow 8pm, or October will follow September, or that the next Lunar New Year will fall on 1 February 2022.

As an aside, I am aware that cyclic time variables forged through trigonometric transformations of time and calendar variables can often be a useful alternative to dummies. But cyclic time variables cannot be used in tree-based models, and so I decided against them here because I am interested in using such models in the stack ensemble.

Or the other hand, it is entirely possible to feed differently tailored features to different algorithms in a stack ensemble. I have devised such stacks. However, let us keep things more straightforward in this exercise.

Stacking Process

Now we finally arrive at the modelling. I first split the data three ways: gridsearch cross-validation training data, meta model training data, and holdout test data. As I am dealing with time series in this case, there is a need to respect the flow of time in the splits.

Data split three ways according to the temporal flow of time (image by author)

The latest 10% (in terms of the time flow) of the data is used as the holdout test set, comprising 4,380 observations. Of the remaining 90% of the data, the earliest two-thirds of observations are allocated to the gridsearch training data (first batch of training data), and the later one-third to the meta model training data (second batch).

As for the data scaling procedure through StandardScaler(), only the gridsearch (first batch) training data is fitted and transformed. The other two sub-samples are only transformed.

Each of the base models is tuned through GridsearchCV (using TimeSeriesSplit==3) to find their optimal hyper-parameter settings on the gridsearch (first batch) training data. Each post-gridsearch model is then trained on the full first batch training data. To avoid any confusion, I say “full” in this instance because the GridsearchCV procedure only fits the model to a portion of the gridsearch training data at each iteration. I reproduce the earlier diagram of the modelling methodology below.

Stack model diagram, with data split three ways from earliest batch at bottom to latest test set at top (image by author)

The meta model training set (second batch of data) is then fed to each trained base model to produce predictions of the target variable. These predictions are subsequently used as the explanatory variables in the meta model. The meta OLS model is basically trained by regressing the target values in the meta training set on the base models’ predictions.

The last step is to have the six base models (still only trained on the first batch of data) generate their respective predictions of the target variable in the holdout test set. These predictions are then fed to the meta OLS model (trained as above) to produce the stack model’s predictions of the target values in the test set.

A summary of the procedure is as follows:

  1. Split data three ways, holding the latest 10% of the data as the holdout test set, and splitting the remaining 90% into an earlier gridsearch training set (2/3) and a later meta model training set (1/3);
  2. GridsearchCV the six base models to find their respective optimal hyper-parameters, and train the optimised models on the full gridsearch training data;
  3. The base models’ predictions on the meta training set form the explanatory variables to train the meta model on the target variable; and
  4. The base models finally make their predictions on the holdout test set, which are again fed into the meta model, and the stack’s predictions of the target variable in the holdout test set are scored.

All the models are only fitted/trained once. The base models are only trained on the gridsearch (first batch) training data, and the meta model on the base models’ predictions on the second batch training data. In any event, the full code is available at my GitHub page with the link at the bottom of the article for those who are interested in scrutinising the details.

Cross-Validation Sidetrack

The optimised base models following GridsearchCV are listed below. As an aside, I carried out a separate cross-validation exercise on the stack ensemble using the post-gridsearch hyper-parameter settings for the base models. No gridsearch was carried out on the OLS meta model, and there is not much need anyway as the single hyper-parameter is just the constant term (and to some extent whether or not to normalise the explanatory variables’ data), and I always keep it for this exercise.

Post-gridsearch list of base models

For the cross-validation exercise, I combine both batches (base+meta) of the training set to reconstitute the full 90% training data), and set CV=5 on this data sub-sample for the iterations. The cross-validation is done through the forward chaining or expanding window approach. A good explanation of this approach for time series CV is available at the following StackExchange discussion. I reproduce the chart used in the discussion below to illustrate the approach, though it should be emphasised that the “Data” in the chart should not include the holdout test set.

Sklearn’s TimeSeriesSplit function cannot be used in the cross-validation procedure unfortunately. This is because the base model’s predictions form the inputs for the meta model. So we need to draw out each base model’s predictions during their respective cross-validation process in each fold, and then feed these predictions to the meta model as it is undergoing its own cross-validation. So a bespoke code is required.

To ensure no possibility of any data contamination between the stack layers before scoring the holdout test set later on, the cross-validation procedure described here is implemented on a separate Jupyter Notebook, which you may inspect on my GitHub page.

The MAE and RMSE scores from the 5-fold cross-validation procedure are presented below, and the “Stack Model”, which is the OLS meta model, has the lowest mean and median MAE scores. Among the RMSE scores, however, the “Stack Model” has the lowest median score but the second lowest mean score.

Cross-validation MAE scores & statistics
Cross-validation RMSE scores & statistics

The stack ensemble methodology appears to be largely working as one hopes, namely producing more accurate results than any of the base models in the ensemble. It is not exactly quite there, but it looks promising as we head to the holdout test set.

Holdout Test Results

One should always have a baseline model for comparison, and the typical time series baseline is the “Persistence Model”. This is simply the model of the target variable being predicted by its lagged value. In this case, it would be the one-hour ahead PM 2.5 reading being predicted by its current reading.

As mentioned, the base models are only trained once, on the gridsearch training set, and they then make their predictions on the target variable’s values in the holdout test set. These predictions form the features of the OLS meta model, which then makes the stack’s predictions on the test set.

A couple of selected charts of the Stack Model’s predictions versus the actual target values in the test set are presented below. I shorten the charts’ coverage to 400 (hourly) observations to render them easier to follow. Here below we see the first 400 observations.

Predictions versus actual PM 2.5 readings in the first 400 observations of test set

Then the middle 400 observations are shown below.

Predictions versus actual PM 2.5 readings in the middle 400 observations of test set

We see that the Stack Model achieved the lowest MAE and RMSE scores on the holdout test set. Moreover, compared to the baseline Persistence Model, the stack ensemble achieved an improvement of 5.5% in the RMSE (18.66 versus 19.76) and 5.3% in the MAE (10.53 versus 11.12) scores.

Test scores on the holdout test set

One may also note that the stack’s RMSE score on the test set is within the range observed in the 5-fold cross-validation procedure above, but the MAE score is below the range. In general, the error scores on the holdout test set tend to be lower than those observed in the cross-validation process across all the models, implying that the test set was generally easier to predict. This highlights the importance of undertaking multi-fold cross-validation in assessing model performance.


Machine learning algorithms are well-suited to time series forecasting. I utilised a machine learning stack to forecast one-hour ahead PM 2.5 air pollution levels. The stack ensemble included a diverse mix of linear models, tree-based models, support vector models and neural networks as base models. The final meta model was the all-time favourite OLS.

The data exhibited significant outliers and complex seasonalities, and was plagued by missing target values. Fancy algorithms and methodologies are never a substitute for careful pre-modelling data analysis and engineering. Following this, the data was split three-ways, according to its temporal order, with the latest 10% of the data taken as the holdout test set. The remaining 90% of the data was in turn spliced into an earlier gridsearch training set (2/3) for the base models, and a later meta training set (1/3) for the meta model.

The training data (the 90% above) was also used to run a 5-fold forward chain cross-validation procedure to assess model performance, for all the models used. The cross-validation found that the Stack Model generally outperformed the individual base models on MAE and RMSE scores.

The subsequent results on the holdout test set showed the Stack Model with the best MAE and RMSE scores. The stack’s scores also evinced a 5–6% improvement over the baseline Persistence Model. In conclusion, the exercise demonstrated the effectiveness of a machine learning ensemble stack approach to multivariate time series data.

(The full Python code and data for this exercise is available at my GitHub page.)


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: