N-BEATS: Time-Series Forecasting with Neural Basis Expansion

Created with DALLE [1]

There’s one thing that makes Time-Series Forecasting special.

It was the only area of Data Science where Deep Learning and Transformers didn’t decisively outperform the other models.

Let’s use the prestigious Makridakis M-competitions as a benchmark — a series of large-scale challenges that showcase the latest advances in the time-series forecasting area.

In the fourth iteration of the competition, known as M4, the winning solution was ES-RNN [2], a hybrid LSTM & Exponential Smoothing model developed by Uber. Interestingly, the 6 (out of 57) pure ML models performed so poorly, they barely surpassed the competition baseline.

That changed one year later. ElementAI (co-founded by Yoshua Bengio) published N-BEATS [3], a pure Deep-Learning model that outperformed the winning ES-RNN model of M4 by 3%. But there’s more.

In this article, we describe in depth:

  1. The architecture of N-BEATS, how the model works, and why it is so powerful.
  2. How N-BEATS produces interpretable forecasts.
  3. How N-BEATS achieves unparalleled zero-shot transfer learning.
  4. Why ARIMA cannot natively support transfer learning.

Let’s dive in.


What is N-BEATS

N-BEATS is a fast, interpretable DL model that recreates the mechanisms of statistical models using double residual stacks of fully connected layers.

N-BEATS stands for Neural Basis Expansion Analysis for Time Series, a revolutionary model created by ElementAI. This company was co-founded by Yoshua Bengio and was later acquired by ServiceNow.

N-BEATS is an interesting forecasting model because:

  • It is the first pure Deep Learning model that outperformed all well-established statistical approaches.
  • It provides interpretable forecasts.
  • It sets the basis for Transfer Learning on time series.

Around that time, Amazon published its novel time-series model, known as DeepAR [4]. Although DeepAR contains Deep Learning components, the model employs a few statistical concepts as well (maximum likelihood estimation).

N-BEATS — Overview

Let’s briefly discuss a few key traits of N-BEATS:

  • Multiple time-series support: N-BEATS can be trained on multiple time series, each one representing a different distribution.
  • Fast Training: The model does not contain any Recurrent or self-attention layers — thus, faster training & stable gradient flow.
  • Multi-horizon forecasting: The model produces multi-step predictions.
  • Interpretability: The authors developed 2 model versions, the generic version, and the interpretable version. The interpretable version can output interpretable predictions, regarding trend and seasonality.
  • Zero-shot Transfer Learning: The model can transfer its knowledge to other time-series datasets with astounding success.

Note: The original N-BEATS implementation by ElementAI works on univariate time-series only. The Darts library has released an updated version that supports multivariate time-series and probabilistic outputs. In this article, we focus on the original version.

N-BEATS — Generic Architecture

The N-BEATS architecture is deep, yet very simple. Figure 1 displays the top-level view:

Figure 1: The top-level architecture of N-BEATS (Source)

Notice 3 things:

  1. The block (blue color) — the basic processing unit.
  2. The stack (orange color) — a collection of blocks.
  3. The final model (yellow color) — a collection of stacks.

Every neural network layer in the model is just a dense (fully-connected) layer.

Let’s start with the first component, the basic block:

1. The Basic Block

Suppose H is the forecasting horizon. In N-BEATS, the lookback window is a multiple of the horizon H.

Figure 2 displays the architecture of the basic block:

Figure 2: The basic Block architecture (Source)

Let’s take a look under the hood.

In Figure 3, we use the parameters of the paper’s benchmark from the Electricity dataset [5]:

Figure 3: All operations inside the basic Block (Image by author)

Let’s see what happens here:

  • The model looks back 3 days = 72 hours = 3 horizons to predict the power usage of the next 24 hours.
  • The block receives the lookback window input.
  • The input then passes through a 4-layer neural network.
  • The result of this computation is directed to 2 outputs. Here, the dense layers Dense 5 estimate the theta parameters (θ^b and θ^f), which are called expansion coefficients.
  • These parameters are then linearly projected into a new space using the basis layer transformations g^b and g^f to produce the backcast and forecast signals. This process is called “neural basis expansion”.

So, how are the backcast and forecast vectors useful?

The backcast signal is the block’s reconstruction of its own input: the part of the lookback window that the block can explain and therefore uses to produce the forecast signal, given the g^b and g^f transformations. When g^b and g^f take specific forms, the backcast and forecast vectors become interpretable (more on that later).
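To make the data flow concrete, here is a minimal NumPy sketch of one generic block’s forward pass, using the 72-hour lookback and 24-hour horizon from the example above. The hidden width, the size of the theta vectors, and the random (untrained) weights are assumptions chosen for illustration; this is not the authors’ implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

LOOKBACK, HORIZON = 72, 24    # 3 days in, 1 day out (hourly data)
HIDDEN, THETA_DIM = 512, 32   # assumed sizes, for illustration only

def dense(n_in, n_out):
    """Randomly initialised weights and bias of one fully connected layer."""
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

# Four fully connected layers, each followed by a ReLU.
fc_layers = [dense(LOOKBACK, HIDDEN)] + [dense(HIDDEN, HIDDEN) for _ in range(3)]
# Two linear heads estimate the expansion coefficients theta^b and theta^f.
W_theta_b, _ = dense(HIDDEN, THETA_DIM)
W_theta_f, _ = dense(HIDDEN, THETA_DIM)
# Generic basis layers g^b and g^f: plain learnable linear projections.
G_b, _ = dense(THETA_DIM, LOOKBACK)
G_f, _ = dense(THETA_DIM, HORIZON)

def generic_block(x):
    """Map a lookback window to a (backcast, forecast) pair."""
    h = x
    for W, b in fc_layers:
        h = np.maximum(h @ W + b, 0.0)   # dense layer + ReLU
    theta_b = h @ W_theta_b              # backcast expansion coefficients
    theta_f = h @ W_theta_f              # forecast expansion coefficients
    backcast = theta_b @ G_b             # projection onto the backcast basis
    forecast = theta_f @ G_f             # projection onto the forecast basis
    return backcast, forecast

x = rng.normal(size=LOOKBACK)            # stand-in for 72 hourly readings
backcast, forecast = generic_block(x)
print(backcast.shape, forecast.shape)    # (72,) (24,)
```

The backcast has the same length as the input, which is what lets a later block subtract it from that input.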

2. The Stack

To increase the effectiveness of the neural expansion process, the authors stack many blocks together. This structure is displayed in Figure 4:

Figure 4: Stack of blocks (left) and stack of stacks (right) — (Source)

Only the first block receives the original sequence input. Each downstream block receives the residual x_{l+1}, i.e. the previous block’s input x_l minus its backcast (where l is the block index, so l = 1 is the first block in the stack).

Figure 5: Operations inside the block (Image by author)

Inside each stack, the backcast and forecast signals are organized into two branches. This topology is called doubly residual stacking and can be described by the following equations:

x_{l+1} = x_l − x̂_l,    ŷ = Σ_l ŷ_l

In each block, the model subtracts the backcast signal x̂_l, the part of the input x_l that the block has approximated well, from x_l and passes the remainder on. In other words:

The model at each block learns to optimally approximate a portion of the input signal, and sends the rest to be approximated by the downstream blocks.

Since each block models only a portion of the input signal, the final forecast is the sum of all forecast ŷ signals from all blocks.

Finally, the stacks are also stacked (Figure 4, right). This architectural choice further increases the depth of the model and enhances its ability to learn complex time sequences.
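The doubly residual scheme can be sketched in a few lines. The blocks below are hypothetical stand-ins (small random linear maps), not trained N-BEATS blocks; the point is only the bookkeeping: each block’s backcast is subtracted from the running residual, and each block’s forecast is added to the running total.

```python
import numpy as np

rng = np.random.default_rng(1)
LOOKBACK, HORIZON, N_BLOCKS = 72, 24, 3

def make_toy_block():
    """Hypothetical stand-in for a trained block: two small linear maps."""
    W_back = rng.normal(0.0, 0.01, size=(LOOKBACK, LOOKBACK))
    W_fore = rng.normal(0.0, 0.01, size=(LOOKBACK, HORIZON))
    return lambda x: (x @ W_back, x @ W_fore)

blocks = [make_toy_block() for _ in range(N_BLOCKS)]

def doubly_residual_forward(x, blocks):
    residual = x.copy()            # the first block sees the raw input
    forecast = np.zeros(HORIZON)   # running sum of partial forecasts
    partials = []
    for block in blocks:
        backcast, block_forecast = block(residual)
        residual = residual - backcast         # x_{l+1} = x_l - backcast_l
        forecast = forecast + block_forecast   # y_hat = sum of block forecasts
        partials.append(block_forecast)
    return forecast, residual, partials

x = rng.normal(size=LOOKBACK)
forecast, residual, partials = doubly_residual_forward(x, blocks)
```

Stacking the stacks works the same way: the residual leaving the last block of one stack becomes the input of the next stack, and the stack forecasts are summed.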

We have seen how the generic version of N-BEATS works. Next, we will describe the interpretable version.

Are N-BEATS and ARIMA related?

If you are familiar with ARIMA, you might have noticed a few similarities with the N-BEATS approach.

ARIMA is modeled using the Box-Jenkins method, which is an iterative process. Specifically:

  1. First, we guess the orders of the AR() and MA() functions (the p and q parameters).
  2. Afterwards, we estimate the coefficients of these parameters using e.g. maximum likelihood estimation.
  3. Then, we verify if the model’s conditions hold. For instance, the model’s residual errors should be normal and independent. If not, we return to step 1 and repeat the process. This time, we add new p and q degrees on top of the previous ones.

In other words, in each step of Box-Jenkins, we add more information to our model. Each iteration creates a better representation of the input, based on the model residuals.
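The loop above can be sketched as code. This is a deliberately simplified stand-in for Box-Jenkins: it fits AR(p) terms only (no MA part, no differencing), estimates coefficients by least squares instead of maximum likelihood, and replaces the formal residual diagnostics with a crude band on the residual autocorrelations. All function names and thresholds are hypothetical.

```python
import numpy as np

def fit_ar(y, p):
    """Least-squares fit of an AR(p) model; returns coefficients and residuals."""
    # Column k holds the series lagged by k + 1 steps, aligned with y[p:].
    X = np.column_stack([y[p - k - 1 : len(y) - k - 1] for k in range(p)])
    target = y[p:]
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return coef, target - X @ coef

def residuals_look_white(resid, n_lags=10, band=3.0):
    """Crude check: all residual autocorrelations lie inside +/- band / sqrt(n)."""
    r = resid - resid.mean()
    denom = np.sum(r ** 2)
    acf = np.array([np.sum(r[k:] * r[:-k]) / denom for k in range(1, n_lags + 1)])
    return bool(np.all(np.abs(acf) < band / np.sqrt(len(r))))

def iterative_ar_fit(y, max_order=8):
    """Raise the order until the residuals pass the check (or max_order is hit)."""
    for p in range(1, max_order + 1):
        coef, resid = fit_ar(y, p)        # step 2: estimate the coefficients
        if residuals_look_white(resid):   # step 3: verify the residuals
            break                         # otherwise: back to step 1 with larger p
    return p, coef, resid

# Simulated AR(2) series used as toy data.
rng = np.random.default_rng(3)
n = 500
noise = rng.normal(size=n)
y = np.zeros(n)
for i in range(2, n):
    y[i] = 0.6 * y[i - 1] - 0.3 * y[i - 2] + noise[i]

order, coef, resid = iterative_ar_fit(y)
print("selected AR order:", order)
```

Each pass through the loop refines the model based on what the previous residuals still contain, which is the analogy to N-BEATS blocks drawn below.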

Hence, we can conclude that:

In N-BEATS, each successive block models only the residual error due to the reconstruction of the backcast from the previous block and then updates the forecast based on that error. This process mimics the Box-Jenkins method when fitting ARIMA models.

The main difference between the 2 approaches is the target function of the residuals. ARIMA focuses on the quality of the residuals, while N-BEATS uses an arbitrary loss function.

Plus, we don’t manually adjust any equation with N-BEATS— the basis transformations are automatically optimized with backpropagation. With ARIMA however, we make heavy use of autocorrelation and partial autocorrelation plots to guess the order of AR() and MA() functions.

Note: Some ARIMA libraries implement the Box-Jenkins method with slight variations, depending on the programming language and the library. Here, we document the textbook implementation.

N-BEATS — Interpretable Architecture

With a few changes, the N-BEATS model can become interpretable:

  • We use only 2 stacks, the trend and seasonality stacks. The generic architecture uses at least 30.
  • Both trend and seasonality stacks contain 3 blocks. In the generic architecture, we have one block per stack.
  • The basis layer weights of g^b and g^f are shared at the stack level.

The basic idea is that the g^b and g^f bases take specific forms. Let’s describe them in more detail.

The Trend Block

Our goal is to restructure the g^b and g^f functions as monotonic functions that vary slowly across the forecast window.

Given a time vector t=[0,1,2,…,H−2,H−1]/H (H is the horizon; the paper normalizes the time index by H), the thetas θ from the previous layer and the polynomial degree p, the trend model is defined as:

ŷ = Σ_{i=0}^{p} θ^f_i t^i

In other words, we use the architecture of Figure 3 (the generic block) and swap the last linear layer with the above operation. The result is shown in Figure 6:

Figure 6: The Trend block (Image by author)
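As a small illustration of the trend basis, the sketch below builds the polynomial matrix for a 24-step horizon and degree 3, then applies it to a coefficient vector. The time grid is divided by H so that it lies in [0, 1), which is how I read the paper; the theta values are made up for the example.

```python
import numpy as np

H, P = 24, 3                      # horizon and polynomial degree

t = np.arange(H) / H              # normalised time grid in [0, 1)
# Trend basis: one column per power of t, shape (H, P + 1).
T = np.stack([t ** i for i in range(P + 1)], axis=1)

# Hypothetical expansion coefficients, as if produced by the block's dense layers.
theta_f = np.array([1.0, 2.0, 0.0, -1.0])
trend_forecast = T @ theta_f      # y_hat = sum_i theta_i * t**i

# Sanity check against NumPy's polynomial evaluation (highest power first).
assert np.allclose(trend_forecast, np.polyval(theta_f[::-1], t))
```

Because the basis is fixed and low-degree, the block can only output smooth curves, and the learned thetas can be read directly as polynomial coefficients.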

The backcast equations are not described in the paper, but they can easily be derived from the project’s implementation. Moreover, the trend and seasonality blocks (Figure 6 and Figure 7) adopt the parameters from the official project repo:

interpretable.seasonality_layer_size = 2048
interpretable.seasonality_blocks = 3
interpretable.seasonality_layers = 4
interpretable.trend_layer_size = 256
interpretable.degree_of_polynomial = 3
interpretable.trend_blocks = 3
interpretable.trend_layers = 4
interpretable.num_of_harmonics = 1

The Seasonality Block

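Similarly, we swap the final layer with appropriate g^b and g^f functions that capture seasonality. An excellent candidate is the Fourier series (with t the time vector normalized by the horizon H):

ŷ = Σ_{i=0}^{⌊H/2−1⌋} [ θ^f_i cos(2πit) + θ^f_{i+⌊H/2⌋} sin(2πit) ]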

Then, the architecture becomes:

Figure 7: The Seasonality block (Image by author)
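The seasonality basis can be sketched the same way. The layout below (cosine columns followed by sine columns, frequencies 0 to ⌊H/2⌋ − 1, time normalised by H) is my reading of the paper’s Fourier formulation and should be treated as an assumption; the coefficients are hypothetical.

```python
import numpy as np

H = 24
t = np.arange(H) / H                  # normalised time grid in [0, 1)
n_freq = H // 2                       # frequencies i = 0 .. floor(H/2) - 1
freqs = np.arange(n_freq)

cos_part = np.cos(2 * np.pi * np.outer(t, freqs))   # shape (H, n_freq)
sin_part = np.sin(2 * np.pi * np.outer(t, freqs))   # shape (H, n_freq); column 0 is all zeros
S = np.concatenate([cos_part, sin_part], axis=1)    # seasonality basis, shape (H, 2 * n_freq)

# Hypothetical coefficients: put all the weight on cos(2*pi*t),
# i.e. one full cycle over the forecast horizon.
theta_f = np.zeros(2 * n_freq)
theta_f[1] = 1.0
seasonal_forecast = S @ theta_f
```

With the basis fixed to sines and cosines, the block can only emit periodic patterns, which is what makes the stack’s output readable as a seasonality component.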

Again, we stress that in all interpretable stacks, the g^b and g^f weights within the stack are shared.

Experimental Results

Finally, the authors tested the performance of N-BEATS on 3 popular time-series datasets: M3 [6], M4 [7], and Tourism [8].

Experimental Setup

The authors categorized all models into specific classes and compared N-BEATS with the best model of each class. For example, the DL/TS hybrid is the winning ES-RNN model on M4.

Since all these datasets were used for data science competitions, all participants relied on ensembling to achieve maximum performance. Hence, the N-BEATS authors relied on ensembling to be comparable. They used three variations: N-BEATS-G (generic), N-BEATS-I (interpretable), and N-BEATS-I+G (ensemble of all models from N-BEATS-G and N-BEATS-I).

On top of that, they created 6 different models with lookback windows 2H, 3H, …, 7H for every horizon and variation. For extra details about the ensembling configurations, check the original paper. In total, the authors ensembled 180 models to report the final results on the test set.
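The sketch below shows one way the 180 ensemble members can be enumerated and combined. The breakdown (6 lookback windows × 3 training losses × 10 random restarts) and the use of the median as the aggregation function are how I recall the paper’s setup, so treat both as assumptions; the member forecasts themselves are random placeholders, not real model outputs.

```python
import numpy as np
from itertools import product

H = 24
lookback_multipliers = [2, 3, 4, 5, 6, 7]   # lookback windows 2H .. 7H
losses = ["sMAPE", "MASE", "MAPE"]          # assumed training losses
n_repeats = 10                              # assumed number of random restarts

# One ensemble member per (lookback, loss, restart) combination.
members = list(product(lookback_multipliers, losses, range(n_repeats)))
print(len(members))                         # 180

rng = np.random.default_rng(2)
# Placeholder forecasts, one row per member (a real run would train each model).
member_forecasts = rng.normal(loc=100.0, scale=5.0, size=(len(members), H))

# Combine the members point-wise; the median is robust to a few bad members.
ensemble_forecast = np.median(member_forecasts, axis=0)
```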


The results for all datasets are shown in Figure 8:

Figure 8: Experimental results on M3, M4, and Tourism datasets (Source)

The results are quite impressive.

N-BEATS outperforms the other implementations on all datasets, with N-BEATS-I+G being the most successful.

Note that each dataset is scored with the metric of its own competition: sMAPE for M3, sMAPE and OWA for M4, and MAPE for Tourism (lower is better). These metrics are popular in time-series competitions.

Note: Contrary to the other approaches, N-BEATS does not require any hand-crafted feature engineering or input scaling. Thus, N-BEATS is far easier to use in different time-series tasks.
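For reference, here is a minimal sketch of these metrics using their standard competition definitions. The toy numbers are made up, and the OWA helper assumes you already have the sMAPE and MASE of the Naive2 baseline that the M4 competition uses for normalisation.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent (0 to 200 scale, as in the M-competitions)."""
    return 200.0 * np.mean(np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred)))

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    """Overall weighted average: mean of sMAPE and MASE relative to Naive2."""
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)

# Toy example with made-up values.
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])
print(round(mape(y_true, y_pred), 3), round(smape(y_true, y_pred), 3))
```

An OWA below 1 means the model beats the Naive2 baseline on average across the two metrics.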

Zero-shot transfer learning

The main contribution of N-BEATS is its ability to successfully implement transfer learning on time series.


Transfer learning is a more general term— it refers to how a model can transfer its knowledge across different datasets. This is already established on Computer Vision or NLP tasks: We can download a pretrained model and adjust it to our dataset with fine-tuning.

Meta-learning (or few-shot-learning) is when the model can adapt to our dataset with little training/fine-tuning. The best scenario is zero-shot learning, where the model is not trained on the target dataset.

Zero-shot learning is the model’s ability to make predictions on unseen data, without having been specifically trained on them. This learning method better reflects human perception.

Besides, this paradigm shift towards meta-learning has been embraced by the latest AI research: OpenAI’s CLIP [9] and Whisper [10] are two examples.

Zero-Shot N-BEATS

Yoshua Bengio (co-author of N-BEATS) has already established the theoretical foundation of transfer learning on forecasting tasks in his previous work [11].

The authors of N-BEATS published a follow-up paper [12] that summarizes most of this work, including what requirements a time-series forecasting model should meet to perform efficient transfer learning.

Let’s focus on N-BEATS.

The authors state that N-BEATS’s meta-learning ability hinges on two procedures: i) the inner learning procedure and ii) the outer learning procedure. They are shown in Figure 9:

Figure 9: Meta-learning procedures in N-BEATS (Source, edited by author)

The inner loop takes place inside each block and focuses on learning task-specific characteristics.

The outer loop takes place at the stack level. Here, the model learns global characteristics across all tasks.

In other words, the inner loop learns local temporal traits, while the outer loop learns longer dependencies across all time-series.

However, this raises the following question:

Why is ARIMA not suitable for Transfer Learning?

If an established paradigm dictates which criteria a forecasting model should meet to be appropriate for transfer learning, then why does ARIMA not qualify?

To answer this question, we will again focus on the two learning procedures described by [12].

When creating an ARIMA model, there are 2 challenges:

  1. Parameter estimation: Parameters are estimated using a statistical technique like maximum likelihood. This is the inner loop.
  2. Model formulation: This defines the form of the autoregressive equation. For example, if our data has a slight trend, no seasonality, and normal residuals, we can decide that a Gaussian ETS will probably do the job. This is the outer loop.

Notice that statistical models only pass the first criterion, the inner loop.

The parameter estimation part is straightforward once we have chosen our model. However, the model formulation part requires human intervention. Therefore, regarding the statistical approaches, the role of the outer loop is not fulfilled.

Thus, we conclude:

N-BEATS replaces the predefined set of rules for model parameter estimation of classical statistical models with a learnable parameter estimation strategy. This strategy allows N-BEATS to generalize well on multiple, unseen time sequences.

Zero-Shot learning results of N-BEATS

In this experimental analysis, the authors enrich the results of the previous benchmark (Figure 8) with some new models:

  • N-BEATS-M4: The authors create an N-BEATS model pretrained on M4.
  • DeepAR-M4: A pretrained DeepAR model on the M4 is also added to the pool of models.

The total results are shown in Figure 10:

Figure 10: Comparison of zero-shot models (Source)

Again, the results are very interesting.

  • Zero-shot N-BEATS-M4 outperforms all the other models on the M3 and Tourism datasets (including the competition winners), even though it was not trained on them.
  • Zero-shot DeepAR-M4 seems to perform poorly. This is expected since DeepAR is not suitable for transfer learning.
  • In every dataset, the zero-shot N-BEATS model performs very well compared to the N-BEATS model trained on that dataset. It would have been interesting to see an effective-robustness vs overall-robustness plot like the ones reported for other zero-shot models such as Whisper.

Closing Remarks

N-BEATS is a breakthrough Deep Learning forecasting model that has left a lasting impact on the time-series field.

In this article, we described the two main strengths of N-BEATS: First, it’s a powerful model that produces SOTA results. Secondly, N-BEATS establishes a well-defined framework for implementing zero-shot transfer learning. To the best of my knowledge, this is the first model to achieve this successfully.

In our next articles, we will present a programming tutorial with N-BEATS, and describe a newer paper, called N-HiTS.

