# N-BEATS: Time-Series Forecasting with Neural Basis Expansion

**There’s one thing that makes Time-Series Forecasting special.**

It was the only area of Data Science where Deep Learning and Transformers didn’t decisively outperform the other models.

Let’s use the prestigious **Makridakis M-competitions** as a benchmark: a series of large-scale challenges that showcase the latest advances in time-series forecasting.

In the fourth iteration of the competition, known as **M4**, the winning solution was ES-RNN [2], a hybrid LSTM & Exponential Smoothing model developed by Uber. Interestingly, the 6 (out of 57) pure ML models performed so poorly, they barely surpassed the competition baseline.

That changed one year later. *ElementAI* (co-founded by **Yoshua Bengio**) published **N-BEATS [3]**, a pure Deep-Learning model that outperformed the winning ES-RNN model of M4 by 3%. But there’s more.

In this article, we describe in depth:

- **The architecture of** *N-BEATS*, how the model works, and why it is so powerful.
- **How** *N-BEATS* produces interpretable forecasts.
- **How** *N-BEATS* achieves unparalleled zero-shot transfer learning.
- **Why** *ARIMA* cannot natively support transfer learning.

Let’s dive in.

If you are interested in Time-Series Forecasting, check my curated collection of the best Deep Learning models and tutorials.

# What is N-BEATS

N-BEATS is a fast, interpretable DL model that recreates the mechanisms of statistical models using double residual stacks of fully connected layers.

**N-BEATS** stands for **N**eural **B**asis **E**xpansion **A**nalysis for **T**ime **S**eries, a revolutionary model created by **ElementAI**. This company was co-founded by **Yoshua Bengio** and was later acquired by ServiceNow.

*N-BEATS* is an interesting forecasting model because:

- It is the first pure Deep Learning model that outperformed all well-established statistical approaches.
- It provides interpretable forecasts.
- It sets the basis for **Transfer Learning** on time series.

Around that time, **Amazon** published its novel time-series model, known as **DeepAR [4]**. Although *DeepAR* contains Deep Learning components, the model employs a few statistical concepts as well (maximum likelihood estimation).

# N-BEATS — Overview

Let’s briefly discuss a few key traits of *N-BEATS*:

- **Multiple time-series support:** *N-BEATS* can be trained on multiple time series, each one representing a different distribution.
- **Fast training:** The model does not contain any recurrent or self-attention layers, which means faster training and stable gradient flow.
- **Multi-horizon forecasting:** The model produces multi-step predictions.
- **Interpretability:** The authors developed 2 model versions, the *generic* version and the *interpretable* version. The interpretable version can output interpretable predictions regarding trend and seasonality.
- **Zero-shot Transfer Learning:** The model can transfer its knowledge to other time-series datasets with astounding success.

Note: The original N-BEATS implementation by ElementAI works on univariate time-series only. The Darts library has released an updated version that supports multivariate time-series and probabilistic outputs. In this article, we focus on the original version.

# N-BEATS — Generic Architecture

The *N-BEATS* architecture is deep, yet very simple. **Figure 1** displays the top-level view:

Notice 3 things:

- The *block* (blue color): the basic processing unit.
- The *stack* (orange color): a collection of blocks.
- The *final model* (yellow color): a collection of stacks.

Every neural network layer in the model is just a dense (fully-connected) layer.

Let’s start with the first component, the basic block:

## 1. The Basic Block

Suppose `H` is the forecasting horizon. In *N-BEATS*, the lookback window is a multiple of the horizon `H`.
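To make the relation between lookback window and horizon concrete, here is a minimal NumPy sketch that slices a series into (lookback, horizon) training pairs. The function name `make_windows` and the toy data are illustrative assumptions, not part of the official implementation.

```python
import numpy as np

def make_windows(series, horizon, lookback_multiple):
    """Slice a 1-D series into (lookback, horizon) pairs, with lookback = multiple * horizon."""
    series = np.asarray(series, dtype=float)
    lookback = lookback_multiple * horizon
    n_samples = len(series) - lookback - horizon + 1
    if n_samples <= 0:
        raise ValueError("series is too short for this lookback and horizon")
    inputs = np.stack([series[i:i + lookback] for i in range(n_samples)])
    targets = np.stack(
        [series[i + lookback:i + lookback + horizon] for i in range(n_samples)]
    )
    return inputs, targets

# Toy data: one week of hourly values (hypothetical, just 0, 1, 2, ...).
hourly_values = np.arange(24 * 7, dtype=float)
x, y = make_windows(hourly_values, horizon=24, lookback_multiple=3)
print(x.shape, y.shape)
```

Each row of `x` holds 3 horizons of history, and the matching row of `y` holds the 24 values that follow it.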

**Figure 2** displays the architecture of the basic block:

Let’s take a look under the hood.

In **Figure 3**, we use the parameters of the paper’s benchmark from the **Electricity dataset [5]**:

Let’s see what happens here:

- The model looks back `3 days` = `72 hours` = `3 horizons` to predict the power usage of the next 24 hours.
- The block receives the lookback window input.
- The input then passes through a 4-layer neural network.
- The result of this computation is directed to 2 outputs. Here, the **Dense 5** layers estimate the theta parameters (`θ^b` and `θ^f`), which are called **expansion coefficients**.
- These parameters are then linearly projected into a new space using the **basis** layer transformations `g^b` and `g^f` to produce the *backcast* and *forecast* signals. This process is called “**neural basis expansion**”.
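The steps above can be sketched as a forward pass in plain NumPy. This is a minimal, untrained sketch: the hidden width (128), the number of expansion coefficients (16), the ReLU activations, and the class name `GenericBlock` are all assumptions made for illustration and are not taken from the official code.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_dense(n_in, n_out):
    """Random weights and zero bias for one fully connected layer (untrained)."""
    return rng.normal(0.0, 0.05, size=(n_in, n_out)), np.zeros(n_out)

class GenericBlock:
    def __init__(self, lookback, horizon, hidden=128, theta_dim=16):
        sizes = [lookback] + [hidden] * 4
        # The 4-layer fully connected network that processes the lookback window.
        self.fc = [init_dense(a, b) for a, b in zip(sizes[:-1], sizes[1:])]
        # Two linear heads that produce the expansion coefficients.
        self.w_theta_b, _ = init_dense(hidden, theta_dim)
        self.w_theta_f, _ = init_dense(hidden, theta_dim)
        # Learnable basis layers g^b and g^f (plain linear maps in the generic block).
        self.g_b, _ = init_dense(theta_dim, lookback)
        self.g_f, _ = init_dense(theta_dim, horizon)

    def __call__(self, x):
        h = x
        for w, b in self.fc:
            h = np.maximum(h @ w + b, 0.0)  # dense layer followed by ReLU (assumed)
        theta_b = h @ self.w_theta_b        # backcast expansion coefficients
        theta_f = h @ self.w_theta_f        # forecast expansion coefficients
        return theta_b @ self.g_b, theta_f @ self.g_f

block = GenericBlock(lookback=72, horizon=24)
batch = rng.normal(size=(8, 72))            # 8 toy lookback windows
backcast, forecast = block(batch)
print(backcast.shape, forecast.shape)
```

The backcast has the length of the lookback window (72) and the forecast has the length of the horizon (24), matching the Electricity setup described above.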

So, how are the backcast and forecast vectors useful?

The **backcast** signal is the block’s best reconstruction of its own input, given the `g^b` transformation, while the **forecast** signal is the block’s contribution to the prediction, given the `g^f` transformation. When `g^b` and `g^f` take specific forms, the backcast and forecast vectors become **interpretable** (more on that later).

## 2. The Stack

To increase the effectiveness of the *neural expansion process*, the authors stack many blocks together. This structure is displayed in **Figure 4:**

Only the first block receives the original sequence input. Each downstream block receives the residual signal **x_{l+1}** from the previous block, i.e. the previous block’s input minus its backcast (where `l` is the block index, so `l = 1` is the first block in the stack).

Inside each stack, the backcast and forecast signals are organized into two branches. This topology is called **doubly residual stacking** and can be described by the following equations:
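Written in the notation of this section, the two branches read as follows. This is a reconstruction based on the description here and on the N-BEATS paper, where x_l is the input of block l, x̂_l is its backcast, and ŷ_l is its forecast:

```latex
\begin{aligned}
\mathbf{x}_{l+1} &= \mathbf{x}_{l} - \hat{\mathbf{x}}_{l}, \\
\hat{\mathbf{y}} &= \sum_{l} \hat{\mathbf{y}}_{l}.
\end{aligned}
```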

In each block, the model removes from the input **x**_l the part that the backcast signal **x̂**_l has approximated well. In other words:

The model at each block learns to optimally approximate a portion of the input signal, and sends the rest to be approximated by the downstream blocks.

Since each block models only a portion of the input signal, the final forecast is the sum of the *forecast* signals `ŷ` from all blocks.
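The residual bookkeeping can be sketched in a few lines of NumPy. To keep the example short, each block is replaced by a hypothetical stand-in (two random linear maps); only the loop in `run_stack` is the point here.

```python
import numpy as np

rng = np.random.default_rng(1)
lookback, horizon, n_blocks = 72, 24, 3

# Hypothetical stand-ins for trained blocks: (backcast weights, forecast weights).
blocks = [
    (rng.normal(0.0, 0.01, (lookback, lookback)),
     rng.normal(0.0, 0.01, (lookback, horizon)))
    for _ in range(n_blocks)
]

def run_stack(x, blocks):
    """Doubly residual stacking: subtract backcasts, sum forecasts."""
    residual = x
    forecast = np.zeros((x.shape[0], blocks[0][1].shape[1]))
    for w_back, w_fore in blocks:
        backcast = residual @ w_back
        forecast = forecast + residual @ w_fore  # partial forecasts are summed
        residual = residual - backcast           # next block sees what is left
    return residual, forecast

x = rng.normal(size=(4, lookback))
residual, forecast = run_stack(x, blocks)
print(residual.shape, forecast.shape)
```

If a block reconstructed its input perfectly, the residual passed downstream would be zero and the later blocks would have nothing left to explain.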

Finally, the stacks are also stacked (**Figure 4,** right). This architectural choice further increases the depth of the model and enhances its ability to learn complex time sequences.

We have seen how the generic version of *N-BEATS* works. Before describing the *interpretable* version, let’s compare the approach with ARIMA.

# Are N-BEATS and ARIMA related?

If you are familiar with **ARIMA**, you might have noticed a few similarities with the *N-BEATS* approach.

*ARIMA* is modeled using the **Box-Jenkins** method, which is an iterative process. Specifically:

- First, we guess the orders of the **AR()** and **MA()** functions (the `p` and `q` parameters).
- Afterwards, we estimate the coefficients of these functions using e.g. maximum likelihood estimation.
- Then, we verify if the model’s conditions hold. For instance, the model’s residual errors should be normal and independent. If not, we return to step 1 and repeat the process. This time, we add new `p` and `q` degrees on top of the previous ones.
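The iterative loop above can be sketched as follows. This is a deliberately simplified stand-in for Box-Jenkins, not a textbook implementation: it fits only the AR part, uses least squares instead of maximum likelihood, and replaces the residual diagnostics with a crude autocorrelation check. All function names are hypothetical.

```python
import numpy as np

def fit_ar(series, p):
    """Least-squares fit of an AR(p) model; returns coefficients and residuals."""
    y = series[p:]
    lags = np.column_stack([series[p - k:len(series) - k] for k in range(1, p + 1)])
    design = np.column_stack([np.ones(len(y)), lags])  # intercept + p lagged values
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef, y - design @ coef

def residuals_look_white(resid, max_lag=10):
    """Crude diagnostic: every sample autocorrelation lies inside the ~95% band."""
    r = resid - resid.mean()
    denom = np.dot(r, r)
    acf = np.array([np.dot(r[:-k], r[k:]) / denom for k in range(1, max_lag + 1)])
    return bool(np.all(np.abs(acf) < 2.0 / np.sqrt(len(r))))

def select_ar_order(series, max_p=6):
    for p in range(1, max_p + 1):        # step 1: propose an order
        coef, resid = fit_ar(series, p)  # step 2: estimate the coefficients
        if residuals_look_white(resid):  # step 3: check the residuals
            return p, coef
    return max_p, coef                   # stop at the largest order tried

# Toy data: a simulated AR(2) process.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] - 0.3 * series[t - 2] + noise[t]

order, coef = select_ar_order(series)
print(order, coef.round(2))
```

Each pass through the loop refits the model on the same series with a higher order, which is the "look at the residuals, then refine" pattern that the comparison with N-BEATS relies on.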

In other words, in each step of Box-Jenkins, we add more information to our model. Each iteration creates a better representation of the input, based on the model residuals.

Hence, we can conclude that:

In N-BEATS, each successive block models only the residual error due to the reconstruction of the backcast from the previous block and then updates the forecast based on that error. This process mimics the Box-Jenkins method when fitting ARIMA models.

The main difference between the 2 approaches is the target function of the residuals. *ARIMA* focuses on the quality of the residuals, while *N-BEATS* uses an arbitrary loss function.

Plus, we don’t manually adjust any equation with *N-BEATS*: the basis transformations are automatically optimized with backpropagation. With *ARIMA*, however, we make heavy use of autocorrelation and partial autocorrelation plots to guess the order of the `AR()` and `MA()` functions.

Note: Some ARIMA libraries implement the Box-Jenkins method with slight variations, depending on the programming language and the library. Here, we document the textbook implementation.

# N-BEATS — Interpretable Architecture

With a few changes, the *N-BEATS* model can become interpretable. These are:

- We use only 2 stacks, the **trend** and **seasonality** stacks. The generic architecture uses at least 30.
- Both trend and seasonality stacks contain 3 blocks. In the generic architecture, we have one block per stack.
- The **basis** layer weights of `g^b` and `g^f` are shared at the stack level.

The basic idea is that the `g^b` and `g^f` bases take specific forms. Let’s describe them in more detail.

## The Trend Block

Our goal is to constrain the `g^b` and `g^f` functions to be monotonic functions that vary slowly across the forecast window.

Given a time vector `t = [0, 1, 2, …, H−2, H−1]` (`H` is the horizon), the thetas `θ` from the previous layer, and the polynomial degree `p`, the trend model is defined as:
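Written with the symbols above, the trend forecast is a low-degree polynomial in `t`. This is a reconstruction from the N-BEATS paper; note that, as far as I recall, the paper also divides the time vector by `H`, so that it runs from 0 to (H−1)/H:

```latex
\hat{\mathbf{y}}^{\,trend} = \sum_{i=0}^{p} \theta^{f}_{i}\, \mathbf{t}^{\,i}
  = \mathbf{T}\,\boldsymbol{\theta}^{f},
\qquad
\mathbf{T} = \left[\mathbf{1},\ \mathbf{t},\ \mathbf{t}^{2},\ \dots,\ \mathbf{t}^{\,p}\right]
```

Here the powers of `t` are taken element-wise, and `T` simply collects them as columns.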

In other words, we use the architecture of **Figure 3** (the generic block) and swap the last linear layer with the above operation. The result is shown in **Figure 6:**

The backcast equations are not described in the paper, but they can easily be derived from the project’s implementation. Moreover, the trend and seasonality blocks (**Figure 6** and **Figure 7**) adopt the parameters from the official project repo:

    interpretable.seasonality_layer_size = 2048
    interpretable.seasonality_blocks = 3
    interpretable.seasonality_layers = 4
    interpretable.trend_layer_size = 256
    interpretable.degree_of_polynomial = 3
    interpretable.trend_blocks = 3
    interpretable.trend_layers = 4
    interpretable.num_of_harmonics = 1
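To make the trend basis concrete, here is a minimal NumPy sketch of the polynomial forecast basis, using the polynomial degree from the configuration above. The function name `trend_basis`, the normalization of the time vector by `H`, and the example coefficients are assumptions made for illustration.

```python
import numpy as np

def trend_basis(horizon, degree):
    """Rows are t**0, t**1, ..., t**degree on a time grid normalized by the horizon."""
    t = np.arange(horizon, dtype=float) / horizon
    return np.stack([t ** i for i in range(degree + 1)])

horizon, degree = 24, 3
T_f = trend_basis(horizon, degree)         # shape (degree + 1, horizon)

# Hypothetical expansion coefficients; in the model they come from the dense layers.
theta_f = np.array([1.0, 2.0, 0.0, 0.0])
trend_forecast = theta_f @ T_f             # equals 1 + 2 * t
print(trend_forecast[:3])
```

Because the basis is fixed, the block can only output low-degree polynomials, which is exactly what makes its output readable as a trend.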

## The Seasonality Block

Similarly, we swap the final layer with appropriate `g^b` and `g^f` functions that capture seasonality. An excellent candidate would be the Fourier series:
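In the notation used for the trend block, the seasonal forecast can be written as a truncated Fourier series. This is a reconstruction from the N-BEATS paper and assumes the time vector `t` is normalized by the horizon `H`:

```latex
\hat{\mathbf{y}}^{\,seas} = \sum_{i=0}^{\lfloor H/2 - 1 \rfloor}
  \theta^{f}_{i}\,\cos(2\pi i\,\mathbf{t})
  + \theta^{f}_{i + \lfloor H/2 \rfloor}\,\sin(2\pi i\,\mathbf{t})
```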

Then, the architecture becomes:

Again, we stress that in all interpretable stacks, the `g^b` and `g^f` weights within the stack are shared.
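A minimal NumPy sketch of such a Fourier forecast basis follows. The function name `seasonality_basis`, the normalization of the time vector by `H`, and the example coefficients are assumptions for illustration, not the official implementation.

```python
import numpy as np

def seasonality_basis(horizon):
    """Stack cosine rows and sine rows for frequencies i = 0 .. floor(H/2) - 1."""
    t = np.arange(horizon, dtype=float) / horizon
    freqs = np.arange(horizon // 2)
    angles = 2.0 * np.pi * freqs[:, None] * t[None, :]
    return np.concatenate([np.cos(angles), np.sin(angles)])

horizon = 24
S_f = seasonality_basis(horizon)        # shape (2 * floor(H/2), horizon)

# Hypothetical coefficients: switch on only the sine term with i = 1.
theta_f = np.zeros(S_f.shape[0])
theta_f[horizon // 2 + 1] = 1.0
seasonal_forecast = theta_f @ S_f       # one full sine cycle over the horizon
print(seasonal_forecast.round(2))
```

With these coefficients the forecast peaks a quarter of the way through the horizon and bottoms out at three quarters, as a single daily cycle would.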

# Experimental Results

Finally, the authors tested the performance of *N-BEATS* on 3 popular time-series datasets: M3 [6], M4 [7], and Tourism [8].

## Experimental Setup

The authors categorized all models into specific classes and compared *N-BEATS* with the best model of each class. For example, the *DL/TS hybrid* is the winning ES-RNN model on M4.

Since all these datasets were used for data science competitions, all participants relied on ensembling to achieve maximum performance. Hence, the *N-BEATS* authors also relied on ensembling to be comparable. They used three variations: **N-BEATS-G** (generic), **N-BEATS-I** (interpretable), and **N-BEATS-I+G** (ensemble of all models from N-BEATS-G and N-BEATS-I).

On top of that, they created 6 different models with lookback windows `2H, 3H, …, 7H` for every horizon and variation. For extra details about the ensembling configurations, check the original paper. In total, the authors ensembled 180 models to report the final results on the test set.
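One way to arrive at 180 members is sketched below. The breakdown into 6 lookback windows, 3 training losses, and 10 repeated runs, as well as the median aggregation, are my recollection of the paper’s setup and should be treated as assumptions; the member forecasts are random placeholders, not real results.

```python
import itertools
import numpy as np

lookback_multiples = [2, 3, 4, 5, 6, 7]  # lookback windows 2H .. 7H
losses = ["sMAPE", "MASE", "MAPE"]       # assumption: training losses in the ensemble
repeats = range(10)                      # assumption: runs with different random seeds

members = list(itertools.product(lookback_multiples, losses, repeats))
print(len(members))  # 6 * 3 * 10 = 180

# Placeholder forecasts, one row per ensemble member (random numbers, not real output).
rng = np.random.default_rng(0)
horizon = 24
member_forecasts = rng.normal(loc=100.0, scale=5.0, size=(len(members), horizon))

# Aggregate the members point by point with the median (assumed aggregation rule).
ensemble_forecast = np.median(member_forecasts, axis=0)
print(ensemble_forecast.shape)
```

A median is less sensitive than a mean to a single member that diverges, which is a common reason to prefer it in large ensembles.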

## Results

The results for all datasets are shown in **Figure 8:**

The results are quite impressive.

*N-BEATS* outperforms the other approaches on all datasets, with *N-BEATS-I+G* being the most successful.

Note that in every dataset, the competitions use the MAPE, sMAPE, and OWA metrics (lower is better). These metrics are popular in time-series competitions.
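For reference, here is a minimal sketch of MAPE and sMAPE, using the sMAPE definition with the factor of 200 that is common in the M-competitions. OWA is omitted because it also requires a seasonal naive baseline and the MASE metric. The function names and the toy numbers are illustrative.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

def smape(y_true, y_pred):
    """Symmetric MAPE, in percent (ranges from 0 to 200)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 200.0 * np.mean(
        np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))
    )

# Toy values, not taken from any benchmark.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted), smape(actual, predicted))
```

Both metrics are scale-free, which is why they can be averaged across series of very different magnitudes.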

Note: Contrary to the other approaches, N-BEATS does not require any hand-crafted feature engineering or input scaling. Thus, N-BEATS is far easier to use in different time-series tasks.

# Zero-shot transfer learning

The main contribution of *N-BEATS* is its ability to successfully implement transfer learning on time series.

## Preliminaries

**Transfer learning** is a more general term: it refers to how a model can transfer its knowledge across different datasets. This is already established in Computer Vision and NLP tasks: we can download a pretrained model and adjust it to our dataset with fine-tuning.

**Meta-learning** (or few-shot learning) is when the model can adapt to our dataset with little training/fine-tuning. The best scenario is **zero-shot learning**, where the model is not trained on the target dataset.

Zero-shot learning is the model’s ability to make predictions using unseen data, without having specifically trained on them. This learning method better reflects human perception.

Besides, this paradigm shift towards meta-learning has been embraced by recent AI research: OpenAI’s **CLIP [9]** and **Whisper [10]** are two examples.

## Zero-Shot N-BEATS

**Yoshua Bengio** (co-author of *N-BEATS*) has already established the theoretical foundation of transfer learning on forecasting tasks in his previous work [11].

The authors of *N-BEATS* published a follow-up paper [12] which summarizes most of this work, including what requirements a time-series forecasting model should meet to perform efficient transfer learning.

Let’s focus on *N-BEATS*.

The authors state that *N-BEATS*’s meta-learning ability hinges on two procedures: i) the **inner learning** procedure and ii) the **outer learning** procedure. They are shown in **Figure 9:**

**The inner loop** takes place inside each block and focuses on learning task-specific characteristics.

**The outer loop** takes place at the stack level. Here, the model learns global characteristics across all tasks.

In other words, the inner loop learns local temporal traits, while the outer loop learns longer dependencies across all time-series.

However, this raises the following question:

## Why is ARIMA not suitable for Transfer Learning?

If an established paradigm dictates which criteria a forecasting model should meet to be appropriate for transfer learning, then why does ARIMA not meet them?

To answer this question, we will again focus on the two learning procedures described by [12].

When creating an ARIMA model, there are 2 challenges:

- **Parameter estimation:** Parameters are estimated using a statistical technique like maximum likelihood. **This is the inner loop.**
- **Model formulation:** This defines the form of the autoregressive equation. For example, if our series has a little trend, no seasonality, and normal residuals, we can decide that a Gaussian ETS will probably do the job. **This is the outer loop.**

Notice that statistical models only pass the first criterion, the inner loop.

The parameter estimation part is straightforward once we have chosen our model. However, the model formulation part requires human intervention. Therefore, regarding the statistical approaches, **the role of the outer loop is not fulfilled.**

Thus, we conclude:

N-BEATS replaces the predefined set of rules for model parameter estimation of classical statistical models with a learnable parameter estimation strategy. This strategy allows N-BEATS to generalize well on multiple, unseen time sequences.

## Zero-Shot learning results of N-BEATS

In this experimental analysis, the authors enrich the results of the previous benchmark (**Figure 8**) with some new models:

- **N-BEATS-M4:** The authors create a pretrained *N-BEATS* model on the M4 dataset.
- **DeepAR-M4:** A pretrained *DeepAR* model on the M4 dataset is also added to the pool of models.

The total results are shown in **Figure 10**:

Again, the results are very interesting.

- Zero-shot *N-BEATS-M4* outperforms all the other models on the M3 and Tourism datasets (including the winners), even though it was not trained on them.
- Zero-shot **DeepAR-M4** seems to perform poorly. This is expected since *DeepAR* is not suitable for transfer learning.
- In every dataset, the zero-shot *N-BEATS* model performs very well compared to the *N-BEATS* model trained specifically on that dataset. It would have been interesting to see an effective-robustness vs overall-robustness plot like those found for other zero-shot models such as Whisper.

# Closing Remarks

*N-BEATS* is a breakthrough Deep Learning forecasting model that has left a lasting impact on the time-series field.

In this article, we described the two main strengths of *N-BEATS*: First, it’s a powerful model that produces SOTA results. Secondly, *N-BEATS* establishes a well-defined framework for implementing zero-shot transfer learning. To the best of my knowledge, this is the first model to achieve this successfully.

In our next articles, we will present a programming tutorial with *N-BEATS* and describe a newer paper, called *N-HiTS*.
