
# DeepAR: Mastering Time-Series Forecasting with Deep Learning

**A few years ago, time-series models worked on a single sequence only.**

Hence, if we had multiple time-series, one option was to create one model per sequence. Or, if we could “tabularize” our data, we could apply the gradient-boosted tree models — which work very well even today.

The first model that could natively work on multiple time-series was **DeepAR[2]**, an *autoregressive recurrent network* developed by **Amazon**.

In this article, we will see how *DeepAR* works in-depth and why it is a milestone for the time-series community.

If you want to learn about the other deep learning models that were inspired by DeepAR, check this article:

# What Is DeepAR

DeepAR is the first successful model to combine Deep Learning with traditional Probabilistic Forecasting.

Let’s see why *DeepAR* stands out:

- **Multiple time-series support:** The model is trained on multiple time-series, learning global characteristics that further enhance forecasting accuracy.
- **Extra covariates:** *DeepAR* allows extra features (covariates). For instance, if your task is temperature forecasting, you can include `humidity-level`, `air-pressure`, etc.
- **Probabilistic output:** Instead of making a single prediction, the model estimates the parameters of a probability distribution, from which prediction intervals are derived.
- **“Cold” forecasting:** By learning from thousands of time-series that potentially share a few similarities, *DeepAR* can provide forecasts for time-series that have little or no history at all.

# LSTMs in DeepAR

*DeepAR* uses LSTM networks to create probabilistic outputs.

**Long Short-Term Memory Networks** (LSTMs) are used in numerous time-series forecasting model architectures. For example, we can use:

- Plain LSTMs
- Multi-stacked LSTMs
- LSTMs with CNNs
- LSTMs with *Time2Vec*
- LSTMs in encoder-decoder topology
- LSTMs in encoder-decoder topology with *attention* **[3]** (**Figure 1**)

Moreover, while it is true that *Transformers* dominate the NLP field, they don’t decisively outperform LSTMs in time-series-related tasks. The main reason is that LSTMs are more adept at handling local temporal data.

For more information regarding **Recurrent networks vs Transformers**, check this article.

# DeepAR — Architecture

Contrary to the previous models, *DeepAR* uses LSTMs a bit differently:

Instead of using LSTMs to calculate predictions directly, *DeepAR* leverages LSTMs to parameterize a Gaussian likelihood function. That is, to estimate the parameters *θ = (μ, σ)* (*mean* and *standard deviation*) of the Gaussian function.

**Figure 2** and **Figure 3** show the architecture overview of *DeepAR* in *training* and *inference* modes:

Let’s start with training. Suppose we are at time step `t` of time-series `i`:

- First, the LSTM cell takes as input the covariates `x_i,t` of the current time step `t` and the target variable `z_i,t-1` of the previous time step `t-1`. Also, the LSTM receives the hidden state `h_i,t-1` of the previous time step.
- Then, the LSTM cell outputs its hidden state `h_i,t`, which is fed to the next step.
- The values *μ* and *σ* are calculated indirectly from `h_i,t` and ‘become’ the parameters of a Gaussian likelihood function *p(y_i | θ_i) = ℓ(z_i,t | θ_i,t)*, where *θ = (μ, σ)*. Don’t worry if you don’t understand this part; we will explain it later in more detail.
- In other words, the model tries to answer this: what are the best parameters *μ* and *σ* for a Gaussian distribution whose output is as close to the target variable `z_i,t` as possible?
- This concludes the training step `t`. The current target value `z_i,t` and hidden state `h_i,t` are passed to the next time step and the training process continues. Since *DeepAR* trains (and predicts) a single data point each time, the model is called **autoregressive.**
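The training steps above can be sketched in plain Python. This is only an illustration under loud assumptions, not the authors' implementation: a scalar `tanh` recurrence (`toy_cell`) stands in for the LSTM cell, all weights are fixed, hypothetical numbers, the initial state and initial "previous target" are set to zero, and no gradient update is performed. The sketch only shows how the true previous target and the current covariate enter each step, and how the negative log-likelihood accumulates along the series.

```python
import math

def toy_cell(h_prev, z_prev, x_t, w_h=0.5, w_z=0.3, w_x=0.2):
    """Stand-in for the LSTM cell (assumption): mixes previous hidden state,
    previous target and current covariate into a new hidden state."""
    return math.tanh(w_h * h_prev + w_z * z_prev + w_x * x_t)

def gaussian_params(h_t, w_mu=1.5, w_sigma=0.8):
    """Map the hidden state to (mu, sigma). Softplus keeps sigma positive."""
    mu = w_mu * h_t
    sigma = math.log1p(math.exp(w_sigma * h_t))
    return mu, sigma

def gaussian_nll(z, mu, sigma):
    """Negative log-likelihood of z under a Gaussian N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2 * sigma ** 2)

def training_pass(targets, covariates):
    """Unroll over one series, always feeding the observed previous target."""
    h, z_prev, total_nll = 0.0, 0.0, 0.0   # zero initial state is an assumption
    for z_t, x_t in zip(targets, covariates):
        h = toy_cell(h, z_prev, x_t)        # h_i,t from h_i,t-1, z_i,t-1, x_i,t
        mu, sigma = gaussian_params(h)      # theta_i,t = (mu, sigma)
        total_nll += gaussian_nll(z_t, mu, sigma)
        z_prev = z_t                        # the next step sees the true value
    return total_nll

loss = training_pass(targets=[0.2, 0.4, 0.5, 0.7], covariates=[1.0, 0.0, 1.0, 0.0])
print(f"summed negative log-likelihood: {loss:.4f}")
```

In a real model, this summed loss would be minimized with respect to the network weights by backpropagation.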

The steps for inference are pretty much the same.

One thing changes though: now, at each inference step `t`, we use the predicted variable `ž_i,t-1` that was sampled in the previous time step `t-1` to calculate the new prediction `ž_i,t`.
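A minimal sketch of this sampling loop is shown below. The function `toy_step` is a hypothetical stand-in for the trained network (its weights are made-up numbers), and the quantile helper is a simple nearest-rank rule chosen for brevity. The point is the feedback: each sampled value becomes the autoregressive input of the next step, and repeating the procedure many times yields sample paths from which prediction intervals can be read off.

```python
import math
import random

def toy_step(h_prev, z_prev, x_t):
    """Hypothetical trained network: returns (h_t, mu_t, sigma_t)."""
    h_t = math.tanh(0.5 * h_prev + 0.3 * z_prev + 0.2 * x_t)
    mu_t = 1.5 * h_t
    sigma_t = math.log1p(math.exp(0.8 * h_t))   # softplus keeps sigma positive
    return h_t, mu_t, sigma_t

def sample_path(h, z_last, future_covariates, rng):
    """Generate one trajectory, feeding each sample back as the next input."""
    path, z_prev = [], z_last
    for x_t in future_covariates:
        h, mu, sigma = toy_step(h, z_prev, x_t)
        z_prev = rng.gauss(mu, sigma)   # sampled value replaces the observed target
        path.append(z_prev)
    return path

def quantile(sorted_values, q):
    """Nearest-rank quantile of an already sorted list."""
    idx = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return sorted_values[idx]

rng = random.Random(0)
future_x = [1.0, 0.0, 1.0]   # covariates must be known for the forecast horizon
paths = [sample_path(0.1, 0.7, future_x, rng) for _ in range(500)]

for step in range(len(future_x)):
    values = sorted(p[step] for p in paths)
    low, mid, high = (quantile(values, q) for q in (0.1, 0.5, 0.9))
    print(f"step {step}: 10%={low:.3f}  50%={mid:.3f}  90%={high:.3f}")
```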

Remember, the `ž_i,t` are now sampled from the Gaussian distribution that our model has learned during training. However, our model does not learn the parameters *μ* and *σ* directly.

We will see how those parameters are calculated in the next section.

# Gaussian Likelihood

Before delving into how *DeepAR’s* autoregressive nature works, it is important to understand how the likelihood function works. If you are familiar with this concept, you can skip this section.

The goal of maximum likelihood estimation is to find the parameters of a distribution that best explain our sample data.

Let’s assume our data follow the Gaussian (normal) distribution. Each Gaussian distribution is parameterized by its mean *μ* and standard deviation *σ*, that is, *θ = (μ, σ)*. Hence, the Gaussian likelihood ℓ, given *θ = (μ, σ)*, is defined as:

$$\ell(z \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^{2}}} \exp\left(-\frac{(z-\mu)^{2}}{2\sigma^{2}}\right)$$

Now, take a look at **Figure 4**:

We have the green and orange data points, each following a different Gaussian distribution. Let’s assume you are given those data points and your goal is to estimate their two gaussian distributions.

More formally, the task is to find the best *μ* and *σ* of the two distributions that optimally fit those data (*DeepAR* assumes only one distribution). In statistics, this task is also called maximizing the **Gaussian log-likelihood function**:

$$\mathcal{L} = \sum_{i=1}^{N} \sum_{t} \log \ell\left(z_{i,t} \mid \theta_{i,t}\right)$$

The function is maximized over all time steps `t` ∈ `[t…τmax]` and all series `i` ∈ `[1…N]`, with `N` being the total number of time-series in our dataset.
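To make the idea concrete, here is a small sketch for a single Gaussian and a handful of made-up data points (the numbers are illustrative only). It evaluates the log-likelihood at the well-known closed-form estimates, the sample mean and the population standard deviation, and compares it with a few other candidate parameters.

```python
import math
import statistics

def gaussian_log_likelihood(data, mu, sigma):
    """Sum of log N(z | mu, sigma^2) over all observations."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)
        for z in data
    )

data = [4.8, 5.1, 5.3, 4.9, 5.4, 5.0]   # toy observations

# Closed-form maximum likelihood estimates for a Gaussian.
mu_hat = statistics.fmean(data)
sigma_hat = statistics.pstdev(data)      # population std (divides by n)

best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
candidates = [
    (mu_hat + 0.2, sigma_hat),
    (mu_hat, sigma_hat * 1.5),
    (mu_hat, sigma_hat * 0.5),
    (4.0, 1.0),
]

print(f"estimates: mu={mu_hat:.3f}, sigma={sigma_hat:.3f}, log-likelihood={best:.3f}")
for mu, sigma in candidates:
    ll = gaussian_log_likelihood(data, mu, sigma)
    print(f"mu={mu:.3f}, sigma={sigma:.3f} -> log-likelihood={ll:.3f}")
```

Every alternative parameter pair yields a lower log-likelihood than the estimates, which is exactly what "maximizing the likelihood" means.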

# Parameter estimation

In statistics, the parameters *μ* and *σ* are normally estimated using the **MLE formulas** (**m**aximum **l**ikelihood **e**stimators) that are derived by differentiating the likelihood function.

We don’t do that here.

Instead, we let the LSTM and 2 **Dense layers** derive those parameters based on the model’s input. This process is shown in **Figure 5:**

The process of estimating *μ* and *σ* is straightforward:

- First, the **LSTM** calculates its hidden state `h_i,t`.
- Then, `h_i,t` passes through a dense layer `W_μ` to calculate the mean *μ*.
- Likewise, the same `h_i,t` passes through a second dense layer `W_σ` to calculate the standard deviation *σ*.
- Now we have *μ* and *σ*. The model creates a Gaussian distribution with those parameters and checks how likely the actual observation `z_i,t` is under that distribution.
- That concludes training for the time step `t`. The LSTM weights and the 2 Dense layers `W_μ` and `W_σ` are trained during backpropagation.
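The sketch below illustrates how two such heads can be fitted by gradient descent on the negative log-likelihood. It rests on several simplifying assumptions: the hidden state is reduced to a single number, the LSTM is left out entirely (the hidden states are given as toy inputs), the gradients are derived by hand instead of using an autodiff library, and a softplus is applied to the `W_σ` output so that σ stays positive. The data and the learning rate are made up.

```python
import math

def softplus(s):
    return math.log1p(math.exp(s))

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def heads(h, p):
    """Two linear heads on the same hidden state: one for mu, one for sigma."""
    mu = p["w_mu"] * h + p["b_mu"]
    s = p["w_sigma"] * h + p["b_sigma"]   # pre-activation of the sigma head
    return mu, softplus(s), s

def nll(z, mu, sigma):
    """Negative Gaussian log-likelihood of one observation."""
    return 0.5 * math.log(2 * math.pi) + math.log(sigma) + (z - mu) ** 2 / (2 * sigma ** 2)

def gradient_step(data, p, lr=0.05):
    """One full-batch gradient step; returns the mean NLL before the update."""
    grads = {k: 0.0 for k in p}
    loss = 0.0
    for h, z in data:
        mu, sigma, s = heads(h, p)
        loss += nll(z, mu, sigma)
        d_mu = -(z - mu) / sigma ** 2
        # chain rule through softplus, whose derivative is the sigmoid
        d_s = (1.0 / sigma - (z - mu) ** 2 / sigma ** 3) * sigmoid(s)
        grads["w_mu"] += d_mu * h
        grads["b_mu"] += d_mu
        grads["w_sigma"] += d_s * h
        grads["b_sigma"] += d_s
    for k in p:
        p[k] -= lr * grads[k] / len(data)
    return loss / len(data)

# Toy (hidden state, target) pairs: targets scatter by +/- 0.5 around 2 * h.
data = [(0.5, 1.5), (0.5, 0.5), (-0.5, -0.5), (-0.5, -1.5)]
params = {"w_mu": 0.0, "b_mu": 0.0, "w_sigma": 0.0, "b_sigma": 0.0}

losses = [gradient_step(data, params) for _ in range(2000)]
mu, sigma, _ = heads(0.5, params)
print(f"mean NLL: {losses[0]:.3f} -> {losses[-1]:.3f}")
print(f"for h=0.5: mu={mu:.3f}, sigma={sigma:.3f}")
```

Neither μ nor σ is ever set directly here: only the head weights change, and the distribution parameters follow from them.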

In other words, *DeepAR* computes *μ* and *σ* indirectly through `h_i,t`, `W_μ` and `W_σ`. This is done to make their calculation possible through backpropagation.

During inference, we do not have a target variable `z_i,t` to compare. *DeepAR* has already learned all neural network weights and uses them to create the prediction `ž_i,t`.

That’s it! We have now seen how *DeepAR* works end-to-end.

In the following sections, we will explain a few more mechanisms of *DeepAR*.

Note: The estimated mean and standard deviation parameters are formally symbolized in statistics with `μ hat` and `σ hat`.

# Auto Scaling

Dealing with multiple heterogeneous time-series is tricky.

Imagine a **product sales forecasting scenario**: One product may have sales in the order of hundreds, while a different product can have sales in the order of millions.

This tremendous difference among time-series with different magnitudes could potentially confuse the model. To overcome this, *DeepAR* introduces an **auto-scaling mechanism.** More specifically, the model calculates an item-dependent scale factor `ν_i` to rescale the autoregressive inputs `z_i,t`. In the DeepAR paper [2], it is given by the following formula, where `t_0` is the length of the observed history (the conditioning range):

$$\nu_i = 1 + \frac{1}{t_0} \sum_{t=1}^{t_0} z_{i,t}$$

Hence, at each time step `t`, the autoregressive inputs `z_i,t-1` from the previous step are first scaled by this factor (that is, divided by `ν_i`).
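As a rough illustration, the snippet below computes such a scale factor as one plus the mean of the observed history (the form used in the DeepAR paper) for two made-up sales series of very different magnitude. The helper names `scale_factor`, `scale_inputs` and `unscale` are hypothetical. Inputs are divided by the factor before they reach the network, and a predicted mean is multiplied by it to return to the original units.

```python
def scale_factor(history):
    """nu_i = 1 + mean of the values observed in the conditioning range."""
    return 1.0 + sum(history) / len(history)

def scale_inputs(series, nu):
    """Divide the autoregressive inputs by the item-dependent factor."""
    return [z / nu for z in series]

def unscale(value, nu):
    """Bring a value predicted on the scaled series back to original units."""
    return value * nu

small_product = [120.0, 90.0, 150.0, 110.0]                 # sales in the hundreds
large_product = [1_200_000.0, 900_000.0, 1_500_000.0, 1_100_000.0]

for name, series in [("small", small_product), ("large", large_product)]:
    nu = scale_factor(series)
    scaled = scale_inputs(series, nu)
    print(name, f"nu={nu:.1f}", [round(v, 3) for v in scaled])
```

After scaling, both series live in a similar numeric range, so the network sees comparable inputs regardless of the original magnitude.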

Note: The auto-scaling mechanism of DeepAR works very well. However, in practice, it is preferable to manually normalize our time-series first. Doing this will enhance our model’s performance.

# DeepAR in the Time-Series Landscape

In this section, we discuss how *DeepAR* competes with other models as well as its limitations.

## Statistical models

The authors showed that *DeepAR* outperformed traditional statistical methods such as **ARIMA**. Also, the great advantage of *DeepAR* over those models is that it does not require extra feature preprocessing (e.g., making the time-series stationary first).

Amazon later released an updated version, called **DeepVAR[4]**, which significantly improves performance. We will describe this model in a future article.

## Deep Learning models

Since *DeepAR* was released, the research community has published numerous deep-learning models for time-series forecasting.

Not all of them can be directly compared to *DeepAR* because they work differently. To the best of my knowledge, the closest one that I can think of is **Temporal Fusion Transformer (TFT) [5].**

Let’s discuss two notable differences between *DeepAR* and TFT:

**1. Multiple Time-Series**

*DeepAR* calculates a separate embedding for each time-series. This embedding is then used as a feature for the LSTM and helps *DeepAR* to distinguish the different time-series.

TFT also utilizes LSTMs and works similarly. However, TFT uses those embeddings to configure the initial hidden state `h_0` of the LSTM. This approach is much better because TFT properly conditions the LSTM cell on each time-series without altering the temporal dynamics.

**2. Type of Forecasting**

TFT is not an *autoregressive* model: it is classified as a *multi-horizon forecasting model*. Both types of models can output multi-step predictions. However, multi-horizon forecasting models produce predictions in one go, instead of providing them one by one like autoregressive models do.

The advantage of this approach is that multi-horizon forecasting models can create predictions for time steps for which their covariates don’t have any values. TFT excels in this category, as it is one of the most versatile models in terms of feature variety.

# Closing Remarks

*DeepAR* is a remarkable Deep Learning model that constitutes a milestone for the time-series community.

Also, this model is prevalent in production: It is part of Amazon’s **GluonTS [6]** toolkit for time-series forecasting and can be trained on Amazon SageMaker.

In the next article, we will use *DeepAR* to create an end-to-end project.

Stay tuned!
