DeepAR: Mastering Time-Series Forecasting with Deep Learning
A few years ago, time-series models worked on a single sequence only.
Hence, if we had multiple time-series, one option was to create one model per sequence. Or, if we could “tabularize” our data, we could apply the gradient-boosted tree models — which work very well even today.
The first model that could natively work on multiple time-series was DeepAR, an autoregressive recurrent network developed by Amazon.
In this article, we will see how DeepAR works in-depth and why it is a milestone for the time-series community.
If you want to learn about the other deep learning models that were inspired by DeepAR, check this article:
What Is DeepAR
DeepAR is the first successful model to combine Deep Learning with traditional Probabilistic Forecasting.
Let’s see why DeepAR stands out:
- Multiple time-series support: The model is trained on multiple time-series, learning global characteristics that further enhance forecasting accuracy.
- Extra covariates: DeepAR allows extra features (covariates). For instance, if your task is temperature forecasting, you can include other relevant variables as covariates.
- Probabilistic output: Instead of making a single prediction, the model outputs a full predictive distribution, from which prediction intervals (quantiles) are obtained by sampling.
- “Cold” forecasting: By learning from thousands of time-series that potentially share a few similarities, DeepAR can provide forecasts for time-series that have little or no history at all.
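One way to turn a probabilistic forecast into a prediction interval is to draw samples from the predicted distribution and read off empirical quantiles. The sketch below only illustrates that idea: the Gaussian with mean 10 and standard deviation 2 is a made-up forecast for a single time step, and `prediction_interval` is a hypothetical helper, not part of any DeepAR library.

```python
import random
import statistics

def prediction_interval(samples, lower_pct=10, upper_pct=90):
    """Return the empirical lower/upper percentiles of sampled forecasts."""
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]

rng = random.Random(0)
# Hypothetical forecast distribution for one future time step (assumption)
samples = [rng.gauss(10.0, 2.0) for _ in range(5000)]
low, high = prediction_interval(samples)
print(f"80% prediction interval: [{low:.2f}, {high:.2f}]")
```

More samples give smoother quantile estimates at the cost of more computation.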
LSTMs in DeepAR
DeepAR uses LSTM networks to create probabilistic outputs.
Long Short-Term Memory Networks (LSTMs) are used in numerous time-series forecasting model architectures. For example, we can use:
- Plain LSTMs
- Multi-stacked LSTMs
- LSTMs with CNNs
- LSTMs with Time2Vec
- LSTMs in encoder-decoder topology
- LSTMs in encoder-decoder topology with attention (Figure 1)
Moreover, while it is true that Transformers dominate the NLP field, they don’t decisively outperform LSTMs in time-series-related tasks. The main reason is that LSTMs are more adept at handling local temporal data.
For more information regarding Recurrent networks vs Transformers, check this article.
DeepAR — Architecture
Contrary to the previous models, DeepAR uses LSTMs a bit differently:
Instead of using LSTMs to calculate predictions directly, DeepAR leverages LSTMs to parameterize a Gaussian likelihood function. That is, to estimate the θ = (μ, σ) parameters (mean and standard deviation) of the Gaussian function.
Figure 2 and Figure 3 show the architecture overview of DeepAR in training and inference modes:
Let’s start with training. Suppose we are at the time step t of the time-series i:
- First, the LSTM cell takes as input the covariates x_i,t of the current time step t and the target variable z_i,t-1 of the previous time step t-1. Also, the LSTM receives the hidden state h_i,t-1 of the previous time step.
- Then, the LSTM cell outputs its hidden state h_i,t, which is fed to the next step.
- The μ and σ values are indirectly computed from h_i,t and ‘become’ the parameters of a Gaussian likelihood function p(y_i | θ_i) = ℓ(z_i,t | θ_i,t). The paper denotes those parameters with the Greek letter theta, θ = (μ, σ). Don’t worry if you don’t understand this part; we will explain it later in more detail.
- In other words, the model tries to answer this: what are the best parameters μ and σ that construct a gaussian distribution which outputs predictions as close as possible to the target variable z_i,t?
- This concludes the training step t. The current target value z_i,t and hidden state h_i,t are passed to the next time step and the training process continues. Since DeepAR trains (and predicts) a single data point each time, the model is called autoregressive.
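The training steps above can be sketched in plain Python. This is only an illustration of the data flow: `toy_cell` is a hypothetical one-unit tanh recurrence standing in for the LSTM, and the fixed weights, the linear head for the mean, and the softplus head for the standard deviation are all assumptions made for the example, not DeepAR's actual implementation.

```python
import math

def toy_cell(x_t, z_prev, h_prev, w=(0.5, 0.3, 0.2)):
    """Hypothetical stand-in for the LSTM cell: a single tanh unit."""
    return math.tanh(w[0] * x_t + w[1] * z_prev + w[2] * h_prev)

def gaussian_nll(z, mu, sigma):
    """Negative log-likelihood of z under a Gaussian N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (z - mu) ** 2 / (2 * sigma ** 2)

def training_pass(covariates, targets, z_init=0.0, h_init=0.0):
    """Unroll over time, always feeding the observed previous target."""
    h, z_prev, total_nll = h_init, z_init, 0.0
    for x_t, z_t in zip(covariates, targets):
        h = toy_cell(x_t, z_prev, h)            # h_i,t from x_i,t, z_i,t-1, h_i,t-1
        mu = 2.0 * h                            # assumed linear head for the mean
        sigma = math.log1p(math.exp(0.5 * h))   # assumed softplus head, keeps sigma > 0
        total_nll += gaussian_nll(z_t, mu, sigma)
        z_prev = z_t                            # the true target goes to the next step
    return total_nll

loss = training_pass(covariates=[0.1, 0.4, 0.2], targets=[1.0, 1.2, 0.9])
print(f"negative log-likelihood: {loss:.4f}")
```

In a real model the weights would be updated to reduce this loss; here they stay fixed so that the loop itself is easy to follow.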
The steps for inference are pretty much the same.
One thing changes though: Now, at each inference step t, we use the predicted variable ž_i,t-1 that was sampled in the previous time step t-1 to calculate the new prediction ž_i,t. Both ž_i,t-1 and ž_i,t are now sampled from the gaussian distribution that our model has learned during training. However, our model does not learn the parameters μ and σ directly.
We will see how those parameters are calculated in the next section.
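The inference loop can be sketched the same way. In this hedged example, `toy_cell` and `toy_heads` are hypothetical stand-ins for the trained LSTM and its output layers (their weights are invented), and the covariates for the future steps are assumed to be known in advance. The point is only to show each sampled value being fed back as the next step's input.

```python
import math
import random

def toy_cell(x_t, z_prev, h_prev):
    """Hypothetical stand-in for the trained LSTM cell."""
    return math.tanh(0.5 * x_t + 0.3 * z_prev + 0.2 * h_prev)

def toy_heads(h):
    """Hypothetical stand-in for the learned mapping from h to (mu, sigma)."""
    return 2.0 * h, math.log1p(math.exp(0.5 * h))

def sample_path(future_covariates, z_last, h_last, rng):
    """Sample one trajectory; each draw becomes the next autoregressive input."""
    z_prev, h, path = z_last, h_last, []
    for x_t in future_covariates:
        h = toy_cell(x_t, z_prev, h)
        mu, sigma = toy_heads(h)
        z_prev = rng.gauss(mu, sigma)   # sampled prediction, fed back at the next step
        path.append(z_prev)
    return path

rng = random.Random(0)
paths = [sample_path([0.1, 0.4, 0.2], z_last=0.9, h_last=0.0, rng=rng)
         for _ in range(200)]
# Average the sampled trajectories per time step to get a point forecast
mean_forecast = [sum(step) / len(step) for step in zip(*paths)]
print(mean_forecast)
```

Repeating the loop many times yields a set of trajectories whose spread reflects the forecast uncertainty at each horizon.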
Before delving into how DeepAR’s autoregressive nature works, it is important to understand how the likelihood function works. If you are familiar with this concept, you can skip this section.
The goal of maximum likelihood estimation is to find the parameters of a distribution that best explain our sample data.
Let’s assume our data follow the gaussian (normal) distribution. Each gaussian distribution is parameterized by the mean μ and standard deviation σ, that is θ = (μ, σ). Hence, the gaussian likelihood ℓ, given θ = (μ, σ), is defined as:
ℓ(z | μ, σ) = (2πσ²)^(-1/2) · exp(-(z - μ)² / (2σ²))
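As a concrete illustration, the Gaussian density with mean μ and standard deviation σ can be evaluated directly. This is a minimal sketch; the function name and the example values are chosen for illustration only.

```python
import math

def gaussian_likelihood(z, mu, sigma):
    """Density of the observation z under a Gaussian N(mu, sigma^2)."""
    return math.exp(-(z - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# An observation at the mean is more likely than one three standard deviations away
near = gaussian_likelihood(5.0, mu=5.0, sigma=1.0)
far = gaussian_likelihood(8.0, mu=5.0, sigma=1.0)
print(near, far)
```

The closer an observation lies to μ (relative to σ), the higher its likelihood, which is what the model exploits when fitting μ and σ.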
Now, take a look at Figure 4:
We have the green and orange data points, each following a different Gaussian distribution. Let’s assume you are given those data points and your goal is to estimate their two gaussian distributions.
More formally, the task is to find the best μ and σ of the two distributions that optimally fit those data (DeepAR assumes only one distribution). In statistics, this task is also called maximizing the gaussian log-likelihood function:
L = Σ_{i=1..N} Σ_{t=t_0..T} log ℓ(z_i,t | θ(h_i,t))
The function is maximized over all time steps t from t_0 to T and over all time-series i from 1 to N, with N being the total number of time-series in our dataset.
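The objective is a double sum: the log-likelihood of every observation, added up over all time steps and all time-series. A small sketch, assuming the per-step (μ, σ) pairs are already available (in DeepAR they would come from the network; here they are hard-coded, hypothetical values):

```python
import math

def gaussian_log_likelihood(z, mu, sigma):
    """Log-density of z under a Gaussian N(mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def total_log_likelihood(targets, params):
    """Sum log-likelihoods over every series i and time step t.

    targets[i][t] is z_i,t; params[i][t] is the (mu, sigma) pair for that step.
    """
    return sum(
        gaussian_log_likelihood(z, mu, sigma)
        for series, series_params in zip(targets, params)
        for z, (mu, sigma) in zip(series, series_params)
    )

targets = [[1.0, 2.0], [10.0, 12.0, 11.0]]   # two series of different lengths
good = [[(1.0, 1.0), (2.0, 1.0)], [(10.0, 1.0), (12.0, 1.0), (11.0, 1.0)]]
poor = [[(0.0, 1.0), (0.0, 1.0)], [(0.0, 1.0), (0.0, 1.0), (0.0, 1.0)]]
print(total_log_likelihood(targets, good), total_log_likelihood(targets, poor))
```

Parameters that track the observations give a higher total, so maximizing this sum pushes the predicted distributions toward the data.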
In statistics, the parameters μ and σ are normally estimated using the MLE formulas (maximum likelihood estimators) that are derived by differentiating the likelihood function.
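For a single Gaussian, those closed-form estimators are the sample mean and the population standard deviation (the one that divides by n). A short sketch with made-up data, checking that nearby parameter values score a lower log-likelihood:

```python
import math
import statistics

def gaussian_log_likelihood(data, mu, sigma):
    """Log-likelihood of all points in data under a Gaussian N(mu, sigma^2)."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)
        for z in data
    )

data = [4.8, 5.1, 5.6, 4.9, 5.3]        # made-up sample
mu_hat = statistics.fmean(data)          # MLE of the mean
sigma_hat = statistics.pstdev(data)      # MLE of the standard deviation (divides by n)

best = gaussian_log_likelihood(data, mu_hat, sigma_hat)
shifted = gaussian_log_likelihood(data, mu_hat + 0.1, sigma_hat)
wider = gaussian_log_likelihood(data, mu_hat, sigma_hat * 1.2)
print(mu_hat, sigma_hat, best, shifted, wider)
```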
We don’t do that here.
Instead, we let the LSTM and 2 Dense layers derive those parameters based on the model’s input. This process is shown in Figure 5:
The process of estimating μ and σ is straightforward:
- First, the LSTM calculates its hidden state h_i,t. Then, h_i,t passes through a dense layer W_μ to calculate the mean μ.
- Likewise, the same h_i,t passes through a second dense layer W_σ to calculate the standard deviation σ.
- Now we have the μ and σ. The model creates a gaussian distribution with those parameters and evaluates how likely the actual observation z_i,t is under that distribution (the log-likelihood).
- That concludes training for the time step t. The LSTM weights and the 2 Dense layers W_μ and W_σ are trained during backpropagation.
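The two dense layers and one training update can be sketched as follows. This is an illustration under stated assumptions: the hidden state is a hard-coded, hypothetical vector rather than the output of an LSTM, softplus is used as one common way to keep σ positive, the gradients are derived by hand instead of by automatic differentiation, and only the two heads are updated (a real model would update the LSTM weights as well).

```python
import math

def dot(w, h):
    return sum(wi * hi for wi, hi in zip(w, h))

def heads(h, p):
    """Map the hidden state h to (mu, sigma) with two linear heads."""
    mu = dot(p["w_mu"], h) + p["b_mu"]
    a = dot(p["w_sigma"], h) + p["b_sigma"]
    sigma = math.log1p(math.exp(a))       # softplus keeps sigma positive
    return mu, sigma, a

def nll(z, mu, sigma):
    """Gaussian negative log-likelihood of the observation z."""
    return math.log(sigma) + 0.5 * math.log(2 * math.pi) + (z - mu) ** 2 / (2 * sigma ** 2)

def gradient_step(h, z, p, lr=0.01):
    """One gradient-descent update of both heads on a single observation."""
    mu, sigma, a = heads(h, p)
    d_mu = -(z - mu) / sigma ** 2                       # dNLL/dmu
    d_sigma = 1 / sigma - (z - mu) ** 2 / sigma ** 3    # dNLL/dsigma
    d_a = d_sigma / (1 + math.exp(-a))                  # chain rule through softplus
    return {
        "w_mu": [w - lr * d_mu * hi for w, hi in zip(p["w_mu"], h)],
        "b_mu": p["b_mu"] - lr * d_mu,
        "w_sigma": [w - lr * d_a * hi for w, hi in zip(p["w_sigma"], h)],
        "b_sigma": p["b_sigma"] - lr * d_a,
    }

h = [0.2, -0.1, 0.4]   # hypothetical hidden state h_i,t
z = 3.0                # observed target z_i,t
params = {"w_mu": [0.1, 0.1, 0.1], "b_mu": 0.0,
          "w_sigma": [0.0, 0.0, 0.0], "b_sigma": 0.0}

loss_before = nll(z, *heads(h, params)[:2])
params = gradient_step(h, z, params)
loss_after = nll(z, *heads(h, params)[:2])
print(f"NLL before: {loss_before:.3f}, after: {loss_after:.3f}")
```

Because μ and σ are differentiable functions of the layer weights, the loss gradient flows back through them, which is what makes training by backpropagation possible.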
In other words, DeepAR computes μ and σ indirectly through W_μ and W_σ. This is done to make their calculation possible through backpropagation.
During inference, we do not have a target variable z_i,t to compare. DeepAR has already learned all neural network weights and uses them to create the prediction ž_i,t.
That’s it! We have now seen how DeepAR works end-to-end.
In the following sections, we will explain a few more mechanisms of DeepAR.
Note: The estimated mean and standard deviation parameters are formally symbolized in statistics with a hat, as μ̂ and σ̂.
Dealing with multiple heterogeneous time-series is tricky.
Imagine a product sales forecasting scenario: One product may have sales in the order of hundreds, while a different product can have sales in the order of millions.
This tremendous difference among time-series with different magnitudes could potentially confuse the model. To overcome this, DeepAR introduces an auto-scaling mechanism. More specifically, the model calculates an item-dependent scale factor ν_i to rescale the autoregressive inputs z_i,t. This is given by the following formula (one plus the average of the first t_0 observed values):
ν_i = 1 + (1/t_0) · Σ_{t=1..t_0} z_i,t
Hence, at each time step t, the autoregressive inputs z_i,t-1 from the previous step are first divided by this factor.
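A minimal sketch of this kind of mean-based scaling, assuming the scale factor is one plus the average of the observed history. The two series are made-up examples and `scale_factor` is a hypothetical helper name.

```python
def scale_factor(history):
    """Item-dependent scale: one plus the mean of the observed history."""
    return 1.0 + sum(history) / len(history)

small = [120.0, 95.0, 130.0]        # sales in the hundreds (made-up)
large = [1.2e6, 0.9e6, 1.5e6]       # sales in the millions (made-up)

scaled_series = []
for series in (small, large):
    nu = scale_factor(series)
    scaled = [z / nu for z in series]        # network inputs on a comparable scale
    restored = [s * nu for s in scaled]      # multiply back to the original units
    scaled_series.append((nu, scaled, restored))
    print(nu, scaled)
```

After dividing by the factor, both series take values close to 1, so the network sees inputs of similar magnitude regardless of the original units.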
Note: The auto-scaling mechanism of DeepAR works very well. However, in practice, it is preferable to manually normalize our time-series first. Doing this will enhance our model’s performance.
DeepAR in the Time-Series Landscape
In this section, we discuss how DeepAR competes with other models as well as its limitations.
The authors showed that DeepAR outperformed traditional statistical methods such as ARIMA. Also, the great advantage of DeepAR over those models is that it does not require extra feature preprocessing (e.g., making the time-series stationary first).
Amazon later released an updated version, called DeepVAR, which significantly improves performance. We will describe this model in a future article.
Deep Learning models
Since DeepAR was released, the research community has published numerous deep-learning models for time-series forecasting.
Not all of them can be directly compared to DeepAR because they work differently. To the best of my knowledge, the closest one is the Temporal Fusion Transformer (TFT).
Let’s discuss two notable differences between DeepAR and TFT:
1. Multiple Time-Series
DeepAR calculates a separate embedding for each time-series. This embedding is then used as a feature for the LSTM and helps DeepAR to distinguish the different time-series.
TFT also utilizes LSTMs and works similarly. However, TFT uses those embeddings to configure the initial hidden state h_0 of the LSTM. This approach is much better because TFT properly conditions the LSTM cell on each time-series without altering the temporal dynamics.
2. Type of Forecasting
TFT is not an autoregressive model — it is classified as a multi-horizon forecasting model. Both types of models can output multi-step predictions. However, multi-horizon forecasting models produce predictions in one go, instead of providing them one by one like autoregressive models do.
The advantage of this approach is that multi-horizon forecasting models can create predictions for time steps for which their covariates don’t have any values. TFT excels in this category, as it is one of the most versatile models in terms of feature variety.
DeepAR is a remarkable Deep Learning model that constitutes a milestone for the time-series community.
Also, this model is prevalent in production: It is part of Amazon’s GluonTS toolkit for time-series forecasting and can be trained on Amazon SageMaker.
In the next article, we will use DeepAR to create an end-to-end project.