A visual guide to LSTM and GRU

In our previous article, we saw how an RNN works and briefly touched upon the failure case of a simple RNN. In this article, we take a deep dive into the problem of short-term memory and how Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRU) extend that memory.

Suppose we have a sentence.

The dog which ran away from home has returned.

Whether the verb is singular (‘has’) or plural (‘have’) depends on the word ‘dog’. But the word ‘dog’ is far from ‘has’. Let us see how this distance affects the gradient.

But first, let us look at the tanh activation function and the range of its derivative.

The derivative of tanh, 1 - tanh²(x), is bounded above by 1 and reaches that value only at x = 0.
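
To make the bound concrete, here is a minimal NumPy sketch that evaluates the tanh derivative on a grid. The helper name `tanh_derivative` and the grid range are illustrative choices, not part of the original article.

```python
import numpy as np

def tanh_derivative(x):
    """Derivative of tanh: d/dx tanh(x) = 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

# Evaluate on a symmetric grid around 0 (the range is an arbitrary choice).
xs = np.linspace(-5.0, 5.0, 1001)
derivs = tanh_derivative(xs)

max_deriv = derivs.max()          # largest derivative value on the grid
argmax_x = xs[derivs.argmax()]    # where that largest value occurs

print("largest derivative on the grid:", max_deriv)
print("attained near x =", argmax_x)
print("derivative at x = 3:", tanh_derivative(3.0))
```

Everywhere away from 0 the derivative is strictly below 1, which is what makes long products of these terms shrink.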

Now let us look at our original problem.

Our gradient calculation looks like…

But a⁸ is…

So, using the above value of a⁸, our gradient becomes…

Repeated multiplication of values less than 1 makes the gradient term very small. As the sequence length increases, the gradient is more likely to shrink toward zero. During the weight update, the gradient then becomes so small that the weights barely change. Let us see how LSTMs tackle this problem.
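
The shrinking product can be sketched numerically. The example below assumes a toy scalar RNN, a<t> = tanh(w·a<t-1> + x<t>), with a single recurrent weight `w`. The function name, the value of `w`, and the random inputs are all illustrative assumptions rather than the article's own setup.

```python
import numpy as np

def gradient_through_time(w, inputs, a0=0.0):
    """Track d a<t> / d a<0> for a toy scalar RNN a<t> = tanh(w * a<t-1> + x<t>)."""
    a = a0
    grad = 1.0
    grads = []
    for x in inputs:
        a = np.tanh(w * a + x)
        # Chain rule: the local factor d a<t> / d a<t-1> is w * (1 - a<t>^2).
        grad *= w * (1.0 - a ** 2)
        grads.append(grad)
    return grads

rng = np.random.default_rng(0)
inputs = rng.normal(size=50)           # a made-up sequence of 50 time steps
grads = gradient_through_time(w=0.9, inputs=inputs)

for t in (1, 10, 25, 50):
    print(f"after {t:2d} steps: gradient factor = {grads[t - 1]:.3e}")
```

With |w| below 1, every local factor is smaller than 1 in magnitude, so the contribution of the first time step fades as the sequence gets longer.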


The LSTM architecture uses gates to regulate the flow of information. If information is relevant, it is easily passed forward. Let us look at the architecture.

⊗ : Element-wise multiplication

⊕ : Element-wise addition

The above architecture might look intimidating, but we will break it down piece by piece. In contrast to my previous article, we will denote the hidden state with the letter ‘h’. Unlike a simple RNN, we have an additional input called the cell state, denoted by ‘c<t>’. The cell state can be thought of as the memory of the entire network. If a piece of information is useful, the cell state stores it. Whether information is useful or not is decided by the various gates in the LSTM. These gates learn to keep or discard information based on its relevance.


Each gate in the above LSTM uses a sigmoid activation, which maps any input to a value between 0 and 1. Each gate has learnable parameters that, once trained, decide how much information should pass through the gate.

Forget gate

The forget gate takes the previous hidden state (h<t-1>) and the current input (x<t>) as input, along with its own learnable parameters wf. Together they are fed to a sigmoid activation, which scales the result to a value between 0 and 1. When the output of this sigmoid is multiplied with the cell state of the previous time step, it acts like a gate that controls how much of the previous information is allowed to pass and how much is forgotten.
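
Written as an equation, the forget gate described above could look like the following. This follows the common textbook form; the bias term b_f and the bracket notation for concatenation are assumptions, since the text only names the parameters wf.

```latex
\Gamma_f = \sigma\!\left( W_f \left[ h^{\langle t-1 \rangle},\, x^{\langle t \rangle} \right] + b_f \right),
\qquad
\text{retained memory} = \Gamma_f \otimes c^{\langle t-1 \rangle}
```

Each entry of Гf close to 1 keeps the matching entry of c<t-1>, and each entry close to 0 erases it.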

Input gate

Like the forget gate, the input gate takes the previous hidden state (h<t-1>) and the current input (x<t>), along with its own learnable parameters wi, and feeds them to a sigmoid activation that generates values between 0 and 1. But the role of the input gate is different: its value decides how much new information is added to the current cell state. To do this, we multiply it by the candidate value, which is the tanh-scaled combination of the previous hidden state (h<t-1>) and the current input (x<t>).
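
In equation form, the input gate and the candidate value might be written as below. The bias terms b_i and b_c and the candidate weights W_c are assumptions in the usual textbook form. The candidate is written with the same symbol č<t> that the next section uses.

```latex
\begin{aligned}
\Gamma_i &= \sigma\!\left( W_i \left[ h^{\langle t-1 \rangle},\, x^{\langle t \rangle} \right] + b_i \right) \\
\check{c}^{\langle t \rangle} &= \tanh\!\left( W_c \left[ h^{\langle t-1 \rangle},\, x^{\langle t \rangle} \right] + b_c \right) \\
\text{new information added} &= \Gamma_i \otimes \check{c}^{\langle t \rangle}
\end{aligned}
```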

Cell state

Now we can easily calculate the cell state of the current time step. The output of the forget gate (Гf) is a vector of the same shape as c<t-1>, and the output of the input gate (Гi) is a vector of the same shape as č<t>. So we multiply each pair element-wise and add the two products element-wise to get the current cell state: c<t> = Гf ⊗ c<t-1> ⊕ Гi ⊗ č<t>. If the new information is completely irrelevant and need not be remembered, the input gate gives a value close to 0 and the forget gate gives a value close to 1. In this case, c<t>≅c<t-1>, and the previous state information passes through without being diminished.

Theoretically, information from the first time step can reach the last time step even in a long sentence, provided the forget gate stays close to 1 and the input gate stays close to 0.
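
The two extreme cases can be checked with a few hand-picked numbers. This is only a sketch: the function name `update_cell_state` and all the vectors and gate values below are made up for illustration.

```python
import numpy as np

def update_cell_state(forget_gate, input_gate, c_prev, c_candidate):
    """Cell state update: forget_gate * c_prev + input_gate * c_candidate (element-wise)."""
    return forget_gate * c_prev + input_gate * c_candidate

c_prev = np.array([0.8, -0.5, 0.3])        # made-up previous cell state
c_candidate = np.array([-0.9, 0.7, 0.1])   # made-up candidate value

# Forget gate near 1, input gate near 0: the old memory is carried forward.
keep = update_cell_state(np.full(3, 0.999), np.full(3, 0.001), c_prev, c_candidate)

# Forget gate near 0, input gate near 1: the memory is overwritten by the candidate.
overwrite = update_cell_state(np.full(3, 0.001), np.full(3, 0.999), c_prev, c_candidate)

print("keep     :", keep)
print("overwrite:", overwrite)
```

Because the old cell state is only scaled by a gate value near 1 rather than squashed through tanh at every step, it can survive many time steps almost unchanged.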

Output gate

Like the previous gates, the output gate takes the previous hidden state (h<t-1>) and the current input (x<t>) as input. In the standard LSTM, the output gate value is multiplied by the tanh-scaled current cell state to generate the current hidden state.
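
Putting the gates together, one LSTM time step can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the article's reference code. The parameter names (`W_f`, `b_f`, and so on), the bias terms, the random initialization, and the sizes are all hypothetical, and the hidden state is computed from tanh of the cell state, as in the standard LSTM formulation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; returns the new hidden state and cell state."""
    concat = np.concatenate([h_prev, x_t])                     # [h<t-1>, x<t>]
    gate_f = sigmoid(params["W_f"] @ concat + params["b_f"])   # forget gate
    gate_i = sigmoid(params["W_i"] @ concat + params["b_i"])   # input gate
    gate_o = sigmoid(params["W_o"] @ concat + params["b_o"])   # output gate
    c_cand = np.tanh(params["W_c"] @ concat + params["b_c"])   # candidate value
    c_t = gate_f * c_prev + gate_i * c_cand                    # new cell state
    h_t = gate_o * np.tanh(c_t)                                # new hidden state
    return h_t, c_t

def init_params(n_x, n_h, seed=0):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("f", "i", "o", "c"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_h, n_h + n_x))
        params[f"b_{name}"] = np.zeros(n_h)
    return params

n_x, n_h = 4, 3                                      # input size, hidden size
params = init_params(n_x, n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
sequence = np.random.default_rng(1).normal(size=(5, n_x))   # 5 made-up time steps

for x_t in sequence:
    h, c = lstm_step(x_t, h, c, params)

print("final hidden state:", h)
print("final cell state  :", c)
```

The same `params` dictionary is reused at every time step, which mirrors how the learnable parameters of each gate are shared across the sequence.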

Now you should have some idea about the working of an LSTM. Let us look at a comparatively newer network solving the same problem.


Gated Recurrent Units (GRU) try to achieve what LSTM does using only 2 gates. Like a simple RNN, a GRU has only a hidden state and, unlike LSTM, no separate cell state. Let us look under the hood at the architecture of a GRU.

Reset gate

This gate controls how much information from previous time steps to forget.

Update gate

The update gate decides how much information of previous time steps is to be passed on to the next time step. If a piece of information is important, this gate will allow it to pass through without reducing its value.

Hidden state calculation

The reset gate (Гr) decides how much of the previous hidden state to forget. Its output lies between 0 and 1 and scales h<t-1> element-wise. This scaled value is concatenated with the current input x<t> and passed through a tanh activation to generate a candidate value. This value is similar to the candidate value in LSTM, hence the name. The update gate then decides how important this new information is and generates a value between 0 and 1.

If this new information is very important, the update gate (Гu) will get a value close to 1, so most of the candidate value passes on to the next time step. At the same time, much of the previous hidden state is blocked from passing on, because it is multiplied by 1-Гu.
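
The hidden state calculation described above can be sketched as one GRU time step in NumPy. It is a sketch under assumptions: the parameter names, bias terms, sizes, and random values are hypothetical, and the gating convention follows the text (Гu weights the candidate and 1-Гu weights the previous hidden state). Some libraries define the update gate the other way around.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU time step; returns the new hidden state."""
    concat = np.concatenate([h_prev, x_t])                      # [h<t-1>, x<t>]
    gate_r = sigmoid(params["W_r"] @ concat + params["b_r"])    # reset gate
    gate_u = sigmoid(params["W_u"] @ concat + params["b_u"])    # update gate
    # The reset gate scales the previous hidden state before it is concatenated with x<t>.
    concat_reset = np.concatenate([gate_r * h_prev, x_t])
    h_cand = np.tanh(params["W_h"] @ concat_reset + params["b_h"])  # candidate value
    # Blend: the update gate weights the candidate, (1 - update gate) the old state.
    return gate_u * h_cand + (1.0 - gate_u) * h_prev

def init_params(n_x, n_h, seed=0):
    """Small random weights and zero biases (an arbitrary illustrative choice)."""
    rng = np.random.default_rng(seed)
    params = {}
    for name in ("r", "u", "h"):
        params[f"W_{name}"] = rng.normal(scale=0.1, size=(n_h, n_h + n_x))
        params[f"b_{name}"] = np.zeros(n_h)
    return params

n_x, n_h = 4, 3
params = init_params(n_x, n_h)
x_t = np.random.default_rng(1).normal(size=n_x)
h_start = np.array([0.5, -0.2, 0.1])           # made-up previous hidden state

h_next = gru_step(x_t, h_start, params)

# Push the update gate toward 0 with a large negative bias: the old state passes through.
frozen_params = dict(params, b_u=np.full(n_h, -20.0))
h_frozen = gru_step(x_t, h_start, frozen_params)

print("regular step        :", h_next)
print("update gate near 0  :", h_frozen)
```

The second call shows the other extreme from the paragraph above: when the update gate is near 0, the previous hidden state is carried forward almost unchanged.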

If you look carefully, you can see the connection between the update gate in GRU and the input and forget gates in LSTM: Гu plays the role of the input gate, and 1-Гu plays the role of the forget gate. Using fewer gates reduces the number of parameters and speeds up the computation. Also, we do not apply any nonlinearity to generate the final output.
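
A rough parameter count shows the saving. The sketch below assumes the textbook formulation in which every gate or candidate block has one weight matrix over the concatenated [h<t-1>, x<t>] and one bias vector. Real library implementations may count biases differently, and the layer sizes here are arbitrary.

```python
def gated_rnn_param_count(n_x, n_h, n_blocks):
    """Parameters for n_blocks weight matrices of shape (n_h, n_h + n_x) plus biases."""
    return n_blocks * (n_h * (n_h + n_x) + n_h)

n_x, n_h = 128, 256   # arbitrary input and hidden sizes

# LSTM: forget, input, and output gates plus the candidate -> 4 blocks.
lstm_params = gated_rnn_param_count(n_x, n_h, n_blocks=4)
# GRU: reset and update gates plus the candidate -> 3 blocks.
gru_params = gated_rnn_param_count(n_x, n_h, n_blocks=3)

ratio = gru_params / lstm_params
print("LSTM parameters:", lstm_params)
print("GRU parameters :", gru_params)
print("GRU / LSTM     :", ratio)
```

Under these assumptions a GRU layer has three quarters of the parameters of an LSTM layer of the same size, regardless of the sizes chosen.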

In practice, GRU and LSTM often yield comparable results. A GRU does the same task as an LSTM with fewer parameters and might take less time too. But with enough data, the greater expressive power of LSTM might give better results.

In this article, we saw how LSTM and GRU help alleviate the problem of short-term memory. If you want, please go through the link below, where the above concepts are explained with the help of code.

This article is heavily inspired by the following articles. You can check them out too…

LSTMs and GRUs are still used for many sequence modelling tasks. I hope this article helped you understand LSTM and GRU. You can connect with me at: https://www.linkedin.com/in/baivab-dash

Have a nice day …!

