Recurrent Neural Networks.


Recurrent neural networks (RNNs) are loosely inspired by how the brain handles sequential information. Just as we choose words according to context, subject, gender, and situation, RNNs predict words based on the previous input and context. RNNs are fun to learn, and they have some remarkable applications as well.

We will divide this into three sections:

  1. What is an RNN and why do we need it?
  2. What is Long Short-Term Memory (LSTM) and how does it work?
  3. Applications of RNNs.

We will first talk about the applications to set the context for the next sections. Once you understand where RNNs are used, the concepts are easier to learn.

Applications of RNNs:

RNNs are widely used for text prediction because they capture the context of the text; they can even learn to spell! RNNs are also used in speech recognition, converting speech to text and vice versa. Other applications include text summarization, video tagging, machine translation, and voice search. In short, they can be applied to text, audio, video, speech, and even images.

Recurrent Neural Networks:

[Figure: recurrent neural network diagram. Credit: Wikipedia]

RNNs were introduced to overcome the limitations of feedforward networks. A feedforward architecture cannot make predictions based on previous data because it is unidirectional, and it can only take fixed-size inputs and produce fixed-size outputs. RNNs don't have these problems. An RNN is an artificial neural network with a feedback architecture, meaning the output of the network is fed back into its hidden layers. RNNs are trained with backpropagation (in this setting, backpropagation through time), which uses gradient methods to update the weights. The network remembers its previous predictions and uses them to adjust the weights so that the next word or outcome is predicted better. RNNs represent all data as matrices of numbers, as that is the language they are made to understand.
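The feedback loop described above can be sketched as a toy, scalar-valued RNN step (the weights here are made-up illustrative values, not a trained model; real RNNs use weight matrices over vectors):

```python
import math

def rnn_step(x_t, h_prev, w_xh=0.5, w_hh=0.8, b=0.1):
    """One step of a vanilla RNN: the new hidden state mixes the
    current input with the previous hidden state (the feedback loop)."""
    return math.tanh(w_xh * x_t + w_hh * h_prev + b)

# Feed a short sequence; each hidden state depends on everything seen so far.
h = 0.0
for x in [1.0, 0.5, -0.3]:
    h = rnn_step(x, h)
```

Because `h_prev` is fed back in at every step, the final hidden state carries information from the whole sequence, which is exactly what a feedforward layer cannot do.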

Let us take an example to understand this better. Say you are planning to study 3 subjects, one subject per day, in a fixed order: math -> science -> English literature -> math -> science, and so on. The prediction for today depends on what you studied the previous day. This is not possible with a feedforward network, since it does not remember any previous predictions. This is where RNNs come into the picture: an RNN has a memory that stores the previous prediction. So when you send math as input, you get science as output, and similarly for science and English literature. Now suppose you missed a day. The network then has two pieces of information: what it predicted for yesterday and what you actually studied last time. The RNN has to remember its prediction for the previous day and combine it with the actual input to predict today's subject.

Let us take another example, this time about the ocean food chain: "Sharks feed on tuna fishes.", "Tuna fishes feed on small fishes.", "Small fishes feed on sea plants.", "Sea plants feed on micro-organisms." and "Micro-organisms feed on sharks.". Say you train an RNN on this data and then feed it the input 'sharks'; the model is supposed to produce a sentence that makes sense given the data. The RNN knows 'sharks' is a noun, so the next token could be 'feed' or a period ('.'). But if it picks '.', the sentence is just "Sharks.", which doesn't make any sense. Other unwanted sentences that can be formed are "Sharks feed on sharks.", "Sea plants.", "Sea plants feed on sharks.", and so on. This happens because training only captured which words are nouns or verbs, but not the significance of each possibility with respect to the input. Backpropagation also causes a few other problems:

  1. Vanishing gradient: Backpropagation uses gradient methods to update the weights, but sometimes the gradient becomes so small that the updates barely change the predictions; it almost vanishes. As more layers (or time steps) are added to the network, the gradient of the loss shrinks toward zero, making the network very difficult to train.
  2. Exploding gradient: As the name suggests, this is the opposite of the vanishing gradient. Here the gradient is so large that it causes huge changes in the weights, making the model unstable and impossible to train. So it is important that the gradient update has the right magnitude for training to move in the right direction.
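Both problems can be seen numerically: backpropagating through many time steps multiplies many local derivatives together, so the gradient reaching the earliest steps shrinks or blows up exponentially. A simplified scalar illustration (not a real training loop), including the common clipping remedy for exploding gradients:

```python
def gradient_after(steps, local_derivative):
    """Gradient reaching the first time step is (roughly) the product
    of one local derivative per step."""
    grad = 1.0
    for _ in range(steps):
        grad *= local_derivative
    return grad

print(gradient_after(50, 0.5))  # vanishes toward zero: weights barely move
print(gradient_after(50, 1.5))  # explodes to a huge value: training goes unstable

def clip(grad, max_norm=5.0):
    """Gradient clipping: cap the magnitude so updates stay sane."""
    return max(-max_norm, min(max_norm, grad))
```

Clipping only addresses exploding gradients; vanishing gradients are what LSTMs, described next, are designed to mitigate.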

These problems can be solved using Long Short Term Memory aka LSTMs.

Long Short Term Memory (LSTM):


LSTMs can capture long-term dependencies, in contrast to basic RNNs, which only handle short-term dependencies. A standard RNN has a simple repeating structure with a single tanh layer. An LSTM's repeating module instead contains four interacting layers: three sigmoid layers and one tanh layer (tanh and sigmoid are activation functions). LSTMs have a cell state (shown in the figure above) that carries information across steps on its way to the output, and it is regulated by gates. A gate is a combination of a sigmoid layer and a pointwise multiplication operation; since the sigmoid outputs values between 0 and 1, the gate controls how much information passes through. An LSTM has 3 such gates. LSTMs operate in four main steps, and we will discuss each step in detail:

The following steps are explained with text prediction as the example to make it easier to understand.

Step 1: (Forgetting step)

This step is called the forgetting step because its main job is to decide whether a particular piece of information should be thrown away. It is done by a sigmoid layer, and this gate is called the forget gate. Its inputs are h(t-1) and x(t), and its output f(t) is a value between 0 and 1: a value near 1 means keep the information, and a value near 0 means throw it away. In text prediction, this gate helps remember the gender of the current subject so the right pronouns can be used; when a new subject is received, the old subject's information is thrown away.
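A scalar sketch of the forget gate (the weights are illustrative assumptions; a real LSTM uses weight matrices over the concatenation of h(t-1) and x(t)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forget_gate(h_prev, x_t, w_h=0.7, w_x=0.3, b=0.0):
    """f(t) = sigmoid(w_h * h(t-1) + w_x * x(t) + b).
    Output lies in (0, 1): near 1 keeps the old cell state, near 0 drops it."""
    return sigmoid(w_h * h_prev + w_x * x_t + b)

f_t = forget_gate(h_prev=0.2, x_t=1.0)
```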

Step 2:

This step decides which new information to store in the cell state, and it involves two layers. The first is a sigmoid layer called the input gate layer; it decides which values need to be updated.

The second is a tanh layer, which creates a vector of new candidate values, C̃(t).

The outputs of these two layers are then combined to create an update to the state.
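A scalar sketch of this step (weights are illustrative assumptions): the sigmoid input gate scales how much of the tanh candidate gets added to the cell state.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def input_gate(h_prev, x_t, w_h=0.5, w_x=0.5, b=0.0):
    """i(t): how much of the candidate to let into the cell state."""
    return sigmoid(w_h * h_prev + w_x * x_t + b)

def candidate(h_prev, x_t, w_h=0.4, w_x=0.6, b=0.0):
    """C~(t): proposed new content, squashed into (-1, 1) by tanh."""
    return math.tanh(w_h * h_prev + w_x * x_t + b)

i_t = input_gate(0.2, 1.0)
c_tilde = candidate(0.2, 1.0)
update = i_t * c_tilde  # the contribution that will be added to the cell state
```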

Step 3:

Now the cell state receives the information from the previous steps: it drops what the forget gate said to forget, adds the update from the input gate, and combines the two to get the new cell state c(t).
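This combination is a single line of arithmetic, shown here with illustrative gate values from the earlier steps:

```python
def cell_state_update(c_prev, f_t, i_t, c_tilde):
    """c(t) = f(t) * c(t-1) + i(t) * C~(t):
    keep a fraction of the old state, add a fraction of the candidate."""
    return f_t * c_prev + i_t * c_tilde

# e.g. keep 90% of the old state and add 60% of a candidate of 0.5:
c_t = cell_state_update(c_prev=0.8, f_t=0.9, i_t=0.6, c_tilde=0.5)
```

Because this update is additive rather than a repeated matrix multiplication, gradients flow along the cell state far more easily, which is why LSTMs resist the vanishing-gradient problem.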

Step 4:

This is the final step, where the output is decided. It uses two layers, a sigmoid and a tanh. The sigmoid layer selects which parts of the cell state to include in the output. The cell state is passed through tanh (squashing its values to between -1 and 1), and the outputs of these two layers are multiplied together.

Hence, the output is obtained.
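Putting the four steps together, one full LSTM step can be sketched in scalar form (a deliberately simplified toy: all gates share one illustrative weight for readability, whereas a real LSTM has separate weight matrices per gate):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, w=0.5, b=0.0):
    """One toy scalar LSTM step covering steps 1-4."""
    z = w * h_prev + w * x_t + b
    f_t = sigmoid(z)                    # step 1: forget gate
    i_t = sigmoid(z)                    # step 2: input gate
    c_tilde = math.tanh(z)              #         candidate values
    c_t = f_t * c_prev + i_t * c_tilde  # step 3: new cell state
    o_t = sigmoid(z)                    # step 4: output gate
    h_t = o_t * math.tanh(c_t)          #         new hidden state / output
    return h_t, c_t

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3]:
    h, c = lstm_step(x, h, c)
```

Note that, unlike the vanilla RNN, two quantities flow between steps: the hidden state h(t) and the cell state c(t), and it is the cell state that carries long-term information.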


Further reading:

  1. Understanding LSTM Networks (highly recommended)
  2. The Unreasonable Effectiveness of Recurrent Neural Networks
  3. Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM) by Brandon Rohrer


