Transformers oversimplified


Understanding the all-the-rage Transformer network architecture

Deep learning has kept evolving throughout the years, and that is an important reason for its reputation. Deep learning practices heavily emphasize the use of large numbers of parameters to extract useful information from the dataset we're dealing with. With a large set of parameters, it becomes easier to classify or detect something, because the model has more capacity to distinguish patterns.

One notable milestone in the journey of Deep Learning so far, and specifically in Natural Language Processing, was the introduction of Language Models that highly improved the accuracy and efficiency of doing various NLP tasks.

A sequence-to-sequence (seq2seq) model is an encoder-decoder-based model that takes a sequence of inputs and returns a sequence of outputs as its result. Consider image captioning, in which we create a caption for a given image. In that case, the seq2seq model takes the image's pixel vectors (a sequence) as input and returns the caption word by word (a sequence) as output.

Some of the important DL algorithms that powered the training of such models include Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRUs). But over time, these algorithms fell out of use due to their complexity and some disadvantages that badly hurt performance as dataset sizes grew. Important disadvantages include long training times (their sequential nature prevents parallelization), the vanishing gradient problem, which loses information about earlier tokens as we train the model further on long sequences, and the overall complexity of the algorithms.

Attention is all you need

One of the groundbreaking concepts that replaced all the above algorithms for language model training was the multi-headed-attention-based Transformer architecture. The Transformer was first introduced by Google in its 2017 paper "Attention Is All You Need". The main reason it became so popular was the parallelization its architecture introduced: Transformers leveraged the highly powerful TPUs, and parallelized training led to reduced training time.

The transformer architecture looks something like this.

Just kidding — but it would be really cool to see such a visualization of the transformer coming together. It does, in fact, have a cool architecture.

Funnily enough, the entire architecture won't even fit on your screen, even abstracted this much, and there is still so much hiding under the hood of each layer. But we're not getting into the nitty-gritty in this post; we just want an overview of each layer and what it does.

The transformer uses an encoder-decoder structure for the seq2seq task, with the inputs on the left and the outputs on the right. Internally it uses the attention mechanism, which is what made it the paramount architecture for language models.

[ Note: If you don’t feel like going through the entire details, I suggest you jump to the “The overall picture” section which gives all the info in short. ]

Now, as I go through explaining each layer, we'll use a language-translation example with the simple sentence "I am a student" and its French translation, "Je suis étudiant".

Embedding layer

The input embedding is the first step on both the encoder and decoder sides of the transformer. Machines can't understand the words of any language; they eat only numbers. So we get an embedding for each word in the input/output — pre-trained embeddings such as GloVe are readily available. To this embedding value, we add the positional information of that word in the sentence (computed with sine for even embedding dimensions and cosine for odd ones) to give contextual information.
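To make this concrete, here is a minimal NumPy sketch of the sinusoidal positional encoding described above, added to (stand-in) word embeddings. The dimensions are toy values chosen for illustration, and the random embeddings stand in for real pre-trained vectors like GloVe.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: even dimensions use sine,
    odd dimensions use cosine, at position-dependent frequencies."""
    pos = np.arange(seq_len)[:, None]        # (seq_len, 1) word positions
    i = np.arange(d_model)[None, :]          # (1, d_model) embedding dims
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

# "I am a student" -> 4 tokens; d_model shrunk to 8 for readability.
seq_len, d_model = 4, 8
embeddings = np.random.randn(seq_len, d_model)  # stand-ins for GloVe vectors
encoder_input = embeddings + positional_encoding(seq_len, d_model)
```

Because the encoding depends only on position, two identical words at different positions get different inputs, which is exactly the contextual hint the model needs.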

Multi-Head Attention

The multi-head attention layer consists of multiple self-attention layers running side by side. The main purpose of the attention layer is to gather information about how relevant each word is to the other words in the sentence, helping the model perceive the meaning easily. Each word in our sentence depends on the other words for its meaning, but it is not so easy to make machines understand that dependency and relevance.

This is where attention comes into the picture. The attention layer takes three input vectors: the query (Q), the key (K) and the value (V). Think of the query as something you type into a search engine; the engine has a set of pages to match against, which are the keys, and the result you get back is the value. Similarly, for a given word in the sentence (Q), we compute its relevance to the other words (K) and obtain a weighted representation (V) of that dependency. This self-attention process is repeated several times with different weight matrices for Q, K and V — hence the multi-headed attention layer.

This is the hundred-foot view of the attention layer. As a result of the multi-headed attention layer, we get multiple attention matrices.
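As a sketch of what one attention head computes, here is scaled dot-product attention in NumPy — the softmax(QK^T / sqrt(d_k))V formula from the paper. The sizes and random weight matrices are illustrative stand-ins, not trained values.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # stable softmax
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: relevance of each word (query)
    to every other word (keys), used to mix the values."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                 # 4 tokens, model dim 8 (toy size)
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
head = attention(x @ Wq, x @ Wk, x @ Wv)    # one head's (4, 8) output
```

Multi-head attention simply runs this several times with different Wq, Wk, Wv matrices and combines the resulting heads.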

Looking at the architecture, we can see that the decoder has two other attention layers.

Masked multi-head attention

This is the first layer of attention in our decoder end. But why is it a *masked attention*?

Because if, on the output side, the current word had access to all the words coming after it, it wouldn't learn anything — it would directly go ahead and suggest that word for the output. By masking, we hide the words coming after the current word, so the model has room to predict which word makes sense given the current word and the sentence so far. It already has the embedding of the current word and its positional information, so we let it make sense of all the words it has seen previously using the Q, K and V vectors, and figure out the most probable next word.
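The masking itself is simple: before the softmax, positions after the current word are set to negative infinity, so they receive zero attention weight. A minimal sketch, using uniform raw scores for three output tokens just to show the mask's effect:

```python
import numpy as np

def causal_mask(seq_len):
    """Upper-triangular mask: position i may only attend to positions <= i."""
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))                 # uniform raw scores, 3 output tokens
weights = softmax(scores + causal_mask(3))
# First token attends only to itself; the last token sees all three.
```

Since exp(-inf) is 0, the masked-out future words contribute nothing to the weighted sum.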

Encoder-Decoder attention

The next multi-headed attention layer in the decoder takes two of its inputs (K, V) from the encoder output and the third (Q) from the decoder's previous attention layer. It now has access to the attention values from both the input and the output, so it can model interactions between the two languages and learn the relationship between each word in the input sentence and the output sentence.
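Reusing the attention function from earlier, cross-attention is just attention with Q taken from the decoder and K, V taken from the encoder. The shapes below are toy values; the random arrays stand in for real encoder and decoder states.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(5)
encoder_out = rng.normal(size=(4, 8))    # 4 English tokens: "I am a student"
decoder_state = rng.normal(size=(3, 8))  # 3 French tokens generated so far
# Q from the decoder; K and V from the encoder output.
cross = attention(decoder_state, encoder_out, encoder_out)  # (3, 8)
```

Note the output has one row per decoder token, but each row is a mixture of encoder (input-language) information — that is the cross-language interaction described above.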

Residual layer

These attention layers return a set of attention matrices, which are added to the original input (a residual connection), after which layer normalization is performed. This normalization helps smooth out the loss surface, making it easier to optimize while using larger learning rates.
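The "Add & Norm" pattern can be sketched in a few lines. This is a bare-bones layer norm without the learned scale and shift parameters a full implementation would have, and the random arrays are stand-ins for real activations.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
x = rng.normal(size=(4, 8))              # sub-layer input
sublayer_out = rng.normal(size=(4, 8))   # e.g. multi-head attention output
y = layer_norm(x + sublayer_out)         # residual add, then normalize
```

The residual path also means gradients can flow around each sub-layer, which is part of why deep stacks of these blocks remain trainable.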

Feed Forward Layer

In the encoder block, the feed-forward net is a straightforward neural net that takes the combined attention values and transforms them into a form more digestible by the next layer — either another encoder layer stacked on top, or the encoder-decoder attention layer on the decoder side.

In the decoder block, another feed-forward net does the same job, passing the transformed attention values to the next decoder layer on top or to the linear layer.
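A sketch of this position-wise feed-forward net: two linear layers with a ReLU in between, applied to every token independently. The dimensions are shrunk for illustration (the paper uses 512 and 2048), and the random weights are untrained stand-ins.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, applied to each
    token's vector independently of the other tokens."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(2)
d_model, d_ff = 8, 32                    # toy sizes; paper uses 512 / 2048
x = rng.normal(size=(4, d_model))        # all 4 tokens at once
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
out = feed_forward(x, W1, b1, W2, b2)    # (4, d_model)
```

Because each row of `x` is processed independently, the result for one token doesn't depend on the others — which is exactly the per-word independence the next paragraphs describe.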

The magical moment happens in this layer: since each word, along with its attention values, can be passed through the neural net independently of the others, this is where the sweet parallelization comes in.

And because of this, we can feed all the words of the input sentence at the same time, and the encoder can process them all in parallel to produce the encoder output.

Output linear layer and softmax probabilities

After all the decoder-side processing is done, we have the post-processing stage: a linear layer followed by a softmax layer. The linear layer projects the decoder output into a vector the size of the output language's vocabulary. We then apply softmax to get a probability for every word, and take the most probable one. These probabilities are the decoder's prediction of the next possible word.
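A sketch of this final step, with a hypothetical 10-word French vocabulary and random stand-in weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(3)
vocab_size, d_model = 10, 8                      # toy 10-word vocabulary
decoder_out = rng.normal(size=(1, d_model))      # current decoder position
W_out = rng.normal(size=(d_model, vocab_size))   # linear projection to logits
probs = softmax(decoder_out @ W_out)             # one probability per word
next_word = int(probs.argmax())                  # index of most probable word
```

Taking the argmax here is greedy decoding; real systems often use beam search or sampling instead, but the linear-plus-softmax step is the same.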

The overall picture

Now let’s take a quick look at the overall process.


We take each word of the input sentence and pass them all in parallel. We take the word embeddings and add positional information to give context. Then the multi-headed attention layer learns each word's relevance to the others, producing multiple attention vectors. These are combined, and a normalization layer with a residual connection is applied to ease optimization. The result is passed to the feed-forward network, which transforms the values into a form readable by the next encoder on top or by the encoder-decoder attention layer.


The decoder has a similar pre-processing step of word embeddings plus positional context. Then a masked attention layer learns the attention between the output sentence's current word and all the previous words it has seen, while blocking access to the upcoming words. A layer normalization follows. Next, we take the encoder output as the key and value vectors for the following attention layer, with the decoder's attention values as the query. Here the actual interaction between the input and output languages happens, which gives the algorithm its understanding of the translation.

Then another feed-forward network passes the transformed output to a linear layer that projects the attention values to the size of the output vocabulary. A softmax layer then gives the probability of each word in the output language occurring next, and the word with the highest probability becomes the output.

Stacking up of encoders and decoders

It is also effective to stack encoders and decoders, as it leads to better learning of the task and boosts the predictive power of the model. In the original paper, Google stacked six encoders and six decoders. But take care: stacking more layers can cause overfitting and makes the training process expensive.


Transformers have been revolutionary in the field of NLP since the day they were introduced by Google. They were used in the development of various language models, including the highly praised BERT, GPT-2 and GPT-3, which outperformed previous models across language tasks. And learning the base architecture will certainly keep you ahead of the game.

Thanks for reading this! I hope this article gave you an idea of the overall architecture of transformers.

Please feel free to share your thoughts/comments. Let’s make coding fun together!


