Transformers: An Overview of the Most Novel AI Architecture

The Model Responsible for your Siri, Alexa and Google Home

Transformers have revolutionized natural language processing, computer vision and image generation. In this article I break down this novel architecture and explain how it works.


In the last few years, a new generation of massive AI models has produced extremely impressive results. Just as deep learning revolutionized a wide variety of sectors over the past decade, this new generation of machine learning models has immense potential. Models like GPT-3 and DALL-E, which rely on Transformers to function, have sparked new products, services and businesses that will add immense value to society.

Transformers are the building blocks of these models. GPT-3, arguably the most advanced natural language processing (NLP) model to date, is able to accomplish a wide variety of NLP tasks in a human-like way. Its architecture is a standard Transformer; what sets it apart is its unprecedented size of 175 billion parameters. The efficiency and simplicity of the Transformer are what have made such massive models feasible. In this article, I want to discuss how Transformers work and why they are so important.

What Sets Transformers Apart

Transformers are a machine learning model architecture, like Long Short-Term Memory networks (LSTMs) and Convolutional Neural Networks (CNNs). This new architecture has some advantages that have allowed Transformers to become the basis for the newest state-of-the-art models.

As I have already alluded to, Transformers can be truly massive. Larger models tend to produce better results, but are more expensive to train. Modern Transformer model sizes would be impractical to reach with earlier architectures. A key difference is that Transformers are highly parallelizable: unlike a recurrent network, which processes a sentence one word at a time, a Transformer processes all the words in a sequence at once during training. This makes them well suited to modern hardware and allows us to train extremely large models.

The second thing that sets Transformers apart is their clever use of attention. To describe how they do this, I will focus on the original paper where Transformers were introduced: Attention is All You Need [1].

Attention is All You Need

Paper by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, submitted in June 2017

Transformers consist of a simple architecture that uses attention cleverly. A Transformer consists of an encoder and a decoder. In a translation task, the encoder takes the sentence to translate, vectorizes it, and transforms it using attention. The decoder does the opposite, going from the vectorized representation back to a sentence. Because the decoder can produce its output in a different language, Transformers are powerful in translation tasks.

I’ll now describe how Attention works, then how Multi-Head Attention works, and finally I’ll talk about how a Transformer uses these.


Attention is the key to Transformers and why they are such a strong architecture. Attention layers are very efficient, presenting lower complexity than their alternatives:

Table I: Complexity and maximum path length of kinds of layers (Image from Attention is All you Need [1])

Self-Attention has a shorter maximum path length than its counterparts, and a lower per-layer complexity than recurrent and convolutional layers whenever the sequence length n is smaller than the representation dimension d. This is often the case for sentences: d in the paper was taken as 512.
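To make the comparison concrete, here is a small sketch that plugs numbers into the per-layer complexities listed in Table I of the paper: O(n² · d) for self-attention, O(n · d²) for recurrent layers and O(k · n · d²) for convolutional layers. The function name and the kernel width k = 3 are assumptions chosen for illustration, and constant factors are ignored.

```python
def per_layer_operations(n, d, k=3):
    """Rough per-layer operation counts from Table I (constants dropped).

    n: sequence length, d: representation dimension,
    k: convolution kernel width (k=3 is an illustrative assumption).
    """
    return {
        "self_attention": n * n * d,     # O(n^2 * d)
        "recurrent": n * d * d,          # O(n * d^2)
        "convolutional": k * n * d * d,  # O(k * n * d^2)
    }


# Example: a 50-token sentence with the paper's d = 512.
costs = per_layer_operations(n=50, d=512)
for layer_type, ops in costs.items():
    print(f"{layer_type:>15}: {ops:,}")
```

With n = 50 and d = 512, self-attention needs roughly ten times fewer operations than the recurrent layer. The advantage disappears once n grows larger than d.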

How does an Attention Layer Work

Attention layers are found in the encoder and the decoder part of the Transformer. Later on, I’ll describe where these layers fit in within the Transformer architecture.

Prior to reaching the attention layer, the input text is first split into tokens, and each token is mapped to a representative embedding vector. From these embeddings, the encoder will produce representative encodings using attention, and the decoder will do the opposite, outputting text.
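As a rough illustration of this preprocessing step, the sketch below turns a sentence into a list of embedding vectors. Everything in it is a simplifying assumption: a whitespace tokenizer, a tiny vocabulary built from the sentence itself, and a random embedding table with 4 dimensions instead of 512. Real models use learned subword tokenizers and learned embeddings.

```python
import random

random.seed(0)  # reproducible toy values

sentence = "the cat sat on the mat"
tokens = sentence.split()  # assumption: naive whitespace tokenizer

# Hypothetical vocabulary mapping each distinct token to an integer id.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}

d_model = 4  # toy embedding size; the paper uses 512

# One vector per vocabulary entry (learned in a real model, random here).
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in vocab
]

token_ids = [vocab[t] for t in tokens]
embeddings = [embedding_table[i] for i in token_ids]

print(token_ids)                           # [4, 0, 3, 2, 4, 1]
print(len(embeddings), len(embeddings[0])) # 6 4
```

Both occurrences of "the" map to the same id and therefore to the same embedding vector.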

The first stage of the Attention layer consists of 3 small neural networks (in the paper, each is a single learned linear projection). These take the embedded words in the sentence as input, and for each word, produce 3 output vectors: a Query, a Key and a Value. Doing this for every word in a sentence of n words results in n Queries, n Keys and n Values.

The second stage consists of calculating the scores for each word. Each word will have n scores, one per word in the sentence, which are then cleverly combined. This allows the model to take into account all other words in the sentence to understand each word’s meaning.

Looking at the first word in a sentence, it will be assigned a query, a key and a value. We want to calculate how much every word in the sentence matters to this first word. To do so, we take the dot product of the first word’s query vector with each of the key vectors of the sentence. The dot product here can be thought of as a measure of similarity. If the query and key vectors are aligned, their dot product will produce a large value. If the query and key vectors are close to orthogonal, the result of the dot product will approach zero.
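A minimal sketch of this first stage is shown below. It assumes each of the three networks is a single weight matrix without bias, which matches the linear projections used in the paper. The helper names (random_matrix, project), the toy sizes and the random weights are illustrative assumptions; in a trained model the weights are learned.

```python
import random

random.seed(0)


def random_matrix(rows, cols):
    """Stand-in for a learned weight matrix (random here, trained in practice)."""
    return [[random.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vectors, weights):
    """Multiply each row vector by a weight matrix of shape (d_in, d_out)."""
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in zip(*weights)]
        for vec in vectors
    ]


n, d_model, d_k = 3, 4, 2               # toy sizes: 3 words, 4-dim embeddings
embeddings = random_matrix(n, d_model)  # stand-in for n embedded words

W_q = random_matrix(d_model, d_k)  # query projection
W_k = random_matrix(d_model, d_k)  # key projection
W_v = random_matrix(d_model, d_k)  # value projection

queries = project(embeddings, W_q)
keys = project(embeddings, W_k)
values = project(embeddings, W_v)

print(len(queries), len(keys), len(values))  # 3 3 3
```

Each of the n words ends up with its own Query, Key and Value vector, all produced from the same embedding by three different projections.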

Score weights for Word 1 (Image by Author)

In the image above, the dot products between the query from the first word and all keys in the sentence are used to produce the scores.
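The scoring step can be sketched in a few lines. The query and key values below are made up for illustration: one key is aligned with the query, one is orthogonal to it, and one points partly in the opposite direction.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


query_1 = [1.0, 0.0]  # query of the first word (toy values)
keys = [
    [2.0, 0.0],   # aligned with the query: large score
    [0.0, 3.0],   # orthogonal to the query: zero score
    [-1.0, 1.0],  # partly opposed: negative score
]

scores = [dot(query_1, key) for key in keys]
print(scores)  # [2.0, 0.0, -1.0]
```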

The third step is to scale these scores (the paper divides them by the square root of the key dimension) and apply SoftMax, which makes them positive and makes their sum equal to 1.

Just a quick recap: Attention starts off with 3 neural networks, outputting Queries, Keys and Values for each word. So far we have used the Query and Key vectors to produce a set of weights for each word, which indicate how much attention we should pay to every word in the sentence when looking at a particular one.
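Here is a sketch of this step, assuming that "normalize" refers to the scaling used in the paper, where each score is divided by the square root of the key dimension d_k before SoftMax. The function names and the example scores are illustrative.

```python
import math


def softmax(xs):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(xs)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, d_k):
    """Scale the raw scores by sqrt(d_k), then apply softmax."""
    scaled = [s / math.sqrt(d_k) for s in scores]
    return softmax(scaled)


weights = attention_weights([2.0, 0.0, -1.0], d_k=2)
print(weights)
print(sum(weights))
```

The largest score receives the largest weight, and the weights always add up to 1 (up to floating-point rounding).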

The fourth step is to use the Value vectors, and scale them using the scores computed in the previous steps. Finally we sum up the resulting scaled value vectors, giving the attention vector for each word.

Attention Vector for Word 1 (Image by Author)

In the image above, the scores for each word based on query 1 are used to scale the values. The scaled values are then summed to produce the final vector.
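The scale-and-sum step looks like this in code. The weights and Value vectors are made-up toy numbers, chosen so the arithmetic is easy to check by hand.

```python
def weighted_sum(weights, values):
    """Scale each value vector by its weight and add the results together."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


weights = [0.5, 0.25, 0.25]  # attention weights for word 1 (sum to 1)
values = [
    [4.0, 0.0],  # value vector of word 1
    [0.0, 8.0],  # value vector of word 2
    [4.0, 4.0],  # value vector of word 3
]

attention_vector_1 = weighted_sum(weights, values)
print(attention_vector_1)  # [3.0, 3.0]
```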

This is the attention vector for Query 1 (from the first word). You would do this for every query, resulting in n vectors, and those are the outputs of the attention layer. Note that each resulting vector depends on the Query of only one word, but on the Keys and Values of all the words in the sentence. This is what makes Attention powerful in sequential tasks, as you’re able to embed context about the whole input into each representation.

I now want to show the diagram from the paper illustrating self-attention.

Image from Attention is All you Need [1]

As you can see, the Queries and Keys are used to produce the scores, which are then applied to the Value vectors to produce the final output vector.
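Putting the steps together, the whole layer can be sketched as one function. This is a plain-Python illustration that favors readability over speed; the name scaled_dot_product_attention follows the paper's terminology, the division by the square root of the key dimension follows the paper, and the toy inputs are assumptions. Real implementations use matrix operations on an array library.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return one attention vector per query."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Score the query against every key, then scale.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        # Turn scores into weights that sum to 1.
        weights = softmax(scores)
        # Weighted sum of all value vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d_v)]
        )
    return outputs


queries = [[0.0, 0.0], [1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
result = scaled_dot_product_attention(queries, keys, values)
print(result[0])  # [1.0, 1.0]: a zero query weighs all words equally
print(result[1])  # leans towards the first value vector
```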

Multi-Head Attention

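So far we have looked at a single-head attention layer. A multi-head attention layer is an expansion of single-head attention that runs several attention computations in parallel, letting the model attend to different aspects of the sentence at once. Because each head works on smaller vectors, the total cost stays similar to that of a single full-size head.

For single-head attention, the embedding vector for each word is 512-dimensional. In multi-head attention, the Query, Key and Value vectors are split into 8 heads of 64 dimensions each (in the paper, each head gets its own learned projection down to 64 dimensions), and the attention layer is applied to each head the same way as I showed before. The resulting attention vectors for each head are then concatenated to form 512-dimensional attention vectors, which the paper passes through one more learned linear layer.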
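The split, attend and concatenate bookkeeping can be sketched as follows. This follows the simplified description above, slicing each 512-dimensional vector into 8 chunks of 64; the paper instead gives every head its own learned projections, which are left out here. The function names and the toy inputs are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


def split_heads(vectors, num_heads):
    """Slice every vector into num_heads equal chunks; one list per head."""
    head_dim = len(vectors[0]) // num_heads
    return [[v[h * head_dim:(h + 1) * head_dim] for v in vectors]
            for h in range(num_heads)]


def multi_head_attention(queries, keys, values, num_heads):
    per_head = [
        attention(q, k, v)
        for q, k, v in zip(split_heads(queries, num_heads),
                           split_heads(keys, num_heads),
                           split_heads(values, num_heads))
    ]
    # Concatenate the head outputs back together, word by word.
    return [[x for head in per_head for x in head[i]]
            for i in range(len(queries))]


d_model, num_heads = 512, 8
words = [[float(i) for i in range(d_model)] for _ in range(3)]  # 3 toy words
heads = split_heads(words, num_heads)
print(len(heads), len(heads[0]), len(heads[0][0]))  # 8 3 64
output = multi_head_attention(words, words, words, num_heads)
print(len(output), len(output[0]))  # 3 512
```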

Image from Attention is All you Need [1]

Multi-head attention is merely an expansion to the attention layer I described in detail previously, and you don’t need to understand it perfectly to understand the role of attention in the larger context of a Transformer.

The Transformer

Now that you understand what attention layers look like, I’ll finish by describing the Transformer architecture:

Fig. 1: Transformer model architecture (Image from Attention is All you Need [1])

The Transformer is composed of the Encoder (left block) and the Decoder (right block)

Encoder: The encoder is the block on the left. This block consists of a stack of N identical layers (in the paper N = 6). Each layer contains a multi-head attention layer, followed by a fully connected feed forward neural network. The output of each encoder layer is used as an input to the subsequent encoder layer. All layer outputs are the same dimensionality (512, just like the embedding).
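To show how the layers chain together, here is a heavily simplified sketch of the encoder stack. It keeps only what is described above: attention, then a feed forward network, repeated N = 6 times with matching input and output sizes. It is not the paper's full layer: it uses a single head with no Query, Key or Value projections, random rather than learned weights, toy dimensions, and it leaves out the other components of the paper's architecture. All names are illustrative assumptions.

```python
import math
import random

random.seed(0)


def random_matrix(rows, cols):
    """Stand-in for learned weights (random here)."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x (projections omitted)."""
    d = len(x[0])
    out = []
    for q in x:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x])
        out.append([sum(w * v[i] for w, v in zip(weights, x)) for i in range(d)])
    return out


def feed_forward(x, w1, w2):
    """Two linear maps with a ReLU in between, applied to every word."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def encoder_stack(x, layers):
    """The output of each layer becomes the input of the next one."""
    for w1, w2 in layers:
        x = feed_forward(self_attention(x), w1, w2)
    return x


N, d_model, d_ff, n = 6, 8, 16, 3  # toy sizes; the paper uses d_model = 512
layers = [(random_matrix(d_model, d_ff), random_matrix(d_ff, d_model))
          for _ in range(N)]
x = random_matrix(n, d_model)  # stand-in for 3 embedded words
encoded = encoder_stack(x, layers)
print(len(encoded), len(encoded[0]))  # 3 8
```

Because every layer returns vectors of the same size it received, any number of layers can be stacked.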

Decoder: The decoder stack also consists of 6 identical layers. Each decoder layer has 2 multi-head attention layers, followed by a feed forward neural network. In each decoder layer, the input to the first attention layer is the output from the previous decoder layer; this layer is masked so that each position can only attend to earlier positions in the output. The second attention layer takes its Queries from the first attention layer and its Keys and Values from the output of the encoder stack.
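The second attention layer of the decoder can be sketched with the same attention function used in the encoder. In the paper, the Queries come from the decoder while the Keys and Values come from the encoder output, so the two sequences may have different lengths. The variable names and toy values below are assumptions, and the learned projections are again left out.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: one output vector per query."""
    d_k = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out


# Encoder output for a 3-word source sentence (toy values).
encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Decoder states for the 2 target words produced so far (toy values).
decoder_state = [[2.0, 0.0], [0.0, 2.0]]

# Queries from the decoder, Keys and Values from the encoder.
context = attention(decoder_state, encoder_output, encoder_output)
print(len(context), len(context[0]))  # 2 2
```

The result has one vector per decoder position, and each of those vectors is a weighted mix of the encoder outputs.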

Training and Results

The Transformer was trained on English-to-German and English-to-French translation tasks (WMT 2014) and achieved a new state of the art on both. The largest model was trained for 3.5 days on 8 GPUs. The models achieved better results than previously published models, at a fraction of the training cost.

The results were groundbreaking, and Transformers have since been used in many of the most advanced models the industry has produced.


In this article I explained how Transformers work. I took the original Transformer paper (Attention is All You Need) and broke it down into simple, understandable steps. The Transformer has changed the game in Natural Language Processing tasks. What sets it apart from other models is its creative use of attention, and the fact that it is highly parallelizable, making it very efficient to train. In future articles, I look forward to discussing how the Transformer has been integrated into other complex model architectures, and how these have revolutionized fields such as image and sound generation.


[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, “Attention Is All You Need”, NIPS, 2017.
