Overview Of Vision Transformers Is All You Need


Photo by Joshua Earle on Unsplash

In the history of Transformers in deep learning, everything started with the famous 2017 paper ‘Attention Is All You Need’. In it, the Google Brain team published research that changed the destiny of Natural Language Processing (NLP) by introducing the Transformer.

The idea of using the same technique on images may have opened the door to a new era in vision technology…


  • In this post, I have prepared a general overview of Vision Transformers based on my research. While learning, I wrote notes that answer the following questions:

— — What is Transformer, and why is it used in deep learning?

— — Why and How is Transformer applied to vision?

— — What are the differences between Vision Transformers and CNNs?

— — How is Transformer used in object detection?

— — Are Vision Transformers ready for production?

  • These notes should be useful, especially for those who have heard a lot about Transformers but have not gotten started yet. Let’s start…

What is Transformer, and why is it used in deep learning?

  • The Transformer, a model architecture based on the attention mechanism, was first introduced in 2017 in the paper ‘Attention Is All You Need’ by researchers at Google Brain and their collaborators.
  • Until that point, Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM), and gated recurrent networks were the state-of-the-art choices in Natural Language Processing applications because of their ability to capture information in sequence data. However, these networks have some major drawbacks in NLP tasks.
RNN Architecture. Source: https://stanford.edu/~shervine/teaching/cs-230/cheatsheet-recurrent-neural-networks
  • They struggle to capture the global meaning of a text because they process it sequentially. As you can see in the image of the traditional RNN architecture above, the output of each time step has to be given to the next time step as an input. This causes short-term memory and also prevents parallel training. Even though LSTM increases memory capacity, it is still not enough and remains computationally expensive.
  • On the other hand, the Transformer, and especially its attention mechanism, created breakthroughs in NLP tasks. OpenAI’s GPT-3 models and Google’s BERT model are the best-known examples of this breakthrough.
BERT example by Google. Source: https://blog.google/products/search/search-language-understanding-bert/
  • The main part of the Transformer architecture consists of an encoder and a decoder. Transformers take the whole input, such as a sentence, as embeddings at once, while RNN and LSTM models take the input word by word. These embeddings are then processed in attention blocks.
  • The purpose of the attention block is to extract the relationships and dependencies between the words of the input sentence. This helps the model get a global understanding of the context. In addition to attention mechanisms and embeddings, LayerNorm, feed-forward networks, and softmax functions are also used in this network.
The Transformer Model Architecture. Source: https://arxiv.org/pdf/1706.03762.pdf
  • I’m not going to explain all the details of the model architecture, but I should mention that attention blocks are the most crucial part of Transformers. We’ll see the same architecture in more detail in the next part. The model architecture of the Transformer can be seen above. Further explanations can be found in the original paper.
Attention visualization. Source: https://arxiv.org/pdf/1706.03762.pdf
  • In the example of the attention mechanism above, you see long-distance dependencies in the encoder part of the network. All the attention shown here is for the word ‘making’ only. The different colors represent different attention heads. It is impressive to see that many of the attention heads relate ‘making’ to ‘more’ and ‘difficult’.
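
To make the attention block a bit more concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention, the operation at the core of the paper. The sizes (5 tokens, 8-dimensional embeddings, a 4-dimensional head) and the random weights are assumptions for illustration only; a real model learns these weights and runs several heads in parallel.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    x: (seq_len, d_model) embeddings of the whole input, given at once.
    Returns the attended values and the (seq_len, seq_len) attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Every token is compared with every other token in one matrix product.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights

# Illustrative sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of `weights` says how much token i attends to every other token, which is the kind of pattern the attention visualization above draws for the word ‘making’. Note that nothing here loops over time steps, which is why training can be parallelized.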

Why and how is Transformer applied to vision?

  • As I mentioned, the attention mechanism is the heart of Transformers. This made Transformers a de-facto standard for NLP tasks. Going from local to global understanding…like humans.
  • In 2020, the Google Research Brain team applied almost the same technique to images. They showed that reliance on CNNs is not necessary and that a pure Transformer applied directly to sequences of image patches can perform very well on image classification tasks. They published their research in the paper named ‘An Image Is Worth 16×16 Words’.

The Vision Transformer model can be examined below. Let me explain how this model works.

Model overview. Source: https://arxiv.org/pdf/2010.11929.pdf
  • The first step of the whole process is splitting an image into fixed-size patches, flattening them, and then linearly embedding each of them. You can imagine these patches as words and the whole image as a sentence in NLP.

You might ask why we do not treat every pixel as a patch. The first answer is computational complexity: attention compares every element of the sequence with every other element, so its cost would be very high if each pixel were a patch. Also, pixels at opposite corners or sides of the image usually have no meaningful relationship with each other, so it would be wasteful for the network to compute attention between them.
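
A quick back-of-the-envelope calculation shows how large the gap is. The sketch below counts the tokens and the pairwise attention scores per head for one image; the 224×224 input size is my assumption (a common ImageNet resolution), and the 16×16 patch size comes from the paper title.

```python
def attention_matrix_entries(image_size, patch_size):
    """Return (number of tokens, number of pairwise attention scores)
    for a square image split into square, non-overlapping patches."""
    tokens = (image_size // patch_size) ** 2
    return tokens, tokens ** 2

# patch_size=1 means that every pixel is treated as its own patch.
for patch_size in (16, 1):
    tokens, pairs = attention_matrix_entries(224, patch_size)
    print(f"{patch_size}x{patch_size} patches: {tokens} tokens, "
          f"{pairs} attention scores per head")
```

With 16×16 patches the sequence has 196 tokens; with single pixels it has 50,176 tokens, and the attention matrix grows with the square of that number.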

  • The next step is adding position embeddings to the patch embeddings. The resulting sequence of embeddings is then fed to the Transformer encoder.
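
The two steps described so far (patch embedding and position embedding) can be sketched in a few lines of NumPy. This is only an illustration under my own assumptions: a random 224×224 RGB image, an embedding size of 64, and random matrices standing in for the learned projection and the learned position embeddings. The extra class token used by the real model is left out for brevity.

```python
import numpy as np

def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))   # stand-in for a real image
patch_size, d_model = 16, 64        # d_model is an illustrative choice

patches = image_to_patches(image, patch_size)    # the "words" of the image
# Random stand-ins for the learned linear projection and position embeddings.
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], d_model))
pos_embed = rng.normal(scale=0.02, size=(patches.shape[0], d_model))

tokens = patches @ w_embed + pos_embed   # sequence fed to the encoder
print(patches.shape, tokens.shape)
```

Each row of `tokens` plays the role of one word embedding in an NLP Transformer.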

Before talking about the Transformer encoder, I would like to mention one point. Notice that we combine patches with their positions. However, there is no information about the position of the pixels inside each patch.

There is research that tries to solve this problem by dividing patches into smaller patches. The authors of this research achieved about 81% top-1 accuracy on ImageNet, which is about 1.7% higher than state-of-the-art visual Transformers with a similar computational cost. They call their framework Transformer-iN-Transformer (TNT). You can see their model below and check their paper for details.

TNT framework. Source: https://arxiv.org/pdf/2103.00112.pdf

Let’s continue from the Vision Transformer encoder…

  • The Transformer encoder includes Multiheaded Self-Attention (MSA), MLP blocks, and Layernorm (LN).
  • The purpose of MSA is to extract attention between patches, as in NLP. For each patch of the image, self-attention evaluates the attention between this patch and the other patches of the image. Don’t forget, we call this attention mechanism Multiheaded Self-Attention, and each head has its own attention pattern.
  • Layernorm (LN) is used before every block, and residual connections are applied after every block.
  • An MLP is also used to implement the classification head of the architecture. This head consists of one hidden layer at pre-training time and a single linear layer at fine-tuning time.
Attention map examples. Source: https://arxiv.org/pdf/2010.11929.pdf
  • You can see how the attention mechanism works in the images above. Do you remember the example that displayed how attention heads work in NLP? As in that example, you see here what we get after all attention heads are combined.
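
The encoder layer described in the bullets above (LN before each block, a residual connection after each block, MSA followed by an MLP) can be sketched as follows. This is a simplified sketch under my own assumptions: biases and the learnable LN scale and shift are left out, the weights are random, and the sizes are tiny. The GELU activation in the MLP is the one the ViT paper uses; I use its common tanh approximation here.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its feature dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def multi_head_self_attention(x, w_qkv, w_out, num_heads):
    n, d = x.shape
    d_head = d // num_heads
    q, k, v = np.split(x @ w_qkv, 3, axis=-1)          # each (n, d)

    def split_heads(t):                                 # (n, d) -> (heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split_heads(q), split_heads(k), split_heads(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head) # (heads, n, n)
    heads = softmax(scores) @ v                         # one attention pattern per head
    merged = heads.transpose(1, 0, 2).reshape(n, d)     # concatenate the heads
    return merged @ w_out

def encoder_block(x, params, num_heads):
    # LN -> MSA -> residual connection
    x = x + multi_head_self_attention(
        layer_norm(x), params["w_qkv"], params["w_out"], num_heads)
    # LN -> MLP -> residual connection
    hidden = gelu(layer_norm(x) @ params["w_mlp1"])
    return x + hidden @ params["w_mlp2"]

# Illustrative sizes and random weights (assumptions, not the paper's settings).
rng = np.random.default_rng(0)
n_tokens, d_model, num_heads, d_mlp = 10, 16, 4, 32
params = {
    "w_qkv": rng.normal(scale=0.1, size=(d_model, 3 * d_model)),
    "w_out": rng.normal(scale=0.1, size=(d_model, d_model)),
    "w_mlp1": rng.normal(scale=0.1, size=(d_model, d_mlp)),
    "w_mlp2": rng.normal(scale=0.1, size=(d_mlp, d_model)),
}
x = rng.normal(size=(n_tokens, d_model))
out = encoder_block(x, params, num_heads)
print(out.shape)
```

Because the block maps a sequence to a sequence of the same shape, several of these layers can simply be stacked on top of each other.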

You can find the Google Brain team’s latest developments and models on their GitHub repository. Detailed implementations and explanations about the Vision Transformer can be found there.

What are the main differences between CNNs and Vision Transformers?

  • Vision Transformers have less image-specific inductive bias than CNNs. In 2007, Geoffrey Hinton said that one of the main problems of CNNs comes from the pooling layers in the network. These layers cause a loss of important information from parts of the image and also from the whole image. This also leads to a loss of connection between different parts of the image and results in only a local understanding of it. The self-attention layers of the Transformer, however, are global, which brings a global understanding of the image.
  • According to the official paper of Vision Transformers, different comparisons can be observed between CNNs and Vision Transformers.
  • CNNs work better than Transformers with smaller amounts of data, as you can see in the graphs below. The main reason is their inductive bias. However, if Transformers are fed a large amount of data, they bring better results thanks to their global approach, whereas CNNs are restricted by their local sensitivity.
Performance comparisons. https://arxiv.org/pdf/2010.11929.pdf
  • Also, when memory efficiency is compared, Vision Transformer models, especially the large ones, are more memory-efficient than ResNet models.
Performance comparisons. https://arxiv.org/pdf/2010.11929.pdf
  • When pre-training computation performances are compared for different architectures, Vision Transformers generally surpass ResNets with the same computational budget. However, the Hybrid model brings better results for smaller model sizes.
Total pre-training compute [exaFLOPs]. Source: https://arxiv.org/pdf/2010.11929.pdf

Hybrid models are not explained in detail in this post; I only discuss the DETR framework in the next part. As the name suggests, hybrid models combine Transformers and CNNs. This topic will be the subject of the next post.

  • Another difference between these two approaches is that Vision Transformers are able to learn meaningful information even in the lowest layers, whereas CNNs extract high-level information in their last layers. These differences can be observed by comparing visualized attention maps from Transformers with weights from CNNs.

Transformers in Object Detection

Object detection is one of the main computer vision tasks and, I think, the one most used by engineers, including me. That’s why I have included two well-known object detection approaches that use Transformers in their pipelines.

DEtection TRansformer(DETR)

  • In 2020, Facebook published DETR, the first object detection framework to use a Transformer as a central building block of the detection pipeline. CNNs are also used in this pipeline. The authors achieved results that are competitive with Faster R-CNN.
DETR pipeline. Source: https://arxiv.org/abs/2005.12872
  • Facebook’s DETR is a good example of a hybrid model. It consists of a CNN, a Transformer encoder-decoder, and feed-forward networks.
  • Instead of splitting the input image into patches, a CNN backbone is used to extract features of the image. These features are flattened and combined with positional encodings.
  • The Transformer encoder takes this set of image features as a sequence. As we saw before, the Transformer encoder includes a multi-head self-attention module, a normalization layer, and a feed-forward network. The positional encodings in this network are fixed.
Encoder self-attention. Source: https://arxiv.org/abs/2005.12872
  • As you can see above, thanks to the attention mechanism, Transformer encoders are able to separate objects even in the last encoder layer of the model.
  • In the decoder part, the mechanism is almost the same as in the original Transformer. The main difference is that this model decodes the N embeddings in parallel at each decoder layer. The N input embeddings are learned positional encodings that are referred to as ‘object queries’. The decoder transforms them into output embeddings while attending to the encoder output, and these output embeddings are sent to the feed-forward networks.
  • Feed-forward networks are used for the prediction step. The output embeddings that come from the Transformer decoder are sent to these networks. Each of them then predicts either a detection (class and bounding box) or a ‘no object’ class. You can think of this class as the ‘background’ class in standard object detection models.
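
The last step can be illustrated with a small sketch. It assumes that the prediction heads have already produced, for each of the N decoder outputs, class logits with one extra ‘no object’ entry in the last position, plus a 4-number box. The function name `decode_predictions`, the class names, and all numbers are made up for illustration; this is not code from the official repository.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decode_predictions(class_logits, boxes, class_names):
    """Keep only the queries whose most likely class is a real object.

    class_logits: (N, num_classes + 1); the last index is 'no object'.
    boxes:        (N, 4); one box per query.
    """
    probs = softmax(class_logits, axis=-1)
    labels = probs.argmax(axis=-1)
    no_object = class_logits.shape[-1] - 1
    detections = []
    for label, prob, box in zip(labels, probs, boxes):
        if label != no_object:   # drop 'no object' (background) predictions
            detections.append((class_names[label], float(prob[label]), box.tolist()))
    return detections

# Hypothetical outputs for N = 3 object queries and 2 real classes.
class_names = ["cat", "dog"]
class_logits = np.array([[4.0, 0.0, 0.0],    # confident 'cat'
                         [0.0, 0.0, 5.0],    # confident 'no object'
                         [0.0, 3.0, 0.0]])   # confident 'dog'
boxes = np.array([[0.5, 0.5, 0.2, 0.3],
                  [0.1, 0.1, 0.1, 0.1],
                  [0.7, 0.4, 0.3, 0.5]])

detections = decode_predictions(class_logits, boxes, class_names)
for name, score, box in detections:
    print(name, round(score, 3), box)
```

Since N is fixed and usually larger than the number of objects in an image, most queries end up in the ‘no object’ class.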

Code and pre-trained models can be found in the official GitHub repository of this work.

You Only Look at One Sequence(YOLOS)

  • YOLOS is a version of the Vision Transformer adapted for object detection tasks. Although this approach is not designed to be a high-performance object detector, its performance is promising for future developments.

The YOLOS architecture is very similar to the original Vision Transformer scheme, as you can see below. Notice the ‘Pat-Tok’, ‘PE’, and ‘Det-Tok’ labels.

  • ‘Pat-Tok’ is the embedding of a flattened image patch. ‘PE’ represents the positional embeddings, and ‘Det-Tok’ is a learnable embedding for object binding.
  • In the architecture, the first difference between Vision Transformer and YOLOS is that YOLOS uses 100 randomly initialized learnable detection tokens (‘Det-Tok’) instead of the single learnable class token used for classification.
  • The body of the architecture is the same as the Vision Transformer encoder. Each Transformer encoder layer includes a multi-head self-attention block, layernorm, and an MLP block, as we discussed in the previous parts.
  • MLP heads are used to implement classification and bounding box regression.
YOLOS architecture overview. Source: https://arxiv.org/pdf/2106.00666.pdf
  • The second difference between these two approaches is the loss function. While Vision Transformer uses image classification loss, YOLOS uses bipartite matching loss.
  • You can examine the self-attention map visualization of detection tokens and the corresponding predictions on the heads of the last layer of two different YOLOS-S models.
Self-attention map visualization. Source: https://arxiv.org/pdf/2106.00666.pdf
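
The idea behind a bipartite matching loss is that every ground-truth object is assigned to exactly one prediction before the loss is computed, and the assignment with the lowest total cost wins. Below is a toy sketch of that matching step only. The cost values are invented, and the brute-force search over permutations is my own simplification that only works for tiny examples; real implementations typically use the Hungarian algorithm (for example `scipy.optimize.linear_sum_assignment`) and build the cost from class and box terms.

```python
from itertools import permutations

def bipartite_match(cost):
    """Find the minimum-cost one-to-one assignment of targets to predictions.

    cost[p][t] is the cost of matching prediction p with target t.
    Requires at least as many predictions as targets.
    Returns (sorted list of (prediction, target) pairs, total cost).
    """
    num_preds, num_targets = len(cost), len(cost[0])
    best_total, best_preds = float("inf"), None
    # Try every way of picking a distinct prediction for each target.
    for pred_ids in permutations(range(num_preds), num_targets):
        total = sum(cost[p][t] for t, p in enumerate(pred_ids))
        if total < best_total:
            best_total, best_preds = total, pred_ids
    matches = sorted((p, t) for t, p in enumerate(best_preds))
    return matches, best_total

# Made-up costs for 3 predictions (rows) and 2 ground-truth objects (columns).
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.6]]

matches, total = bipartite_match(cost)
matched_preds = {p for p, _ in matches}
unmatched = [p for p in range(len(cost)) if p not in matched_preds]
print(matches, total, unmatched)
```

Predictions that remain unmatched (prediction 2 in this toy example) are the ones trained toward the ‘no object’ class.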

So, after all this information, I would like to share my thoughts about this question…

Are Vision Transformers ready for production?

Of course, there is no single answer to this question. The right technique for a task depends on the project requirements.

However, there are some basic points that are considered before production in most projects, such as inference time, accuracy, model training requirements, and the deployment process.

So, given the differences between Transformers and CNNs, which one should be chosen for production?

  • Since Transformers require a large amount of data for high accuracy, the data collection process can extend the project time. When less data is available, CNNs generally perform better than Transformers.
  • Training a Transformer appears to take less compute than training a CNN for comparable accuracy. Given this computational efficiency, Transformers can be chosen when the time for model training is limited.
  • The self-attention mechanism can make the developed model easier to interpret. Attention maps can be visualized, and they can guide developers on how to improve the model. This process is harder for CNN-based models, whose weaknesses are difficult to understand.
  • Last but not least, the chosen approach should be straightforward and fast to deploy (if you have no time limits, no problem). Even though there are some frameworks for Transformers, CNN-based approaches are still less complex to deploy.

As I said in the beginning, we cannot give a single answer to this question. Hybrid models have also been developed and perform well. The current state of these approaches should be followed closely. Project requirements and the capabilities of the different approaches should be considered before making decisions.

Since we live in a world where we have more and more data every day, and development never stops, Transformers will become more suitable for deployment in real applications…


  • From image classification to image segmentation, Transformers have become part of computer vision applications. We can add action recognition, image enhancement, super-resolution, and 3D reconstruction tasks to the Transformer’s list.
  • Undoubtedly, we’ll see well-performing, Transformer-based approaches in the future of visual technology as more data comes in.
  • I would like to conclude my post by mentioning the most crucial influence of this approach on me. CNNs were always at the center of my thoughts about the future of computer vision. As a student, maybe my horizon was not wide enough at that time. However, Transformers made me understand, or remember, that new approaches will always come.

This evolution is so exciting, and I am so glad to be part of the AI revolution!

About me

  • I am a Machine Learning Engineer Trainee at Neosperience. I’m pursuing a Master’s degree in Data Science at Universita di Pavia.
  • Neosperience unlocks the power of empathy with software solutions that leverage AI to enable brands to understand, engage and grow their customer base. Reach out at www.neosperience.com.

