Transformer and Vision Transformer
First, I would like to explain the Transformer Encoder used in the Vision Transformer. The Transformer is a model proposed in the paper “Attention Is All You Need”. The title was provocative to those who had been using LSTMs and CNNs.
It relies on neither a CNN nor an LSTM, but on a mechanism called dot-product Attention, and the model built on it (the Transformer) outperformed existing methods by a large margin.
The (dot-product) Attention used in the Transformer involves three variables: Query, Key, and Value. Simply put, it calculates an Attention Weight between the Query and each Key word, and then takes the weighted sum of the Values associated with those Keys.
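The computation described above can be sketched as follows; this is a minimal NumPy illustration of scaled dot-product attention (softmax(QKᵀ/√d_k)V), with toy random matrices standing in for learned features:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # similarity of each Query to each Key
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)      # softmax -> Attention Weights (rows sum to 1)
    return weights @ V                               # weighted sum of Values

# Toy example: 3 queries, 4 key/value pairs, dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

Each output row is a mixture of the Value vectors, mixed according to how strongly its Query matches each Key.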
Multi-Head Attention, which uses multiple Attention Heads in parallel (in MLP terms, this is like increasing the number of hidden units), is defined as follows. The (Single-Head) Attention in the figure above uses Q and K as they are, but in Multi-Head Attention each Head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and the features projected by these matrices are used to compute the Attention.
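A minimal sketch of this per-head projection scheme, again in NumPy with random matrices in place of learned parameters W_i^Q, W_i^K, W_i^V, and the output projection W^O:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, num_heads, rng):
    """Each head projects X with its own W_i^Q, W_i^K, W_i^V, attends,
    then the head outputs are concatenated and projected by W^O."""
    n, d_model = X.shape
    d_k = d_model // num_heads
    heads = []
    for _ in range(num_heads):
        Wq = rng.standard_normal((d_model, d_k))   # W_i^Q (learned in practice)
        Wk = rng.standard_normal((d_model, d_k))   # W_i^K
        Wv = rng.standard_normal((d_model, d_k))   # W_i^V
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        A = softmax(Q @ K.T / np.sqrt(d_k))        # this head's attention weights
        heads.append(A @ V)
    Wo = rng.standard_normal((num_heads * d_k, d_model))  # W^O
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 16))   # 5 tokens, model dimension 16
out = multi_head_attention(X, num_heads=4, rng=rng)
print(out.shape)  # (5, 16)
```

Because each head has its own projections, each can attend to a different kind of relationship between tokens, which is exactly the behavior visualized in the attention-weight figures below.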
When the Q, K, and V used in this dot-product Attention are all derived from the same data, it is called Self-Attention. In the Transformer, the Encoder and the lower part of the Decoder use Self-Attention. The upper part of the Decoder is not Self-Attention, because the Query comes from the Decoder while the K and V come from the Encoder. The following figure shows an example of attention weights: the word “making” is used as a query, and the Attention Weight for each key word is calculated and visualized. Each attention head learns different dependencies; the key words are colored in multiple ways to represent the Attention Weight of each head.
How Vision Transformer works
The Vision Transformer is a model that applies the Transformer to the image classification task, and was proposed in October 2020. Its architecture is almost the same as the original Transformer's, but there is an ingenious way to handle images in the same manner as natural language.
The Vision Transformer divides the image into N patches of size 16×16. Since each patch is still three-dimensional data (height × width × channels), it cannot be handled directly by the Transformer, which processes sequences of vectors. Therefore, each patch is flattened and then mapped to an embedding vector by a linear projection. By doing so, each patch can be treated as a token (like a word), which can be input to the Transformer.
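The patch-to-token step above can be sketched as follows; a minimal NumPy version, assuming a 224×224×3 input and the paper's 16×16 patches, with a random matrix standing in for the learned linear projection:

```python
import numpy as np

def image_to_patch_embeddings(img, patch=16, d_model=768, rng=None):
    """Split an HxWxC image into non-overlapping patches, flatten each,
    and linearly project every flattened patch to a d_model-dim token."""
    rng = rng or np.random.default_rng(0)
    H, W, C = img.shape
    n_h, n_w = H // patch, W // patch
    # (H, W, C) -> (n_h, n_w, patch, patch, C) -> (N, patch*patch*C)
    patches = (img[:n_h * patch, :n_w * patch]
               .reshape(n_h, patch, n_w, patch, C)
               .transpose(0, 2, 1, 3, 4)
               .reshape(n_h * n_w, patch * patch * C))
    E = rng.standard_normal((patch * patch * C, d_model))  # learned in practice
    return patches @ E   # (N, d_model): one token per patch

img = np.zeros((224, 224, 3))          # a dummy 224x224 RGB image
tokens = image_to_patch_embeddings(img)
print(tokens.shape)  # (196, 768): 14x14 = 196 patch tokens
```

A real implementation also prepends a learnable [CLS] token and adds position embeddings before the sequence enters the Transformer Encoder; those steps are omitted here for brevity.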
In addition, the Vision Transformer uses a pre-training → fine-tuning strategy: it is pre-trained on JFT-300M, a dataset containing 300 million images, and then fine-tuned on downstream tasks such as ImageNet. The Vision Transformer is the first pure Transformer model to achieve SotA performance on ImageNet, and it marked the beginning of a large increase in research on Transformer × Computer Vision.
Why is Vision Transformer so accurate?
Research on Transformer × Computer Vision has been around for a while, but before ViT it had not achieved SotA performance on ImageNet. The authors interpret the reason for this in terms of the model's inductive bias and the amount of data. An inductive bias is an assumption the model makes about the data. For example, CNNs process data with 3×3 kernels, which reflects the assumption that information is locally aggregated. In an RNN, the data at the current time step is directly correlated only with the data at the previous step; data from two steps back influences the current step only through the intermediate step. This processing reflects the assumption that data is strongly correlated with the immediately preceding time step. Self-Attention, on the other hand, simply relates every element to every other element, so its inductive bias is relatively weak compared to CNNs and RNNs.
The authors interpret the strength of ViT as follows: when there is little data, models with a strong inductive bias outperform those with a weak one, because their built-in assumptions about the data help. When there is a lot of data, those assumptions become a hindrance, so models with a weak inductive bias become stronger. The following figure supports this interpretation: the Vision Transformer and CNNs are compared across pre-training dataset sizes. When pre-trained on JFT-300M, the Vision Transformer outperforms CNNs (models with strong inductive bias).