Recent Developments and Views on Computer Vision x Transformer

Transformer and Vision Transformer


First, I would like to explain the Transformer Encoder used in the Vision Transformer. The Transformer is a model proposed in the paper “Attention Is All You Need”. The title of the paper was provocative to those who had been using LSTMs and CNNs.

It uses neither a CNN nor an LSTM; instead, it is built on a mechanism called dot-product Attention, and the model (the Transformer) that builds on it has outperformed existing methods by a large margin.

Taken from [2], an overview of the Transformer model

There are three variables in the (dot-product) Attention used in the Transformer: Query, Key, and Value. Simply put, the mechanism computes an Attention Weight between the Query and each Key word, and then takes the weighted sum of the Values associated with those Keys.

dot-product attention
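As a minimal sketch of the scaled dot-product Attention described above (using NumPy rather than any particular deep-learning framework, and random toy data rather than real features):

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    # attention weights: how much each query attends to each key
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # output: weighted sum of the values
    return weights @ V

# toy example: 3 queries, 4 key/value pairs, feature dimension 8
rng = np.random.default_rng(0)
Q = rng.standard_normal((3, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 8))
out = dot_product_attention(Q, K, V)
print(out.shape)  # (3, 8): one output vector per query
```

Each row of `weights` sums to 1, so every output is a convex combination of the Value vectors.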

Multi-Head Attention, which uses multiple Attention Heads (in MLP terms, the “number of hidden layers” is increased), is defined as follows. The (Single-Head) Attention in the figure above uses Q and K as they are, but in Multi-Head Attention each head has its own projection matrices W_i^Q, W_i^K, and W_i^V, and the features projected by these matrices are used to compute the Attention.

Multi head Attention. Top left image is taken from [2]
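A sketch of Multi-Head Attention under the same toy setup: each head gets its own projection triple (W_i^Q, W_i^K, W_i^V), and the head outputs are concatenated (the final output projection W^O from the paper is omitted for brevity; the matrices here are random stand-ins for learned weights):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(Q, K, V, heads):
    """heads: list of (W_q, W_k, W_v) projection matrices, one triple per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        q, k, v = Q @ W_q, K @ W_k, V @ W_v      # per-head projections
        d_k = k.shape[-1]
        w = softmax(q @ k.T / np.sqrt(d_k))      # per-head attention weights
        outputs.append(w @ v)
    return np.concatenate(outputs, axis=-1)      # concatenate head outputs

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 16, 4, 4
heads = [(rng.standard_normal((d_model, d_head)),
          rng.standard_normal((d_model, d_head)),
          rng.standard_normal((d_model, d_head))) for _ in range(n_heads)]
X = rng.standard_normal((5, d_model))
# Self-Attention: Q, K, and V all come from the same data X
out = multi_head_attention(X, X, X, heads)
print(out.shape)  # (5, 16): 4 heads of dimension 4, concatenated
```

Because each head has its own projections, different heads can attend to different dependencies, as in the visualization below.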

When the Q, K, and V used in this dot-product Attention are all derived from the same data, it is called Self-Attention. In the Transformer, the Encoder and the lower part of the Decoder use Self-Attention. The upper Attention block of the Decoder is not Self-Attention, because the Query comes from the Decoder while K and V come from the Encoder. The following figure shows an example of attention weights. In this figure, the word “making” is used as the query, and the Attention Weight for each key word is calculated and visualized. Each attention head learns different dependencies, so the key words are colored in multiple ways to represent the Attention Weight of each head.

Quoted from [2], the weights of the Transformer’s Self-Attention; subtitles added by the author.

How Vision Transformer works

Vision Transformer is a model that applies the Transformer to the image classification task, and was proposed in October 2020. The architecture is almost the same as the original Transformer, but it handles images in an ingenious way so that they can be processed like natural language.

Vision Transformer architecture, quoted from [1].

The Vision Transformer divides the image into N patches of 16×16 size. Since the patches themselves are three-dimensional data (height × width × channels), they cannot be handled directly by the Transformer, which processes sequences of vectors. Therefore, each patch is flattened and passed through a linear projection, converting the image into a two-dimensional array of token embeddings. By doing so, each patch can be treated as a token (like a word), which can be input to the Transformer.
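The patch-splitting and projection step can be sketched as follows (a NumPy illustration; the projection matrix `E` stands in for the learned embedding, and the 224×224 input and 192-dimensional embedding size are illustrative choices, not values fixed by the paper):

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an H×W×C image into N flattened patches of shape (N, P*P*C)."""
    H, W, C = image.shape
    P = patch_size
    # carve out a (H/P, W/P) grid of P×P×C patches, then flatten each one
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, P * P * C)

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))   # toy "image"
patches = patchify(img)                    # (196, 768): 14×14 patches, 16*16*3 = 768
E = rng.standard_normal((768, 192))        # linear projection (learned in practice)
tokens = patches @ E                       # (196, 192): one token per patch
print(tokens.shape)  # (196, 192)
```

Each row of `tokens` now plays the role of a word embedding; ViT prepends a class token and adds position embeddings before feeding the sequence to the Transformer Encoder.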

In addition, Vision Transformer uses a pre-training → fine-tuning strategy: Vision Transformer is pre-trained on JFT-300M, a dataset containing 300 million images, and fine-tuned on downstream tasks such as ImageNet. Vision Transformer is the first pure transformer model to achieve SotA performance on ImageNet. This was the beginning of a large increase in research on Transformer x Computer Vision.

Why is Vision Transformer so accurate?

Research on Transformer x Computer Vision has been around for a long time, but it had not been able to achieve SotA performance on ImageNet. The authors interpret the reason for this in terms of the inductive bias of the model and the amount of data. An inductive bias is an assumption that the model makes about the data. For example, CNNs process data with 3×3 kernels, which reflects the assumption that information is locally aggregated. In RNNs, the data at the current time step is strongly correlated with the data at the previous step, while data from two steps back influences the current step only through the previous one; this processing reflects the assumption that data is strongly correlated with its recent past. Self-Attention, on the other hand, simply correlates every element with every other, so its inductive bias is relatively weak compared to CNNs and RNNs.

(Left) CNN, which has a strong inductive bias that information is locally aggregated. (Center) RNN, which has a strong inductive bias in that each step is strongly correlated with the previous one. (Right) Self-Attention, which has a relatively weak inductive bias because it simply correlates all features.

The authors interpret the strength of ViT as follows: “When there is little data, models with a strong inductive bias are stronger than those with a weak inductive bias, because their assumptions about the data help. When there is a lot of data, however, those assumptions become a hindrance, so models with a weak inductive bias become stronger.” The following figure supports this interpretation: Vision Transformer and CNNs are compared by the size of the pre-training dataset. When pre-trained on JFT-300M, the Vision Transformer outperforms the CNN (a model with a strong inductive bias).

data amount and accuracy


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot
