Global Context Vision Transformers — Nvidia’s new SOTA Image Model
Nvidia has recently published a new vision transformer, titled the Global Context Vision Transformer (GC ViT) (Hatamizadeh et al., 2022). GC ViT introduced a novel architecture that leverages both global attention and local attention, allowing it to model both short-range and long-range spatial interactions.
The clever techniques used by the Nvidia researchers enabled GC ViT to model global attention while avoiding expensive computations. GC ViT achieves state-of-the-art (SOTA) results on the ImageNet-1K dataset, surpassing the Swin Transformer by a significant margin.
In this article, we will take a closer look at the inner workings of GC ViT, and the techniques that enabled it to achieve such results.
GC ViT — Improving on the Swin Transformer
Since the Swin Transformer (Liu et al., 2021) was published in 2021, it has cemented itself as one of the most important transformer-based vision models.
The Swin Transformer introduced important techniques such as hierarchical feature maps and window-based attention, which allowed it to achieve competitive performance compared to conventional convolutional neural networks. Today, the Swin Transformer is used as the backbone architecture in a broad range of vision tasks, including in image classification and object detection.
Despite its progress, the Swin Transformer suffers from certain shortcomings. Most notably, the window-based attention used in the Swin Transformer constrains the computation of interactions to within each window, restricting cross-window interactions.
The diagram above shows an example of the limitation of window-based attention used in Swin Transformer. The input image is split into separate windows, and self-attention is then computed only within each window. This restricts the computation of long-range interactions between different objects in the global image. For example, the dog and the ball are split into different windows, and the model is restricted from learning interactions between the two objects. The lack of cross-window connections limits the ability of the model to capture long-range dependencies, which are crucial for accurate representation modeling.
The Swin Transformer tries to introduce some cross-window connections by using shifted-window based attention. However, this is computationally expensive and it does not fundamentally address the lack of global connections. As we will see later, GC ViT improves on this by providing both local and global connections in a single architecture.
Architecture of GC ViT
The overall architecture of GC ViT is shown in the diagram above. As we can see, GC ViT is made up of 4 different stages, and each stage consists of alternating blocks of local and global multi-head self-attention (MSA) layers. Local MSA extracts local, short-range information while global MSA extracts global, long-range information. This allows GC ViT to flexibly attend to both short- and long-range dependencies. In between stages, GC ViT uses a downsampling block to create hierarchical feature maps similar to those of the Swin Transformer.
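To make the hierarchical layout concrete, here is a minimal sketch of how the feature-map shapes evolve across the 4 stages. The specific resolutions and channel counts below are illustrative assumptions, not the exact values for any particular GC ViT variant:

```python
def stage_shapes(h=56, w=56, c=96, num_stages=4):
    """Return the (height, width, channels) of the feature map at each stage.

    Each stage alternates local and global MSA blocks; between stages a
    downsampling block halves the spatial resolution and doubles the
    channels, producing hierarchical feature maps similar to those of the
    Swin Transformer.
    """
    shapes = []
    for _ in range(num_stages):
        shapes.append((h, w, c))
        h, w, c = h // 2, w // 2, c * 2  # downsample between stages
    return shapes

print(stage_shapes())
# [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
```

This pyramid of feature maps is what lets the backbone plug into dense prediction tasks such as object detection, which expect multi-scale features.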
The main contribution of GC ViT is the global token generator and the global MSA layer. In the next section, we will examine them in greater detail.
The overarching principle of global self-attention is to create global connections between every region of an image. To understand how, let's first look at how the image is split into windows and patches for self-attention.
From the diagram above, we see that each image is split into separate windows (shown in purple). Each window is further split into patches (shown in red). In local self-attention, computation of attention between patches is restricted to within each local window. In other words, there are no cross-connections between patches in different windows, which limits the modelling power of the network.
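The window split described above can be sketched in a few lines of numpy. This is a simplified stand-in for the partitioning step (it assumes a square window size that evenly divides the feature map, and omits the batch dimension):

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping
    (window_size, window_size, C) windows.

    Local self-attention is then computed only among the patches inside
    each window -- there are no cross-window connections.
    """
    h, w, c = x.shape
    x = x.reshape(h // window_size, window_size, w // window_size, window_size, c)
    # -> (num_windows, window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, c)

x = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
windows = window_partition(x, window_size=4)
print(windows.shape)  # (4, 4, 4, 3): four 4x4 windows
```

Because attention is computed per window, its cost grows linearly with the number of windows rather than quadratically with the full image size, which is what makes window-based attention cheap in the first place.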
Global self-attention aims to address this by introducing connections between patches and windows, as illustrated by the animation below. These global connections between patches and windows allow GC ViT to attend to global locations in the image, effectively modelling long-range dependencies.
In local MSA (multi-head self-attention), the Query, Key and Value vectors are derived from patches in a local window and attention is computed only within each local window. In contrast, in global MSA, only the Key and Value vectors are derived from patches in the local window. The Query vector is a global query token derived from all windows. The diagram below illustrates the difference between local MSA and global MSA.
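The contrast between the two attention types can be sketched as follows. This is a single-head, numpy-only illustration under assumed shapes and projections, not the paper's implementation; the key difference is only where the Query comes from:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def local_attention(x, wq, wk, wv):
    """Q, K, V are all derived from the patches of one local window."""
    q, k, v = x @ wq, x @ wk, x @ wv
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def global_attention(x, q_global, wk, wv):
    """K, V come from the local window, but Q is a precomputed global
    query token shared across all windows, injecting global context."""
    k, v = x @ wk, x @ wv
    d = q_global.shape[-1]
    return softmax(q_global @ k.T / np.sqrt(d)) @ v

rng = np.random.default_rng(0)
n, dim = 16, 32                           # 16 patches per window, 32-dim embeddings
x = rng.standard_normal((n, dim))         # patches of one local window
wq, wk, wv = (rng.standard_normal((dim, dim)) for _ in range(3))
q_global = rng.standard_normal((n, dim))  # global query token (from the generator)

print(local_attention(x, wq, wk, wv).shape)         # (16, 32)
print(global_attention(x, q_global, wk, wv).shape)  # (16, 32)
```

Note that because the same `q_global` is reused for every window, global attention avoids computing full pairwise attention across the whole image, which is what keeps it cheap.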
The global query token is generated at each stage of the network by a Global Query Generator, which takes feature maps as input and extracts global features as the global query token. The global query token encompasses information across the entire input feature map for interaction with local key and value vectors.
The global query generator consists of a series of operations as shown in the diagram below.
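The core idea can be sketched as repeatedly processing and downsampling the full feature map until its spatial size matches a local window. In the paper this uses Fused-MBConv blocks; the plain 2x2 average pooling below is purely an illustrative stand-in:

```python
import numpy as np

def pool2x2(x):
    """2x2 mean pooling over an (H, W, C) array -- a stand-in for the
    Fused-MBConv + downsample operations in the real generator."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

def global_query_generator(feature_map, window_size):
    """Shrink the full feature map until it matches one window, yielding a
    query token that summarizes information from the entire image."""
    x = feature_map
    while x.shape[0] > window_size:
        x = pool2x2(x)
    return x  # (window_size, window_size, C) global query token

fm = np.random.default_rng(1).standard_normal((56, 56, 96))
q = global_query_generator(fm, window_size=7)
print(q.shape)  # (7, 7, 96)
```

Because the resulting token has the same spatial extent as a local window, it can play the role of the Query against each window's local Keys and Values.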
The interaction of the global query token with local key and value vectors allows the computation of global attention. This effectively enlarges the receptive field and allows the model to attend to various regions in the feature map.
GC ViT also introduced a novel way of downsampling feature maps between stages for creating hierarchical feature maps. Interestingly, GC ViT uses convolution layers for downsampling. The authors believe that using convolution for downsampling provides the network with desirable properties such as locality bias and cross-channel interactions. The operations used in the downsampling block are shown in the diagram below.
Notice the similarity of the first 4 operations with that of the global query generator. In fact, the authors call the first 4 operations the ‘Fused-MBConv’ block and it is inspired by EfficientNetV2.
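Functionally, the downsampling block halves the spatial resolution while doubling the channel count. The sketch below captures only that shape transformation; the mean pooling and random linear projection are simplified stand-ins for the actual Fused-MBConv and strided convolution layers:

```python
import numpy as np

def downsample_block(x, rng):
    """Halve (H, W) and double C for an (H, W, C) feature map.

    Stand-in ops: 2x2 mean pooling for the stride-2 conv, and a random
    linear projection for the channel-doubling convolution.
    """
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))  # spatial /2
    w_proj = rng.standard_normal((c, 2 * c)) / np.sqrt(c)     # channels x2
    return x @ w_proj

rng = np.random.default_rng(2)
x = rng.standard_normal((28, 28, 192))
y = downsample_block(x, rng)
print(y.shape)  # (14, 14, 384)
```

Using convolutions here (rather than the patch-merging of Swin) is how the authors inject locality bias and cross-channel interactions into the downsampling step.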
GC ViT was trained for image classification on the ImageNet-1K dataset. The table below compares the performance of GC ViT on the ImageNet-1K dataset against other CNNs and ViTs, including the Swin Transformer. As we can see, GC ViT achieves a new state-of-the-art benchmark. Furthermore, GC ViT models have better or comparable computational efficiency in terms of the number of FLOPs.
Compared with the Swin Transformer, GC ViT achieves better performance with fewer FLOPs, demonstrating the benefits of combining both local and global self-attention. However, do note that the Swin Transformer has a convolution-free architecture, whereas the GC ViT uses convolution operations for the computation of global attention and downsampling.
GC ViT introduces a novel architecture that combines both local self-attention and global self-attention, allowing the network to model both short- and long-range interactions. By removing the complex and expensive operations and masks required in other ViTs, GC ViT achieves new SOTA results on the ImageNet-1K dataset while being more computationally efficient.
However, it should be noted that unlike the Swin Transformer, GC ViT is not a convolution-free architecture and some of its performance may be derived from convolution’s inductive bias.
Like this article?
Thank you for reading! I hope that this article has been useful for you. If you would like to subscribe to a Medium membership, consider using my link. This helps me to continue creating content that is useful for the community! 😄