Attention in Computer Vision, Part 3: GE




Photo by Marek Piwnicki on Unsplash

Introduction

This article discusses gather-excite (GE), an attention mechanism that aggregates information over large receptive fields and redistributes it to local features, thereby expressing long-range spatial interactions. You can find the GitHub repository for this article here.

Gather-Excite

“Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks” opens by noting that, in theory, the receptive field of a convolutional neural network is large enough to cover the entire input image. In practice, however, the effective receptive field is much smaller and far from global.

The effective receptive field of a 5-layered convolutional neural network with 3 X 3 kernels, for example, encompasses the centre of the theoretical receptive field, but the edges and corners are not included.

Uniform means every kernel weight is 1, and random means the parameters are randomly initialized. Other than in Random + ReLU, there are no nonlinearities. Unless otherwise specified, grid sizes match the theoretical receptive field. Image from “Understanding the Effective Receptive Field in Deep Convolutional Neural Networks.”

In a 10-layered CNN, the discrepancy between its effective and theoretical receptive fields is more pronounced.

Ibid.

Overall, the deeper a network gets and the larger its theoretical receptive field becomes, the more its effective receptive field lags behind.

Ibid.

Training slightly abates this problem, but it is only partially successful.

A ResNet with 17 residual blocks and no downsampling or pooling before and after training on CIFAR-10 with a grid size of 32 X 32. The theoretical receptive field is 74 X 74. Ibid.
A ResNet with 16 residual blocks and 4 downsampling operations (factor of 2) before and after training on CamVid (resolution 960 X 720) with a grid size of 505 X 505, the size of the theoretical receptive field. Initially, the effective receptive field is 100 X 100, but it expands to 150 X 150 after training. Ibid.

Hence, spatially distant neurons do not communicate, which hinders neural networks’ performance on tasks where long-range interactions are indispensable. Gather-excite (GE) is therefore proposed: a module that aggregates spatial data from large spatial neighbourhoods via the gather unit, ξG, and redistributes the information back to every activation via the excite unit, ξE. In other words, the data is downsampled so that the resulting neurons have large receptive fields, and it is then redistributed to the original features it stemmed from to induce long-range interactions.

The semi-transparent, whitish squares on the leftmost figure depict the areas from which information is collected. The middle tensor is the result of the aggregation, and ξE’s objective is to redistribute every neuron in it to its associated region in the original data. Image from “Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks.”

Many designs are possible for these two units. A simple and efficient pairing gathers with average pooling and excites by interpolating the pooled output back to the original resolution, applying a sigmoid, and multiplying the result by the original input.

Nearest-neighbour interpolation is used. Image by the author.

An extent factor, e, is used to determine the stride and kernel size, where the former is set to e and the latter to (2e – 1). Therefore, e controls the receptive field of each neighbourhood and the downsampling factor. In the diagram above, however, this heuristic is not followed, and the kernel size and stride are 2 for illustration purposes.

Additionally, a special instance of this arises when global pooling is used, so that every feature interacts not merely with the activations in its neighbourhood but with every activation in its channel. In that case, GE is equivalent to SE-Var1 from the efficient channel attention paper.
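To make this pairing concrete, below is a minimal PyTorch sketch of the parameter-free gather-excite just described; the class and argument names are my own rather than the paper’s or timm’s, and extent = 0 denotes the global-pooling case.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GENoParams(nn.Module):
    """Parameter-free gather-excite sketch: average-pool gather, then an
    interpolate-and-sigmoid excite. extent=0 denotes the global case."""

    def __init__(self, extent: int = 0):
        super().__init__()
        self.extent = extent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.extent == 0:
            # Global extent: each neuron interacts with its entire channel.
            gathered = x.mean(dim=(2, 3), keepdim=True)
        else:
            # Gather: average pooling with kernel 2e - 1 and stride e.
            gathered = F.avg_pool2d(
                x,
                kernel_size=2 * self.extent - 1,
                stride=self.extent,
                padding=self.extent - 1,
                count_include_pad=False,
            )
        # Excite: upsample back to the input resolution and gate with a sigmoid.
        excited = F.interpolate(gathered, size=x.shape[2:], mode='nearest')
        return x * excited.sigmoid()
```

With extent = 0, the module reduces to global average pooling followed by a sigmoid gate, i.e. the SE-Var1 configuration mentioned above.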

GE is inserted before the residual addition within each residual block. Implementations for GE will be based on the timm package because the official code is incomplete.
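To illustrate that placement, here is a toy basic residual block with the parameter-free module from the previous sketch applied to the residual branch just before the addition. This is a simplified sketch for illustration, not the actual timm ResNet code.

```python
class BasicBlockGE(nn.Module):
    """A toy basic residual block with GE on the residual branch,
    applied just before the addition (illustrative only)."""

    def __init__(self, channels: int, extent: int = 0):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.ge = GENoParams(extent=extent)  # the parameter-free module above
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.ge(out)           # GE before the residual addition
        return self.relu(out + x)    # identity shortcut
```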

This parameter-free version of gather-excite, called GE-θ- (θ- denotes the absence of parameters), consistently improves the accuracy of a ResNet-50 and, with global pooling, exceeds that of a ResNet-101.

Top-1 error on ImageNet. Red dots represent ResNet-50 with or without GE-θ-, and the purple dot is vanilla ResNet-101. Image from “Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks,” with slight modifications.

For squeeze-and-excitation, the performance gap between average and max pooling for aggregation was small, but for GE-θ-, max pooling is considerably worse, especially for a global extent, where it backfires and hurts performance. The two kinds of pooling could be combined for better performance, but the researchers do not pursue that direction.

Top-1 and top-5 error on ImageNet. Ibid.

Naturally, parameterizing GE-θ- should help further, so the authors replace the average pooling in ξG with convolutions to get GE-θ. Specifically, 3 X 3 depthwise convolutions with a stride of 2 downsample the input by a factor of e, so the number of convolutions is log₂(e). For a global extent ratio, a single depthwise convolution whose kernel size equals the spatial dimensions is used. Batch normalization and ReLU are also appended for a non-global extent ratio.
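A rough PyTorch sketch of GE-θ, loosely modelled on timm’s GatherExcite but with my own names and simplifications, could look as follows (the extent is assumed to be a power of two).

```python
import math


class GETheta(nn.Module):
    """GE-θ sketch: a parameterized gather unit built from depthwise
    convolutions, followed by the same interpolate-and-sigmoid excite."""

    def __init__(self, channels: int, extent: int = 0, feat_size: int = None):
        super().__init__()
        layers = []
        if extent == 0:
            # Global extent: one depthwise conv whose kernel spans the feature map.
            assert feat_size is not None, 'feat_size is required for a global extent'
            layers += [
                nn.Conv2d(channels, channels, feat_size, groups=channels, bias=False),
                nn.BatchNorm2d(channels),
            ]
        else:
            num_convs = int(math.log2(extent))
            for i in range(num_convs):
                layers += [
                    nn.Conv2d(channels, channels, 3, stride=2, padding=1,
                              groups=channels, bias=False),
                    nn.BatchNorm2d(channels),
                ]
                if i != num_convs - 1:  # no ReLU after the last convolution
                    layers.append(nn.ReLU(inplace=True))
        self.gather = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gathered = self.gather(x)
        excited = F.interpolate(gathered, size=x.shape[2:], mode='nearest')
        return x * excited.sigmoid()
```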

Note that the ReLU after the final downsampling convolution is removed.
Top-1 error on ImageNet. Red dots represent ResNet-50 with or without GE-θ, and the blue dots are ResNet-50 with GE-θ-. Ibid., with slight modifications.

GE-θ is more fruitful when added to later stages, but optimal performance is obtained when it is present in every stage. Because GE-θ is not as lightweight as GE-θ-, the authors recommend omitting it from stage 2 of ResNet-50 in resource-constrained environments.

Top-1 and top-5 error on ImageNet. When the extent ratio is not specified, it is global. Ibid.

Inspired by the potency of a parameterized gather unit, ξE is also parameterized by prepending 1 X 1 convolutions before the interpolation. Concretely, akin to a bottleneck multilayer perceptron, a 1 X 1 convolution compresses the channels, ReLU is applied, and another 1 X 1 convolution restores the original number of channels.

Employing global average pooling for ξG alongside this parameterized excite unit would reproduce squeeze-and-excitation, but the authors couple a parameterized ξE with a parameterized ξG to supplement the benefits of a parameterized gather unit with those of a parameterized excite unit.

Put another way, it would be identical to squeeze-and-excitation, except that global average pooling is substituted with depthwise convolutions. The name of this variant is GE-θ+, and it can be viewed as a combination of spatial and channel attention, whereas the previous two iterations involved no cross-channel relationships.

The reduction factor is fixed at 16.
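Putting the two parameterized units together, a hedged sketch of GE-θ+ might look like the following, reusing the gather stack from the GE-θ sketch above and adding an SE-style 1 X 1 bottleneck with a reduction of 16 (again, the names are my own).

```python
class GEThetaPlus(nn.Module):
    """GE-θ+ sketch: parameterized gather (depthwise convolutions) plus a
    parameterized, SE-style excite with a bottleneck reduction of 16."""

    def __init__(self, channels: int, extent: int = 0, feat_size: int = None,
                 reduction: int = 16):
        super().__init__()
        # Reuse the parameterized gather stack from the GE-θ sketch above.
        self.gather = GETheta(channels, extent=extent, feat_size=feat_size).gather
        # Parameterized excite: 1x1 bottleneck applied before interpolation.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gathered = self.mlp(self.gather(x))
        excited = F.interpolate(gathered, size=x.shape[2:], mode='nearest')
        return x * excited.sigmoid()
```

Swapping the gather stack for global average pooling would recover squeeze-and-excitation, which is exactly the relationship described above.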

GE-θ+ beats GE-θ- and GE-θ in terms of error on ImageNet, and the increase in FLOPs is trivial, although the number of parameters is considerably higher than that of a plain ResNet-50.

Remarkably, GE-θ+ nears the error rate of a ResNet-152 (21.87%). Ibid.

With ResNet-101, GE-θ- and GE-θ underperform relative to squeeze-and-excitation, but GE-θ+ surpasses it, which is logical, since GE-θ+ is a generalization of SE and should be able to approximate it.

Ibid.

Finally, on the mobile-friendly ShuffleNet 1 X (g = 3), GE-θ and GE-θ+ perform well, with the latter outperforming SE by an impressive margin, although the researchers note that the parameter-free GE offers no improvement. If parameter storage is an issue, the authors advise using a parameterized gather-excite only at select layers.

Ibid.

Conclusion

In this article, gather-excite was examined, an attention mechanism that enables spatially distant features to interact and overcomes the limitations of the effective receptive field by gathering information from neighbourhoods of neurons, through average pooling or convolutions, and redistributing it throughout the original input via the excite unit.

In the next article, selective kernel will be studied, a module that extracts multi-scale information from the data through branches with different kernel sizes, similar to Inception, and combines them through an attention mechanism, thereby dynamically adjusting its receptive field.

