Attention in Computer Vision, Part 3: GE
This article discusses gather-excite (GE), an attention mechanism that aggregates information from large receptive fields and redistributes it to local features, thereby expressing long-range spatial interactions. You can find the GitHub repository for this article here.
Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks opens by noting that, in theory, the receptive field of convolutional neural networks is sufficiently extensive to cover the totality of input images. However, the effective receptive field in practice is much smaller and not global.
The effective receptive field of a 5-layered convolutional neural network with 3 X 3 kernels, for example, encompasses the centre of the theoretical receptive field, but the edges and corners are not included.
In a 10-layered CNN, the discrepancy between its effective and theoretical receptive fields is more pronounced.
Overall, the deeper a network gets and the larger its theoretical receptive field, the more its effective receptive field struggles to enlarge.
Training slightly abates this problem, but it is only partially successful.
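The gap between the theoretical and effective receptive fields can be checked empirically by backpropagating from a single centre activation and inspecting the gradient with respect to the input. A minimal PyTorch sketch, mirroring the 5-layer, 3 X 3 example above (the 32 X 32 input size and random weights are assumptions for illustration):

```python
import torch
import torch.nn as nn

# A hypothetical 5-layer stack of 3x3 convolutions (padding keeps the
# spatial size); its theoretical receptive field is 11 x 11.
net = nn.Sequential(*[nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
                      for _ in range(5)])

x = torch.randn(1, 1, 32, 32, requires_grad=True)
y = net(x)
# Backpropagate from the centre output activation alone.
y[0, 0, 16, 16].backward()
grad = x.grad[0, 0].abs()

# The gradient (a proxy for the receptive field) is zero outside the
# theoretical 11 x 11 window around the centre ...
assert grad[16, 22] == 0 and grad[0, 0] == 0
# ... and nonzero at the centre itself; in practice its magnitude also
# decays sharply towards the window's border, which is the effective
# receptive field shrinking relative to the theoretical one.
assert grad[16, 16] > 0
```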
Hence, spatially far-off neurons do not communicate, thus hindering neural networks’ performance on tasks where adequate long-range interactions are indispensable. Ergo, gather-excite (GE) is suggested, a module that aggregates spatial data from large spatial neighbourhoods via the gather module, ξG, and redistributes the information back to every activation with the excite module, ξE. In other words, the data is downsampled so the resulting neurons have large receptive fields, and they are redistributed to the original features they stemmed from to force long-range interactions.
Countless modules are available for these two units, a simple and efficient pairing being gathering with average pooling and exciting by interpolating the output to the original dimension, applying sigmoid, and multiplying it by the original input.
An extent factor, e, is used to determine the stride and kernel size, where the former is set to e and the latter to (2e – 1). Therefore, e controls the receptive field of each neighbourhood and the downsampling factor. In the diagram above, however, this heuristic is not followed, and the kernel size and stride are 2 for illustration purposes.
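The parameter-free pairing described above can be sketched in a few lines of PyTorch; the class name and the choice of nearest-neighbour interpolation for the excite step are assumptions, not details taken from the paper's code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GENoParam(nn.Module):
    """Parameter-free gather-excite: a minimal sketch with the heuristic
    kernel size 2e - 1 and stride e from the extent factor e."""
    def __init__(self, extent: int):
        super().__init__()
        self.extent = extent

    def forward(self, x):
        # Gather: average-pool neighbourhoods (kernel 2e - 1, stride e).
        gathered = F.avg_pool2d(x, kernel_size=2 * self.extent - 1,
                                stride=self.extent, padding=self.extent - 1)
        # Excite: upsample to the input size, apply sigmoid, gate the input.
        attn = F.interpolate(gathered, size=x.shape[-2:], mode='nearest')
        return x * torch.sigmoid(attn)

x = torch.randn(2, 64, 32, 32)
out = GENoParam(extent=4)(x)
assert out.shape == x.shape
```

Because the sigmoid gate lies in (0, 1), the output is always an attenuated copy of the input, with the attenuation decided by each activation's spatial neighbourhood.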
Additionally, a special instance of this would be when global pooling is utilized so every feature interacts with not merely activations within its neighbourhood, but with every activation in its channel. In that case, GE would be equivalent to SE-Var1 from the efficient channel attention paper.
This parameter-free variant of gather-excite, called GE-θ- (θ- symbolizes the nonexistence of parameters), consistently enhances the score of a ResNet-50 and, with global pooling, exceeds the accuracy of a ResNet-101.
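The global-extent special case reduces to an especially compact form, since gathering collapses each channel to a single value; a sketch (the function name is an assumption):

```python
import torch

def ge_global(x: torch.Tensor) -> torch.Tensor:
    """Parameter-free gather-excite with a global extent: each activation
    is gated by the mean of its whole channel, so every pair of spatial
    positions interacts. A minimal sketch."""
    context = x.mean(dim=(2, 3), keepdim=True)   # gather: (N, C, 1, 1)
    return x * torch.sigmoid(context)            # excite: broadcast gate

x = torch.randn(2, 64, 32, 32)
out = ge_global(x)
assert out.shape == x.shape
```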
For squeeze-and-excitation, the gap between the performance of average and max pooling for aggregation was small, but for GE-θ-, max pooling is considerably worse, especially for a global extent, where it backfires and hurts performance. The two kinds of pooling can coalesce for better performance, but the researchers do not pursue that track.
Naturally, parameterizing GE-θ- should further help, so the authors decide to supplant average pooling in ξG with convolutions to get GE-θ. Specifically, 3 X 3 depthwise convolutions with strides of 2 are used to downsample the input with a factor of e, so the number of convolutions would consequently be log₂(e). For a global extent ratio, a single depthwise convolution whose kernel size is the spatial dimension is used. Batch normalization and ReLU are also appended for a non-global extent ratio.
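The parameterized gather unit of GE-θ for a non-global extent might look like the following sketch, with log₂(e) strided 3 X 3 depthwise convolutions, each halving the spatial size; the exact placement of batch normalization and ReLU after every convolution is an assumption:

```python
import math
import torch
import torch.nn as nn

def make_gather_theta(channels: int, extent: int) -> nn.Sequential:
    """Sketch of the GE-θ gather unit: log2(extent) depthwise 3x3
    convolutions with stride 2, interleaved with BatchNorm and ReLU."""
    layers = []
    for _ in range(int(math.log2(extent))):
        layers += [
            # groups=channels makes the convolution depthwise.
            nn.Conv2d(channels, channels, kernel_size=3, stride=2,
                      padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        ]
    return nn.Sequential(*layers)

gather = make_gather_theta(channels=64, extent=8)  # 3 depthwise convs
x = torch.randn(2, 64, 32, 32)
out = gather(x)
assert out.shape == (2, 64, 4, 4)  # downsampled by a factor of e = 8
```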
GE-θ is more fruitful when added to later stages, but optimal performance is procured when it is present in every stage. Because GE-θ is not as lightweight as GE-θ-, the authors recommend omitting it from stage 2 of ResNet-50 in resource-constrained environments.
Inspired by the potency of a parameterized gather unit, ξE is also parameterized by prepending 1 X 1 convolutions before interpolation. Concretely, akin to a bottleneck multilayer perceptron, a 1 X 1 convolution compresses the channels, ReLU is applied, and another 1 X 1 convolution increases the number of channels to the original one.
Employing global average pooling for ξG alongside this parameterized excite unit would produce squeeze-and-excitation, but the authors couple a parameterized ξE with a parameterized ξG to supplement the benefits of a parameterized gather unit with that of a parameterized excite unit.
Put another way, it would be identical to squeeze-and-excitation, except that global average pooling is substituted with depthwise convolutions. The name of this variant is GE-θ+, and it can be viewed as a combination of spatial and channel attention, whereas the previous two iterations involved no cross-channel relationships.
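Putting the two parameterized units together, GE-θ+ with a global extent might look like the following sketch; the class name and the reduction ratio r = 16 of the bottleneck are assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GEThetaPlus(nn.Module):
    """Sketch of GE-θ+ with a global extent: the gather unit is one
    depthwise convolution spanning the whole feature map, and the excite
    unit is an SE-style bottleneck of 1x1 convolutions."""
    def __init__(self, channels: int, spatial: int, r: int = 16):
        super().__init__()
        # Gather: depthwise conv whose kernel covers the spatial extent.
        self.gather = nn.Conv2d(channels, channels, kernel_size=spatial,
                                groups=channels, bias=False)
        # Excite: compress channels, ReLU, then restore them.
        self.excite = nn.Sequential(
            nn.Conv2d(channels, channels // r, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, kernel_size=1),
        )

    def forward(self, x):
        context = self.excite(self.gather(x))  # (N, C, 1, 1)
        attn = F.interpolate(context, size=x.shape[-2:], mode='nearest')
        return x * torch.sigmoid(attn)

x = torch.randn(2, 64, 16, 16)
out = GEThetaPlus(channels=64, spatial=16)(x)
assert out.shape == x.shape
```

Swapping the gather convolution for global average pooling would recover squeeze-and-excitation exactly, which is why the article describes SE as a special case of GE-θ+.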
GE-θ+ beats GE-θ- and GE-θ in terms of error on ImageNet, and the increase in FLOPs is trivial, although the parameter count is much higher than that of a plain ResNet-50.
With ResNet-101, GE-θ- and GE-θ demonstrate subpar performance relative to squeeze-and-excitation, but GE-θ+ surpasses it, which is logical, for it is a general case of SE and should be able to approximate it.
Finally, on the mobile-friendly ShuffleNet 1 X (g = 3), GE-θ and GE-θ+ perform well, with the latter outperforming SE by an impressive margin, although the researchers note that the parameter-free GE offers no improvement. If parameter storage is an issue, the authors advise using a parameterized gather-excite only at select layers.
In this article, gather-excite was examined, an attention mechanism that enables spatially distant features to interact with one another and overcome the limitations of the effective receptive field by gathering information from neighbourhoods of neurons through average pooling or convolutions and redistributing it back throughout the original input through the excite module.
In the next article, selective kernel will be studied, a module that extracts multi-scale information from the data through branches with different kernel sizes, similar to Inception, and combines them through an attention mechanism, thereby being able to dynamically adjust its receptive field.