How image processing networks have changed over time
From convolutions to self-attention to dense layers and much more
Computer vision has been evolving rapidly for some time now, and I think it's time we take a step back to investigate how the architectures have changed over time and weigh the pros and cons of each. Computer vision is a huge field, and it is fundamentally challenging because, to a computer, images are just matrices of numbers. The central question these evolving networks try to answer is which operations and procedures, applied to those matrices, best turn raw numbers into quantifiable, useful features such as colors, textures, shades, and much more.
1. Convolutional Neural Networks (CNNs)
CNNs are the most popular type of image network, so I will just give a quick overview before we move on to other architectures. CNNs learn features through a series of "convolution", pooling, and activation layers. A convolution layer is built around a "kernel" that acts as a sliding window: the window slides over the matrix of numbers, and at each position an element-wise multiplication followed by a sum (a dot product) condenses that window into a single feature value. There are several kernel parameters you can tweak, such as the kernel size and the stride (the shift amount of the sliding window).
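To make the sliding-window idea concrete, here is a minimal NumPy sketch of a single-channel 2D convolution. It uses no padding and a hand-picked kernel rather than learned weights, which a real CNN would train:

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    """Slide `kernel` over `image`, taking an element-wise
    product-and-sum (dot product) at each window position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i * stride:i * stride + kh,
                           j * stride:j * stride + kw]
            out[i, j] = np.sum(window * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)       # toy 4x4 "image"
edge_kernel = np.array([[1.0, -1.0],
                        [1.0, -1.0]])                  # crude vertical-edge detector
features = conv2d(image, edge_kernel, stride=1)
print(features.shape)  # a 4x4 input with a 2x2 kernel and stride 1 gives 3x3
```

A larger stride shrinks the output further, which is one way CNNs trade spatial resolution for speed.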
The next type of layer, the pooling layer, reduces the spatial size of the image. This can be done in several ways: max pooling and average pooling are the most common operations.
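A pooling layer can be sketched the same way. This toy NumPy version applies max pooling with non-overlapping 2 × 2 windows, halving each spatial dimension:

```python
import numpy as np

def max_pool2d(x, pool=2, stride=2):
    """Downsample by keeping only the max of each window."""
    out_h = (x.shape[0] - pool) // stride + 1
    out_w = (x.shape[1] - pool) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = x[i * stride:i * stride + pool,
                          j * stride:j * stride + pool].max()
    return out

x = np.array([[1., 2., 5., 6.],
              [3., 4., 7., 8.],
              [9., 1., 2., 0.],
              [5., 6., 3., 4.]])
pooled = max_pool2d(x)
print(pooled)  # [[4. 8.] [9. 4.]]
```

Swapping `.max()` for `.mean()` would give average pooling instead.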
A CNN basically passes the image through a series of convolutional, pooling, and activation layers. The final result is essentially a much smaller image (or a matrix) of features.
This might seem like a really high-level overview of CNNs, but I think most of the people reading this article are going to be familiar with them.
2. Vision Transformers (ViTs)
Vision transformers replace the convolutional and pooling layers (the CNN layers) with self-attention. They use Multi-head Self-Attention layers, which are based on the attention mechanism: queries, keys, and values are used to "pay attention" to information from different representations at different positions.
A classic transformer block for images applies a Multi-head Self-Attention layer followed by a feed-forward network, each preceded by layer normalization. One interesting bit is that the feed-forward network uses an activation function called the Gaussian Error Linear Unit (GELU), which weights each input by the probability that a standard Gaussian falls below it. This can be viewed as a smooth, deterministic version of regularizing the model by randomly multiplying a few activations by 0.
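In practice GELU is computed deterministically; the "random zeroing" story is the intuition behind its design. A minimal sketch of the exact form, x · Φ(x), using only the standard library's error function:

```python
import numpy as np
from math import erf, sqrt

def gelu(x):
    """Gaussian Error Linear Unit: x * Phi(x), where Phi is the
    standard normal CDF, written via the error function."""
    return np.array([0.5 * v * (1.0 + erf(v / sqrt(2.0))) for v in x])

vals = gelu(np.array([-2.0, 0.0, 2.0]))
# Large positive inputs pass through almost unchanged, large negative
# inputs are pushed toward 0, and gelu(0) is exactly 0.
```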
Self-attention essentially creates connections between pieces of information (the inputs). It operates on all inputs at once, in a bi-directional manner, so the order of the inputs doesn't matter by itself. Its operations are mostly dot products: queries are compared against keys, and the resulting scores weight the values. The concept of looking up values in a dictionary using keys and queries is quite common in computer science.
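The query/key/value mechanics described above can be sketched in a few lines of NumPy. This is a single attention head with randomly initialized projection matrices; in a real ViT these matrices are learned, and several heads run in parallel:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of
    shape (n_tokens, d). Returns the mixed tokens and the weights."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # pairwise query-key similarity
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                  # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
```

Note that shuffling the rows of `X` just shuffles the rows of the output the same way, which is why positional information has to be added separately.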
Another mechanism ViTs use to make sense of images is positional embeddings. ViTs start by breaking the image into small patches (typically 16 × 16 pixels). Part of their computation is to measure "distances" between those patches, which represent the degree of similarity between them. This also allows the transformer to capture features that are relevant across the whole image. A common limitation of CNNs is that kernels capture features that are local to a certain patch of the image rather than spanning the whole image.
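The patching step itself is straightforward. Here is a minimal sketch, assuming a 64 × 64 RGB image and random positional embeddings (a real ViT learns the positional embeddings and also applies a learned linear projection to each patch):

```python
import numpy as np

def to_patches(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches."""
    H, W, C = image.shape
    rows, cols = H // patch, W // patch
    patches = []
    for r in range(rows):
        for c in range(cols):
            patches.append(
                image[r * patch:(r + 1) * patch,
                      c * patch:(c + 1) * patch].ravel())
    return np.stack(patches)

rng = np.random.default_rng(1)
img = np.zeros((64, 64, 3))                  # toy 64x64 RGB image
tokens = to_patches(img, patch=16)           # 16 patches, each 16*16*3 = 768 values
pos_embed = rng.normal(size=tokens.shape)    # random stand-in for learned embeddings
tokens = tokens + pos_embed                  # now each token "knows" its position
```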
While CNNs avoid hand-crafted feature-extraction, the architecture itself is designed specifically for images and can be computationally demanding. Looking forward to the next generation of scalable vision models, one might ask whether this domain-specific design is necessary, or if one could successfully leverage more domain agnostic and computationally efficient architectures to achieve state-of-the-art results.
Source: Google AI Blog
Another point to note is that ViTs use less memory than CNNs. This might not seem obvious, since self-attention layers are known to be compute-intensive; however, a CNN typically stacks a large number of convolutional layers and ends up using more memory than a ViT. On the other hand, ViTs require much more pre-training data to reach the same level of performance as CNNs. You typically start seeing the magic of ViTs when they are pre-trained on datasets such as JFT-300M, which is much larger than the typical ImageNet dataset. In all fairness, though, you can easily find pre-trained weights for a ViT.
We first train ViT on ImageNet, where it achieves a best score of 77.9% top-1 accuracy. While this is decent for a first attempt, it falls far short of the state of the art — the current best CNN trained on ImageNet with no extra data reaches 85.8%
Source: Google AI Blog
3. MLP-Mixer
This is quite an interesting architecture. Although I was expecting an upgrade to self-attention to make it more optimal, the MLP-Mixer sticks with the basics. It doesn't use any convolutional or self-attention layers; instead, it just uses classic multi-layer perceptrons! Even so, it achieves performance competitive with the state of the art. Okay, so the next question is: how does the MLP-Mixer extract features from images?
The MLP-Mixer reasons about the image in patches (similar to ViTs), with the difference that it also mixes information across the channels of those patches. It uses two different types of layers. The first type (call it the channel-mixing layer) operates on each patch independently and allows communication between its channels (hence channel-mixing). The second type (call it the patch-mixing layer) works the other way around: it operates on each channel independently and allows communication between different patches.
The MLP-Mixer essentially attempts to learn the best way to mix the channels and patches of the image. It starts by encoding those patches with a per-patch linear projection, much like a lookup table of patch embeddings. You can think of it as a jigsaw puzzle: you keep mixing the pieces together until you get a meaningful output that is close to the goal you had in mind.
The core idea of modern image processing networks is to mix features at a given location and to mix features between different locations. CNNs perform both types of mixing with convolutions, kernels, and pooling, while vision transformers perform them with self-attention. The MLP-Mixer, however, performs the two kinds of mixing in clearly separated steps, using only MLPs. The main advantage of using only MLPs (which are basically matrix multiplications) is the simplicity of the architecture and the speed of computation.
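The two separated mixing steps can be sketched as a simplified Mixer block in NumPy. This toy version keeps the residual connections but drops the layer normalization used in the paper, and uses ReLU in place of GELU; the dimensions are arbitrary illustration values:

```python
import numpy as np

def mlp(x, W1, W2):
    """Two-layer perceptron with a ReLU nonlinearity."""
    return np.maximum(x @ W1, 0.0) @ W2

def mixer_block(tokens, Wt1, Wt2, Wc1, Wc2):
    """One simplified Mixer block on `tokens` of shape (patches, channels)."""
    # Patch (token) mixing: transpose so the MLP runs along the patch
    # axis, letting every patch exchange information with every other.
    tokens = tokens + mlp(tokens.T, Wt1, Wt2).T
    # Channel mixing: the MLP runs along the channel axis of each
    # patch independently.
    tokens = tokens + mlp(tokens, Wc1, Wc2)
    return tokens

rng = np.random.default_rng(0)
n_patches, channels, hidden = 16, 32, 64
tokens = rng.normal(size=(n_patches, channels))
Wt1, Wt2 = rng.normal(size=(n_patches, hidden)), rng.normal(size=(hidden, n_patches))
Wc1, Wc2 = rng.normal(size=(channels, hidden)), rng.normal(size=(hidden, channels))
out = mixer_block(tokens, Wt1, Wt2, Wc1, Wc2)
```

The transpose is the whole trick: the same MLP machinery mixes across patches or across channels depending only on which axis it is applied to.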
The MLP-Mixer doesn't exceed the previous architectures in accuracy; however, it is up to 3× faster (which makes sense, given its much simpler design).
How computer vision has changed is quite a broad topic, and I hope this article gave you a quick glance at the recent breakthroughs. I have been reading a lot of computer vision papers over the last few years, and it struck me how much the architectures have changed, which is why I decided to write this article. I was also interested in reasoning about the different ways these networks break images down into features (which is the whole idea of computer vision), so that was the core of my analysis. I hope you have enjoyed it; if I missed anything or you think something is not quite accurate, please let me know in the comments, as I would love to learn!