How Did CNN Architectures Evolve?
Everyone is using AI these days for almost every task: feed enough data to a complex model and it will work somehow. There is a saying in science that the simpler your algorithm, the better your solution. We should keep in mind that using a complex deep learning architecture for every simple problem is not a good idea; you should keep your model as simple as possible. This blog covers different kinds of CNN architectures and how they evolved over time.
Yann LeCun (one of the inventors of the CNN) wrote a paper in 1998 describing an architecture called LeNet-5, which later served as a basis for deep learning. It was actively used for zip-code digit recognition in the early days of AI. Training LeNet-5 required careful weight-initialization techniques; otherwise it would not converge properly. Then came the legendary paper introducing AlexNet (from Geoff Hinton's lab; Hinton co-authored the famous backpropagation paper). This was a truly large-scale model that could handle the ImageNet dataset. It outperformed every other technique by a huge margin, and it kicked off the modern wave of convolutional neural network research.
Remember that we are talking about 2012, when many modern techniques like batch normalization had not yet been invented. The image below shows the architecture of AlexNet, which looks fairly similar to the deep learning architectures we use today. The major difference between this architecture and today's architectures is that we see two parallel streams of layers. Can you guess why? At the time, there was not enough GPU memory to run the network on a single device, so it was designed to run on two GPUs in parallel. This paper was the first to use ReLU; it also used normalization layers and heavy data augmentation such as jittering and color changes. It used a batch size of 128, SGD with momentum, and a dropout rate of 0.5. It was trained on GTX 580 GPUs, which had only 3 GB of memory each. Look carefully at the diagram: the first conv layer has an output of 55x55x96, split depth-wise across the two GPUs as 55x55x48 on each. Also, not every layer on one GPU talks to the next layer on the other GPU.
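As a quick sanity check on that 55x55x96 figure, the standard conv output-size formula can be verified in a few lines. (The input is taken here as 227x227x3, the size that makes AlexNet's first-layer arithmetic work out, even though the paper's figure says 224.)

```python
def conv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a conv layer: (W - K + 2P) / S + 1."""
    return (input_size - kernel + 2 * padding) // stride + 1

# AlexNet's first conv layer: 227x227x3 input, 96 filters of size 11x11, stride 4.
side = conv_output_size(227, kernel=11, stride=4)
print(side)  # 55 -> the 55x55x96 output, split depth-wise as 55x55x48 per GPU
```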
The next important networks after AlexNet (8 layers) were VGG (19 layers) and GoogLeNet (22 layers). Let's first look at VGGNet. In the image below we can see that VGG16 has almost double the layers of AlexNet. The other change VGG introduced was using only 3×3 conv filters. Stacking three 3×3 conv layers gives the same receptive field as a single 7×7 conv layer. But why use more layers if we could work with one? More layers give us more non-linearity in the decision boundary and also fewer parameters: three 3×3 conv layers have 3×(3²C²) = 27C² parameters, compared to 7²C² = 49C² for a single 7×7 layer, assuming C channels per layer.
The above image shows that even with 3×3 conv filters our network is quite heavy in memory: processing one image takes around 100 MB, and VGG16 has 138 million parameters, which is a lot even with today's computing power. Another thing you will see here is that the number of filters increases as we move deeper into the network. We want to keep roughly the same amount of information throughout the network, so in the deeper layers, where the spatial width and height are smaller, we compensate by increasing the depth (the number of filters). Another thing you will notice in the above architecture is that most of the parameters sit in the fully connected layers at the end: of the 138 million parameters, 102 million are in the first fully connected layer alone. As we move ahead in this article, we'll discuss how to reduce this parameter count to make our models easier to train.
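The 102-million figure is easy to check directly: VGG16's first fully connected layer maps the flattened 7x7x512 output of the last conv stage to 4096 units, and its weight matrix alone dominates the total parameter count.

```python
# Weights of VGG16's first FC layer: flattened 7x7x512 conv output -> 4096 units.
fc1_params = 7 * 7 * 512 * 4096
print(fc1_params)               # 102760448 -> the ~102 million quoted above
print(fc1_params / 138_000_000) # roughly 0.74 of VGG16's 138M parameters
```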
Another network that came out in 2014 was GoogLeNet, from Google. It was extremely good at saving memory: it had only 5 million parameters and, unlike VGG16, no fully connected layers. Let's first look at the naïve implementation of the Inception module that helped Google reduce the parameter count. We take the input from the previous layer, apply several conv filters (and a pooling operation) in parallel, and concatenate the results depth-wise to feed into the next Inception module.
This naïve implementation still performs a lot of operations; let's see with an example how many multiplications happen here. We feed in a 28x28x256 input and get a 28x28x672 output, and just one naïve Inception module performs roughly 854 million operations. Another thing to note is that the output depth keeps growing with each module, because the pooling branch preserves the full input depth in its output.
To fix this, we introduce a 1×1 conv filter before each larger conv layer, and also after the max-pool layer. The diagram below shows how a 1×1 conv reduces the depth dimension, followed by the architecture of the Inception module with the extra 1×1 conv filters. From the calculation in the image we can see that the operation count drops from roughly 854 million to roughly 358 million, less than half the computation. The 1×1 convs also help keep the depth dimension in check. Adding 1×1 conv filters not only reduces the computation; the module also performs better than the naïve one, because the 1×1 convs add extra non-linearity and reduce redundancy in the feature maps by taking linear combinations of them. The authors of this paper also introduced two auxiliary classifiers to encourage discrimination in the lower stages of the network, to increase the gradient signal that gets propagated back, and to provide additional regularization. The auxiliary networks (the branches connected to the auxiliary classifiers) are discarded at inference time.
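A rough sketch of both operation counts. The branch filter sizes here (128/192/96 output filters, with 64-filter 1×1 reductions) follow the commonly used worked example and are assumptions about the diagram, not read from it; the exact reduced total also depends on which branches one counts, but it lands well under half the naive figure either way.

```python
def conv_ops(out_h, out_w, out_depth, kernel, in_depth):
    """Multiply count for a conv layer: each output value needs K*K*D_in multiplies."""
    return out_h * out_w * out_depth * kernel * kernel * in_depth

H = W = 28
D = 256  # depth of the 28x28x256 input

# Naive module: each branch convolves the full 256-deep input.
naive = (conv_ops(H, W, 128, 1, D)    # 1x1 branch, 128 filters
         + conv_ops(H, W, 192, 3, D)  # 3x3 branch, 192 filters
         + conv_ops(H, W, 96, 5, D))  # 5x5 branch, 96 filters
print(naive)  # 854196224 -> the "roughly 854 million" above

# With 1x1 bottlenecks (64 filters) reducing depth before the big convs:
reduced = (conv_ops(H, W, 64, 1, D)      # 1x1 reduction before the 3x3
           + conv_ops(H, W, 64, 1, D)    # 1x1 reduction before the 5x5
           + conv_ops(H, W, 128, 1, D)   # plain 1x1 branch
           + conv_ops(H, W, 192, 3, 64)  # 3x3 on the reduced 64-deep input
           + conv_ops(H, W, 96, 5, 64)   # 5x5 on the reduced 64-deep input
           + conv_ops(H, W, 64, 1, D))   # 1x1 after the pool branch
print(reduced)  # well under half the naive count
```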
Inception-v3 paper: https://arxiv.org/abs/1512.00567
The next architecture we are going to look at is ResNet, a much deeper network (152 layers) with skip connections. So, the question here is: can we just keep stacking layers and make an extremely deep network? The answer is no; we run into vanishing and exploding gradients. The diagram below shows a 56-layer network performing worse than a 20-layer network on test data. Note that this is not overfitting due to the larger number of parameters: if it were overfitting, the 56-layer model's training accuracy would have been better than the 20-layer model's, and only its test accuracy would suffer, but that is not the case here. So, what is the problem? The hypothesis is that deeper models are simply harder to optimize.
Let's look at the architecture of ResNet. The diagram below shows a skip connection after every two conv layers. To understand what this changes in the learning process, consider what each layer is trying to learn. A normal conv layer tries to learn a mapping from one feature set to another, but adding a skip connection changes everything: the layers now learn the residual F(x) = H(x) − x rather than the full transformation H(x). Put simply, a block used to learn H_new(x) directly from H_old(x); now it learns only the delta which, when added back to H_old(x), gives H_new(x).
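The residual idea can be sketched in a few lines of NumPy, with plain weight matrices standing in for the two conv layers (a simplification for illustration, not ResNet's actual layers). The key point: with zero weights the block is an identity mapping, which is exactly the "easy" solution that deep plain networks struggle to recover.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """Sketch of a residual block on a flat feature vector: the two weight
    matrices play the role of the two conv layers, and the block returns
    H(x) = F(x) + x, so the weights only have to model the residual F(x)."""
    fx = relu(w1 @ x)    # first "conv" + ReLU
    fx = w2 @ fx         # second "conv" (no activation before the add)
    return relu(fx + x)  # skip connection, then the final ReLU

# With all-zero weights, F(x) = 0 and the block passes non-negative
# inputs through unchanged -- an identity mapping for free.
x = np.array([1.0, 2.0, 3.0])
zeros = np.zeros((3, 3))
print(residual_block(x, zeros, zeros))  # [1. 2. 3.]
```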
Deeper variants such as ResNet-50 and beyond use a similar architecture but add a 1×1 conv filter before and after the 3×3 conv layer (a "bottleneck" block). ResNet training is done as follows: batch normalization after every conv layer; SGD + momentum (0.9); a learning rate of 0.1, divided by 10 when the validation error plateaus; a mini-batch size of 256; a weight decay of 1e-5; and no dropout.
Another variation of ResNet is the Wide Residual Network, which uses more filters in each conv layer.
Other architectures combine properties of both ResNet and the Inception module; one such network is ResNeXt.
Another, very different, network is FractalNet. It argues that the key to learning good mappings is transitioning effectively from shallow to deep sub-paths, and its design differs from both ResNet and the Inception networks.
At last, we have DenseNet, where each layer is connected to every subsequent layer in a feed-forward fashion. The DenseNet authors argue that these connections from shallow layers all the way to the deepest layers alleviate the vanishing-gradient problem.
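The DenseNet connectivity pattern can be sketched on 1-D feature vectors (this shows only the concatenation wiring, not a real DenseNet layer with convolutions and growth rate):

```python
import numpy as np

def dense_block(x, layer_fns):
    """Sketch of DenseNet connectivity: each layer receives the concatenation
    of the block input and every earlier layer's output."""
    features = [x]
    for fn in layer_fns:
        out = fn(np.concatenate(features))  # layer sees all previous features
        features.append(out)
    return np.concatenate(features)         # block output: everything, concatenated

# Toy usage: three "layers" that each emit 2 features; the block output
# grows to hold the input plus every layer's output.
result = dense_block(np.ones(2), [lambda f: np.ones(2)] * 3)
print(result.shape)  # (8,) -> 2 input features + 3 layers x 2 features
```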
Other newer architectures include SqueezeNet, U-Net, EfficientNet, etc. Please let me know if you would like a detailed explanation of any other recent architecture.
I hope now you have a much better understanding of how CNN architectures evolved with time.