A Primer on Atrous (Dilated) and Depth-wise Separable Convolutions
What are atrous/dilated and depth-wise separable convolutions? How are they different from standard convolutions? What are their uses?
With properties such as weight sharing and translation invariance, convolutional layers and CNNs have become ubiquitous in computer vision and image processing tasks that use deep learning methods. With that in mind, this article discusses some of the developments we’ve seen in convolutional networks. Specifically, we focus on two of them: atrous (dilated) convolutions and depth-wise separable convolutions. We will see how these two types of convolution work, how they differ from normal convolutions, and why we may want to use them.
Before we get into the topic, let’s quickly remind ourselves how a convolutional layer works. At their core, convolutional filters are simply feature extractors. What were once hand-crafted feature filters are now learned through the “magic” of back-propagation. We have a kernel (the weights of the conv layer) that is slid over the input feature map. At each location, an element-wise multiplication followed by a summation of the products is performed to obtain a scalar value. The same operation is performed at each location. Fig. 1 shows this in action.
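The sliding multiply-and-sum described above can be sketched in plain Python. This is a minimal illustration, assuming a single channel, stride 1 and no padding; the function name conv2d_single_channel and the example values are made up for this sketch and are not taken from Fig. 1.

```python
def conv2d_single_channel(feature_map, kernel):
    """Slide `kernel` over `feature_map` (stride 1, no padding).

    As in most deep learning libraries, the kernel is not flipped,
    so this is strictly a cross-correlation.
    """
    in_h, in_w = len(feature_map), len(feature_map[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h, out_w = in_h - k_h + 1, in_w - k_w + 1
    output = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Element-wise multiply the kernel with the patch under it, then sum.
            output[i][j] = sum(
                feature_map[i + a][j + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
    return output


# Hypothetical 4x4 input and a 3x3 vertical-edge kernel.
feature_map = [
    [1, 2, 3, 0],
    [0, 1, 2, 3],
    [3, 0, 1, 2],
    [2, 3, 0, 1],
]
kernel = [
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
]
result = conv2d_single_channel(feature_map, kernel)
print(result)  # [[-2, -2], [2, -2]]
```

Each output element depends only on the 3×3 patch under the kernel, and the same nine weights are reused at every location.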
The convolutional filter detects a particular feature by sliding over the input feature map, i.e., it looks for that feature at each location. This intuitively explains the translation invariance property of convolutions (strictly speaking, translation equivariance: shifting the input shifts the output by the same amount).
To understand how atrous convolution differs from standard convolution, we first need to know what a receptive field is. The receptive field is defined as the size of the region of the input feature map that produces each output element. In the case of Fig. 1, the receptive field is 3×3, as each element in the output feature map sees (uses) 3×3 input elements.
Deep CNNs use a combination of convolutions and max-pooling. This has the disadvantage that, at each downsampling step, the spatial resolution of the feature map is halved. Superimposing the resulting feature map onto the original image results in sparse feature extraction. This effect can be seen in Fig. 2. The conv. filter downsamples the input image by a factor of two. Upsampling the feature map and overlaying it on the image shows that the responses correspond to only 1/4 of the image locations (sparse feature extraction).
Atrous (dilated) convolution fixes this problem and allows for dense feature extraction. This is achieved with a new parameter called the rate (r). Put simply, atrous convolution is akin to standard convolution except that the weights of an atrous convolution kernel are spaced r locations apart, i.e., the kernels of dilated convolution layers are sparse.
Fig. 3(a) shows a standard kernel and Fig. 3(b) a dilated 3×3 kernel with rate r = 2. By controlling the rate parameter, we can control the receptive field of the conv. layer. This allows the conv. filter to look at larger areas of the input (a larger receptive field) without decreasing the spatial resolution or increasing the number of kernel weights. Fig. 4 shows a dilated convolutional filter in action.
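To make the "sparse kernel" picture concrete, here is a small sketch that spreads the weights of a square kernel r locations apart and fills the gaps with zeros. The helper name dilate_kernel and the example weights are assumptions made for illustration. The side length it computes, k + (k - 1)(r - 1), is the effective size of a k×k kernel dilated with rate r.

```python
def dilate_kernel(kernel, rate):
    """Return a square kernel with (rate - 1) zeros between neighbouring weights."""
    k = len(kernel)
    size = k + (k - 1) * (rate - 1)  # effective kernel size
    dilated = [[0] * size for _ in range(size)]
    for a in range(k):
        for b in range(k):
            # Original weight (a, b) lands `rate` positions apart in the new grid.
            dilated[a * rate][b * rate] = kernel[a][b]
    return dilated


kernel = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # hypothetical 3x3 weights
dilated = dilate_kernel(kernel, rate=2)
for row in dilated:
    print(row)
# [1, 0, 2, 0, 3]
# [0, 0, 0, 0, 0]
# [4, 0, 5, 0, 6]
# [0, 0, 0, 0, 0]
# [7, 0, 8, 0, 9]
```

With r = 2 the 3×3 kernel covers a 5×5 region while still holding only nine weights. With r = 1 the kernel is returned unchanged, which is the standard convolution.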
Compared to the standard convolution used in Fig. 2, it can be seen in Fig. 5 that dense features are extracted by using a dilated kernel with rate r = 2. In PyTorch, dilated convolutions can be implemented simply by setting the dilation parameter of the convolution layer to the required dilation rate.
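A minimal PyTorch sketch of this, assuming a 3-channel 32×32 input and 8 output channels (these sizes are arbitrary and not taken from the figures). The padding values are chosen so that both layers preserve the spatial size.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)  # (batch, channels, height, width), arbitrary sizes

# Standard 3x3 convolution; padding=1 keeps the 32x32 resolution.
standard = nn.Conv2d(3, 8, kernel_size=3, padding=1)

# Dilated 3x3 convolution with rate 2. The kernel spans 5x5 input
# locations, so padding=2 is needed to keep the 32x32 resolution.
dilated = nn.Conv2d(3, 8, kernel_size=3, padding=2, dilation=2)

standard_out = standard(x)
dilated_out = dilated(x)
print(standard_out.shape)  # torch.Size([1, 8, 32, 32])
print(dilated_out.shape)   # torch.Size([1, 8, 32, 32])

# Both layers hold the same number of weights.
standard_params = sum(p.numel() for p in standard.parameters())
dilated_params = sum(p.numel() for p in dilated.parameters())
print(standard_params, dilated_params)  # 224 224
```

The dilated layer sees a larger region of the input at every location without any extra parameters and without reducing the output resolution.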
Depth-wise Separable Convolution
Depth-wise separable convolution was popularized by the Xception network. Fig. 6 shows a standard convolution operation, where each kernel acts on all input channels. For the configuration shown in Fig. 6, we have 256 kernels of size 5x5x3.
Fig. 7(a) shows depth-wise convolution, where a separate filter is applied to each channel. This is what differentiates a depth-wise separable convolution from a standard convolution. The output of the depth-wise convolution has the same number of channels as the input. For the configuration shown in Fig. 7(a), we have 3 kernels of size 5x5x1, one for each channel. Inter-channel mixing is achieved by convolving the output of the depth-wise convolution with 1×1 kernels, one per required output channel (Fig. 7(b)).
Why choose Depth-wise Separable Convolution?
To answer this we take a look at the number of multiplications required to perform a standard convolution and a depth-wise separable convolution.
For the configuration specified in Fig. 6, we have 256 kernels of size 5x5x3. The total multiplications required to compute the convolution:
256*5*5*3*(8*8 locations) = 1228800
Depth-wise Separable Convolution
For the configuration specified in Fig. 7, we have 2 convolution operations:
1) 3 kernels of size 5x5x1. Here, the number of multiplications required is: 5*5*3*(8*8 locations) = 4800
2) 256 kernels of size 1x1x3 for the 1×1 convolution. The number of multiplications required: 256*1*1*3*(8*8 locations) = 49152
Total multiplications required for the depth-wise separable convolution: 4800 + 49152 = 53952.
We can quite clearly see that the depth-wise separable convolution requires far fewer multiplications than the standard convolution.
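The arithmetic above can be reproduced with a short script. The function names are made up for this sketch; the numbers are the ones used in the text (5x5 kernels, 3 input channels, 256 output channels, 8*8 output locations).

```python
def standard_conv_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications for c_out kernels of size k x k x c_in."""
    return c_out * k * k * c_in * out_h * out_w


def depthwise_separable_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications for the depth-wise step and the 1x1 step."""
    depthwise = k * k * c_in * out_h * out_w   # one k x k x 1 kernel per channel
    pointwise = c_out * c_in * out_h * out_w   # c_out kernels of size 1 x 1 x c_in
    return depthwise, pointwise


standard = standard_conv_mults(5, 3, 256, 8, 8)
depthwise, pointwise = depthwise_separable_mults(5, 3, 256, 8, 8)
separable = depthwise + pointwise

print(standard)                         # 1228800
print(depthwise, pointwise, separable)  # 4800 49152 53952
print(round(standard / separable, 1))   # 22.8
```

For this configuration the standard convolution needs roughly 23 times as many multiplications as the depth-wise separable one.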
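In PyTorch, the depth-wise convolution can be implemented by setting the groups parameter of the convolution layer to the number of input channels; following it with a 1×1 convolution gives the full depth-wise separable convolution.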
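Note: In PyTorch, both in_channels and out_channels must be divisible by groups. This is because the convolution is applied by dividing the input channels into groups=g groups, each of which is convolved with its own set of filters. For a depth-wise convolution, groups equals in_channels, so out_channels must be a multiple of in_channels. See the PyTorch Conv2d documentation for more information.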
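A minimal sketch of a depth-wise separable layer in PyTorch, using the kernel and channel sizes from the text. The class name DepthwiseSeparableConv2d is made up for this example, and the 12×12 input size is an assumption chosen so that a 5×5 kernel without padding yields the 8×8 output used in the calculation. Biases are switched off so that the parameter counts match the kernel sizes discussed above.

```python
import torch
import torch.nn as nn


class DepthwiseSeparableConv2d(nn.Module):
    """Depth-wise convolution followed by a 1x1 (point-wise) convolution."""

    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # groups=in_channels gives one kernel_size x kernel_size x 1 filter per channel.
        self.depthwise = nn.Conv2d(
            in_channels, in_channels, kernel_size, groups=in_channels, bias=False
        )
        # The 1x1 convolution mixes information across channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))


x = torch.randn(1, 3, 12, 12)  # assumed input size: 12 - 5 + 1 = 8
layer = DepthwiseSeparableConv2d(in_channels=3, out_channels=256, kernel_size=5)
out = layer(x)
print(out.shape)  # torch.Size([1, 256, 8, 8])

# Compare the number of weights with a standard 5x5 convolution.
n_params = sum(p.numel() for p in layer.parameters())
standard_layer = nn.Conv2d(3, 256, kernel_size=5, bias=False)
standard_params = sum(p.numel() for p in standard_layer.parameters())
print(n_params, standard_params)  # 843 19200
```

The saving shows up in the weights as well as in the multiplications: 3*5*5 + 256*3 = 843 parameters versus 256*5*5*3 = 19200 for the standard layer.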
This post delved into two popular types of convolution: atrous (dilated) convolution and depth-wise separable convolution. We saw what they are, how they differ from the standard convolution operation, and what advantages they offer over it. Finally, we saw how atrous (dilated) and depth-wise separable convolutions can be implemented using PyTorch.