ResNeXt Explained, Part 2

Photo by Melina Kiefer on Unsplash


In this article, we are going to look at the other two forms of the ResNeXt block that take advantage of early concatenation and grouped convolutions to reformulate the original block.

Without further ado, let’s get coding!

Form B: Concatenation

Previously, we saw how a ResNeXt branch utilized Inception’s split-transform-merge strategy: Reduce the number of channels (split), run through 3 X 3 convolutions (transform), and bring the width back up and add the results of every branch together (merge).

Form B uses form A’s strategy for splitting and transforming, but it approaches merging differently: What if we concatenate the output of the branches’ transforms and then increase their width?

For instance, take the example from the first part: The 32 branches are fed 256-dimensional tensors, which they downscale to 4 channels. After that, they proceed to transform the data with 3 X 3 convolutions without changing the width. But now, instead of following in form A’s footsteps, we concatenate the output of the branches together to get 128 = 32 * 4 channels, and we double the width via 1 X 1 convolutions.

A ResNeXt block (form B). Image from “Aggregated Residual Transformations for Deep Neural Networks”

Intuitively, this method is more sensible and closer to Inception, where the outputs of the transforms (e.g. 3 X 3 and 5 X 5 convolutions) are simply concatenated without modifying the number of channels. As a matter of fact, form B is just like an Inception block except for the transforms.

Here’s the thing though: Form B is identical to form A! Everything, from the number of parameters to FLOPs, remains constant, and form B is merely another depiction of the block we saw in the previous article. Although that may appear obvious to some, it is worth inspecting the matter more closely to attain a firmer grasp on ResNeXt.

Form A Versus Form B

It should be apparent the two forms are identical with respect to splitting and transforming, so we’ll focus solely on the merge aspect. Before that, we should examine a simpler problem: Assume there are two 4-dimensional vectors, A and B, whose elements are a1, a2, a3, a4 and b1, b2, b3, b4 respectively. We would like to transform A by multiplying its elements by the values in vector X, which consists of x1, x2, x3, x4, and summing the results (i.e. a1*x1 + a2*x2 + a3*x3 + a4*x4, a.k.a. the dot product of A and X). B will be transformed as well, but its weight vector is Y (made up of y1, y2, y3, y4) instead of X. Finally, the results of both operations will be aggregated to give one final number, a1*x1 + a2*x2 + a3*x3 + a4*x4 + b1*y1 + b2*y2 + b3*y3 + b4*y4.

Another way to reach that value is if we concatenated A and B to get a larger, 8-dimensional vector C, composed of a1, a2, a3, a4, b1, b2, b3, b4. X and Y too could be concatenated into Z, a vector constituted of x1, x2, x3, x4, y1, y2, y3, y4. Now, we can calculate the dot product of C and Z to get a1*x1 + a2*x2 + a3*x3 + a4*x4 + b1*y1 + b2*y2 + b3*y3 + b4*y4, which is equal to our first answer.
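The two routes can be checked with a few lines of plain Python. This is only a small sketch: the numbers in A, B, X, and Y are arbitrary values picked for illustration, and dot is a helper defined here, not something from a library.

```python
# Two 4-dimensional vectors and their weight vectors (arbitrary example values).
A = [1, 2, 3, 4]
B = [5, 6, 7, 8]
X = [2, 0, 1, 3]
Y = [1, 4, 0, 2]


def dot(u, v):
    """Dot product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Route 1: transform A and B separately, then add the two results.
separate_then_add = dot(A, X) + dot(B, Y)

# Route 2: concatenate the inputs and the weights, then take one dot product.
C = A + B  # list concatenation gives the 8-dimensional vector
Z = X + Y
concat_then_dot = dot(C, Z)

print(separate_then_add, concat_then_dot)  # 62 62
```

Both routes give the same number because the longer dot product is just the two shorter sums written one after the other.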

Please note that A and B could be viewed as two feature maps with spatial dimensions of 1 X 1 and widths of 4, and X and Y as 1 X 1 filters with 4 channels. Moreover, as we just showed, applying X to A and Y to B and then adding the results is the same as concatenating A with B and X with Y, and then applying the concatenated filter Z to the concatenated feature map C.

We can extend this idea further: If A and B had larger spatial sizes instead of one value per channel, and there were more than just one 1 X 1 convolution per feature map, the same story would unfold.

Put another way, when we have several feature maps of arbitrary (but matching) spatial sizes, performing convolutions on them separately and then adding the results is just like concatenating the feature maps, concatenating their filters channel-wise, and then applying the combined convolution once.

That’s all there is to form B: Rather than separately enlarging the width of the 4-dimensional data outputted by the branches before summing them, we could concatenate prior to multiplying the width to get identical results.
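To make the merge step concrete, here is a rough sketch in plain Python that compares the two forms at a single pixel, using the numbers from the running example (32 branches, 4 channels each, 256 output channels). The weights are random integers chosen only so the comparison is exact, matvec is a helper written for this illustration, and batch normalization, activations, and the skip connection are left out.

```python
import random

random.seed(0)

BRANCHES, BRANCH_WIDTH, OUT_WIDTH = 32, 4, 256


def matvec(filters, pixel):
    """A 1x1 convolution at one pixel: every row of `filters` is one filter."""
    return [sum(w * x for w, x in zip(f, pixel)) for f in filters]


# Output of each branch's 3x3 transform at one pixel: 32 vectors with 4 channels.
branch_outputs = [[random.randint(-3, 3) for _ in range(BRANCH_WIDTH)]
                  for _ in range(BRANCHES)]
# Each branch's own 1x1 convolution: 256 filters with 4 channels each.
branch_weights = [[[random.randint(-3, 3) for _ in range(BRANCH_WIDTH)]
                   for _ in range(OUT_WIDTH)]
                  for _ in range(BRANCHES)]

# Form A: widen every branch to 256 channels, then sum across the branches.
form_a = [0] * OUT_WIDTH
for out, weights in zip(branch_outputs, branch_weights):
    widened = matvec(weights, out)
    form_a = [acc + val for acc, val in zip(form_a, widened)]

# Form B: concatenate the branch outputs (128 channels) and the matching
# filters (256 filters with 128 channels each), then convolve once.
concatenated = [x for out in branch_outputs for x in out]
wide_weights = [[w for weights in branch_weights for w in weights[k]]
                for k in range(OUT_WIDTH)]
form_b = matvec(wide_weights, concatenated)

print(len(concatenated), len(form_b), form_a == form_b)  # 128 256 True
```

The k-th wide filter is simply the k-th filter of every branch laid end to end, which is the same trick as building Z out of X and Y.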

Bridging the gap between form A and form B is straightforward, but form C is a tad bit more involved. Specifically, we must delve into:

Grouped Convolutions

Introduced in “ImageNet Classification with Deep Convolutional Neural Networks”, grouped convolutions were initially exploited as an engineering gimmick for training on two GPUs, but ResNeXt showed they can lead to higher accuracy if used properly.

Like with the second form of the ResNeXt block, we shall use an example to study grouped convolutions: Imagine we have a feature map consisting of 128 channels, and we would like to transform it without changing the width, which can be done using 1 X 1 convolutions. Concretely, each filter would span 128 channels for the 128 channels in the input, and there’d be 128 filters in total for the 128 channels in the output (spatial dimensions are being omitted for the sake of simplicity). Note, the filters are all given the entire 128 channels, not just half or one fourth of them; in other words, the input hasn’t been split into multiple groups, and there is just one group.

What would happen if there were two groups? The input would be divided into two smaller tensors, each with 64 channels. Consequently, the filters would be cut into two groups as well, and 64 of them would correspond to the first group of input channels, and the rest would correspond to the other group. Therefore, each filter would have 64 channels because it’s operating on only half the input, which has 64 channels.

With four groups, the input would be quartered, with each section having 32 channels, and the filters too would get divided into 4 groups of 32 filters, each with 32 channels. We can keep going, but that is the essence of grouped convolutions.

Formally, if there are n_in input channels, n_out output channels, and g groups, there would be n_in/g channels in each group, thus the filters would each have n_in/g channels. There’d be n_out/g filters per group, or n_out filters in total. A given filter operates exclusively on n_in/g channels, and the different groups do not interact with one another whatsoever.

Illustrated below we have an example of a convolution operation with 8 input channels, 4 output channels, and 4 groups. As you can see, there are 4 groups with 2 channels in each one, which indicates each filter takes in 2 = 8 / 4 channels. Since the number of groups equals the number of output channels, there’s one filter per group, but if the width of the output were multiplied by a factor of f, the number of filters would be multiplied by f as well.

Grouped convolution with 8 input channels, 4 groups, and 4 output channels. Source
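The bookkeeping from the formal definition fits in a tiny helper. The function grouped_conv_shapes and the keys it returns are invented for this sketch (they are not part of any library), and bias terms are ignored when counting weights.

```python
def grouped_conv_shapes(n_in, n_out, groups, kernel_size=1):
    """Bookkeeping for a grouped convolution (bias terms ignored).

    The function name and the returned keys are made up for this illustration.
    """
    if n_in % groups != 0 or n_out % groups != 0:
        raise ValueError("groups must divide both n_in and n_out")
    channels_per_filter = n_in // groups  # n_in / g
    filters_per_group = n_out // groups   # n_out / g
    return {
        "channels_per_filter": channels_per_filter,
        "filters_per_group": filters_per_group,
        "total_filters": filters_per_group * groups,
        "weights": n_out * channels_per_filter * kernel_size ** 2,
    }


# The 128-channel examples with 1, 2, and 4 groups, then the 8 -> 4 example.
for n_in, n_out, groups in [(128, 128, 1), (128, 128, 2), (128, 128, 4), (8, 4, 4)]:
    print((n_in, n_out, groups), grouped_conv_shapes(n_in, n_out, groups))
```

Notice that the total number of filters never changes; only the number of channels each filter sees shrinks, so the weight count drops by a factor of g.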

That should give you a sufficient understanding of the definition of grouped convolutions, but how should one think about grouped convolutions (i.e. the why, not the how)?

It is very straightforward, actually: When we perform a grouped convolution on, say, a tensor with 128 input and output channels using 32 groups, it is as if the input is divided into 32 completely independent 4-channel tensors. Then, 4 = 128 / 32 filters with widths of 4 (the number of channels in each group) are applied to each group, which spit out 4 channels per group, or 128 channels in total. To put it another way, it is similar to having 32 4-dimensional tensors and transforming them by applying 4 filters of width 4 to each one (i.e. a regular convolution with 4 input and output channels and a single group).
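Here is a minimal sketch of that idea in plain Python, simplified to a 1 X 1 kernel at a single pixel so the spatial dimensions can be ignored. The helpers conv1x1 and grouped_conv1x1 are written for this illustration, and the weights are random integers so the comparison is exact. Every pass through the loop plays the role of one independent branch, and the final check confirms that the grouped convolution matches a regular convolution whose filters are zero outside their own group.

```python
import random

random.seed(0)


def conv1x1(filters, pixel):
    """Regular (single-group) 1x1 convolution at one pixel."""
    return [sum(w * x for w, x in zip(f, pixel)) for f in filters]


def grouped_conv1x1(filters, pixel, groups):
    """Grouped 1x1 convolution at one pixel.

    `filters` holds n_out filters with n_in / groups channels each; consecutive
    blocks of n_out / groups filters belong to the same group.
    """
    in_per_group = len(pixel) // groups
    out_per_group = len(filters) // groups
    output = []
    for g in range(groups):  # each iteration is one independent "branch"
        chunk = pixel[g * in_per_group:(g + 1) * in_per_group]
        group_filters = filters[g * out_per_group:(g + 1) * out_per_group]
        output.extend(conv1x1(group_filters, chunk))
    return output


N_CHANNELS, GROUPS = 128, 32
PER_GROUP = N_CHANNELS // GROUPS  # 4 channels in and 4 channels out per group

pixel = [random.randint(-3, 3) for _ in range(N_CHANNELS)]
filters = [[random.randint(-3, 3) for _ in range(PER_GROUP)]
           for _ in range(N_CHANNELS)]

grouped = grouped_conv1x1(filters, pixel, GROUPS)

# Same weights as a regular convolution: pad every filter with zeros so it
# spans all 128 channels but only "sees" the 4 channels of its own group.
dense_filters = []
for k, f in enumerate(filters):
    g = k // PER_GROUP
    row = [0] * N_CHANNELS
    row[g * PER_GROUP:(g + 1) * PER_GROUP] = f
    dense_filters.append(row)
dense = conv1x1(dense_filters, pixel)

print(len(grouped), grouped == dense)  # 128 True
```

The zero-padded version needs 128 * 128 weights, whereas the grouped one stores only 128 * 4 of them, which is where the savings come from.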

Sound familiar? That is exactly the transform part of forms A and B, where there are 32 branches and each performs a convolution on a 4-dimensional input to give back a 4-dimensional output. It is time for:

Form C: Grouped Convolutions

Lastly, we have the third and final form of the ResNeXt block, which is the most concise of the bunch. Its distinguishing mark is the use of grouped convolutions to transform the data. Its split strategy also needs to be modified to accommodate grouped convolutions, but it’s nothing too complex.

So, we already know how transforming 32 4-dimensional tensors (e.g. form A and form B) via 3 X 3 convolutions with 4 output channels per tensor is just like concatenating those tensors and running them through a 3 X 3 convolution with 128 output channels and 32 groups. Therefore, to integrate grouped convolutions into form B, we could concatenate the results of the 1 X 1 convolutions in the branches that split the data into 4 dimensions to get one 128-dimensional tensor and perform the said grouped convolution on it. Consequently, there’d be no need for a second concatenation, and we would be left with something like:

Reformulation of form B with grouped convolutions (without a skip connection). Crappy diagram by me

Almost there! Remember how in form B, we realized concatenating first and then performing 1 X 1 convolutions is the same as doing it the other way around? Notice that in the block above, we start with 1 X 1 convolutions and concatenate thereafter, which is equivalent to beginning with concatenation then applying 1 X 1 convolutions. However, there’s only one tensor (i.e. the 256-dimensional input), so concatenation would be superfluous, and we would only need to do 1 X 1 convolutions to downscale the input to 128 channels:

Form C of the ResNeXt block. Image from “Aggregated Residual Transformations for Deep Neural Networks”
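The last step, folding the 32 per-branch 1 X 1 reductions into a single 1 X 1 convolution with 128 filters, can also be checked numerically. As before, this is just a sketch at one pixel with random integer weights, and conv1x1 is a helper defined for the illustration.

```python
import random

random.seed(0)

IN_WIDTH, BRANCHES, BRANCH_WIDTH = 256, 32, 4


def conv1x1(filters, pixel):
    """1x1 convolution at one pixel: every row of `filters` is one filter."""
    return [sum(w * x for w, x in zip(f, pixel)) for f in filters]


pixel = [random.randint(-3, 3) for _ in range(IN_WIDTH)]
# Every branch owns 4 filters that each span all 256 input channels.
branch_filters = [[[random.randint(-3, 3) for _ in range(IN_WIDTH)]
                   for _ in range(BRANCH_WIDTH)]
                  for _ in range(BRANCHES)]

# Reformulated form B: reduce inside every branch, then concatenate.
reduced_then_concat = [value
                       for filters in branch_filters
                       for value in conv1x1(filters, pixel)]

# Form C: stack all 32 * 4 filters into one bank and convolve once.
stacked_filters = [f for filters in branch_filters for f in filters]
single_conv = conv1x1(stacked_filters, pixel)

print(len(single_conv), reduced_then_concat == single_conv)  # 128 True
```

Since every branch reads the same 256-channel input, the branches differ only in their filters, so listing those filters in one bank changes nothing about the result.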

There you have it! This is the third and final form of the ResNeXt block, and it is the one libraries typically use to implement ResNeXt. Of course, there is more to this capable network than we explored here, and I strongly suggest you do a deep dive into the many concepts seen in this short series (grouped convolutions, Inception, etc.), but by now, you should have a solid understanding of ResNeXt and hopefully appreciate its various components. It really is an excellent architecture!
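As a final sanity check on the claim that the three forms are one and the same block, we can count their weights. The sketch below uses the numbers from the running example (256-channel input, 32 branches of width 4, 3 X 3 transforms); conv_weights is a helper made up for this illustration, and bias terms and batch normalization parameters are ignored.

```python
CARDINALITY, BRANCH_WIDTH, BLOCK_WIDTH, KERNEL = 32, 4, 256, 3
CONCAT_WIDTH = CARDINALITY * BRANCH_WIDTH  # 128 channels after concatenation


def conv_weights(n_in, n_out, kernel_size, groups=1):
    """Number of weights in a convolution (bias terms ignored)."""
    return n_out * (n_in // groups) * kernel_size ** 2


# Form A: every branch does 256 -> 4 (1x1), 4 -> 4 (3x3), 4 -> 256 (1x1).
form_a = CARDINALITY * (
    conv_weights(BLOCK_WIDTH, BRANCH_WIDTH, 1)
    + conv_weights(BRANCH_WIDTH, BRANCH_WIDTH, KERNEL)
    + conv_weights(BRANCH_WIDTH, BLOCK_WIDTH, 1)
)

# Form B: same split and transform, then one shared 128 -> 256 (1x1).
form_b = CARDINALITY * (
    conv_weights(BLOCK_WIDTH, BRANCH_WIDTH, 1)
    + conv_weights(BRANCH_WIDTH, BRANCH_WIDTH, KERNEL)
) + conv_weights(CONCAT_WIDTH, BLOCK_WIDTH, 1)

# Form C: 256 -> 128 (1x1), grouped 128 -> 128 (3x3, 32 groups), 128 -> 256 (1x1).
form_c = (
    conv_weights(BLOCK_WIDTH, CONCAT_WIDTH, 1)
    + conv_weights(CONCAT_WIDTH, CONCAT_WIDTH, KERNEL, groups=CARDINALITY)
    + conv_weights(CONCAT_WIDTH, BLOCK_WIDTH, 1)
)

print(form_a, form_b, form_c)  # 70144 70144 70144
```

All three come out to roughly 70 thousand weights, as expected from blocks that compute the same function.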


In this article, we learned about early concatenation, a technique for reformulating the initial ResNeXt block so it wouldn’t require individual 1 X 1 convolutions for merging, and grouped convolutions, a type of convolution that enables one to treat a tensor as multiple separate tensors and perform convolutions on each of them independently, which is used in the third form of the ResNeXt block.

This was the second half, and there’ll be no more articles after this one. I hope you learned at least one thing from my writings and enjoyed reading them as much as I did producing them!

Please, if you have any questions or feedback at all, feel welcome to post them in the comments below and as always, thank you for reading!
