Original Source Here
Spatial Transformer Networks — Backpropagation
A Self-Contained Introduction
Spatial Transformer modules, introduced by Max Jaderberg et al., are a popular way to increase spatial invariance of a model against spatial transformations such as translation, scaling, rotation, cropping, as well as non-rigid deformations. They achieve spatial invariance by adaptively transforming their input to a canonical, expected pose, thus leading to a better classification performance.
In this four-part tutorial, we cover all prerequisites needed for gaining a deep understanding of spatial transformers. In the first two posts, we have introduced the concepts of forward and reverse mapping and delved into the details of bilinear interpolation. In the last post, we have introduced all building blocks a spatial transformer module is made of. Finally, in this post, we will derive all necessary backpropagation equations from scratch.
Before we start deriving formulas, let us quickly take a look at how the gradient is flowing back through a spatial transformer module:
The animation above clearly illustrates why spatial transformers networks can be trained end-to-end using standard backpropagation. We start with the gradient at the output of the module, which already has been computed in a higher layer. The first thing we have to do, is to derive explicit formulas to propagate (or flow) this gradient back through the sampler to both the input feature map and the sampling grid. Than we will have to derive the formulas governing the backpropagation through the grid generator. Remember that sampler and grid generator are both parameter less operations, i.e. don’t have any trainable parameters. Lastly, we have to backpropagate the gradient through the localisation network, which is a standard neural network, hence no new formulas have to be derived here. It is the localisation network where a parameter update is taking place.
If you have never come across backpropagation and have problems to grasp the concept of gradient flow, please take a look at my introductory post.
Gradient w.r.t sampling grid coordinates
As in all previous posts we will assume the sampler is using bilinear interpolation to transform the input feature map. Let us quickly recall the corresponding formula, which we derived in the second post. For each entry of the sampling grid:
the sampler first finds the four neighboring values by taking the floor and ceil operations:
We dropped the superscript “𝑠“ for the sake of clarity. Next, the sampler calculates the horizontal distance from the sample point to its right cell border and the vertical distance to the top cell border:
Finally, it takes a weighted average to produce the output:
To get an intuition for the wanted derivatives, let us wiggle a single (!) entry of the sampling grid:
We see, that the wiggling affects only a single pixel in the output feature map. This is to be expected, since the sampler operates independently on each entry of the sampling grid (the reason why the sampler lends itself perfectly to parallelization). To backpropagate the loss error from the output feature map to the sampling grid, all we have to do is to apply the chain rule:
where 𝓛 is the loss function. Next, we must take the derivative of 𝑉 w.r.t to 𝑥:
which requires us to take the derivative of the horizontal distance:
To proceed further we have to take a look at the derivative of the ceiling operation:
As we can see, the ceiling operation is piecewise constant and the derivative of a constant is zero. The ceiling operation is discontinuous at integer values of 𝑥 and is non-differentiable there.
Technically speaking we cannot apply gradient descent on a non-differentiable function. Our remedy is the so-called sub-derivative, which is an extension of the derivative, see references. Practically it boils down to setting the derivative to zero at integer values of 𝑥:
Formally, we are now calculating the sub-gradients instead of gradients. Our final formula is:
and after rearranging:
For the 𝑦 component we get accordingly:
Gradient w.r.t input feature map
Before we dive into mathematical formulas, let us again first develop an intuition. This time we must wiggle the value of a pixel in the input feature map, say at coordinates (2, 1):
We see, that wiggling a single pixel in the input feature map, causes several pixels in the output feature map to change. To understand the reason, let us take a closer look at the sample points of the affected output pixels:
We notice that all mentioned sample points have something in common: the input pixel at coordinates (2, 1) always belongs to one of their four neighboring points used in bilinear interpolation. Please also notice how input pixel (2, 1) is sometimes the upper right neighbor, sometimes lower left neighbor and so on.
The chain rule used to backpropagate the error now becomes:
where the two sums consider the fact, that each pixel in the input feature map might (potentially) affect multiple pixels in the output feature map. In the next step, we must evaluate the expression ∂𝑉/ ∂𝑈, which strongly depends on the relative position of 𝑈 with respect to 𝑉‘s sample point (upper left neighbor, upper right neighbor etc.). To this end we rewrite the bilinear interpolation formula in the following way:
The corresponding derivatives are:
We now have all the necessary formulas to compute the gradient. To get a better intuition for the whole procedure, let us apply it to the example in the animation above. Here, input pixel (2, 1) affects the following five output pixels (0, 0), (0, 1), (1, 0), (1, 1) and (2, 1):
The main challenge of the procedure seems to lie in finding all the affected output pixels. Luckily, in the actual implementation we can omit the explicit search altogether by exploiting linearity:
To this end we first initialize an empty array for the gradient ∂𝓛 / ∂𝑈 and then iterate over the entries of the sampling grid. For each entry we use the second last formula to compute all four derivatives ∂𝑉/ ∂𝑈, which we subsequently multiply by the corresponding entry of the gradient
∂𝓛 / ∂𝑉. The last remaining step, is to add the four computed values to the gradient array. Please note, that each value is added at a different position, defined by the positions of the four neighboring points. At the end of the whole procedure, each entry of the gradient array will contain a complete sum of all affected output pixels.
Backpropagating through the grid generator
We have seen how the loss function depends on all the coordinates of the sampling grid:
Furthermore, each sample coordinate is a function of the parameters provided by the localisation network:
Applying the chain rule for multivariate functions hence gives us:
In the following, we will assume the grid generator is using an affine transformation:
Since the target coordinates lie on a regular sampling grid, we have:
such that the above equation reduces to:
The corresponding derivatives are:
and zero for the remaining cases. To obtain the final formulas, we have plug those derivatives back into the chain rule above:
As mentioned in the first section, the grid generator is usually implemented as a standard neural network, such as a fully connected network or a convolutional network. For this reason, we don’t need to derive any new backpropagation formulas.
Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot