Spatial Transformer Tutorial, Part 1 — Forward and Reverse Mapping

A Self-Contained Introduction

Convolutional Neural Networks (CNNs) possess the inbuilt property of translation invariance. This enables them to correctly classify an image at test time, even when its constituent components are located at positions not seen during training. However, CNNs lack inbuilt invariance to scale and rotation, two of the most frequently encountered transformations in natural images. Since this property is not built in, it has to be learnt in a laborious way: during training, all relevant objects must be presented at different scales and rotations. The network then learns a redundant set of features for each scale and each orientation, thus achieving the desired invariances. As a consequence, CNNs are usually very deep and require a lot of training data to achieve high accuracy.

Spatial transformer module transforms inputs to a canonical pose, thus simplifying recognition in the following layers (Image by author)

Spatial transformer modules are a popular way to increase the invariance of a model to spatial transformations such as translation, scaling, rotation, cropping, as well as non-rigid deformations. They can be inserted into existing convolutional architectures, either immediately following the input or in deeper layers. They achieve spatial invariance by adaptively transforming their input to a canonical, expected pose, thus leading to better classification performance. The word adaptive indicates that for each sample an appropriate transformation is produced, conditional on the input itself. Spatial transformer networks can be trained end-to-end using standard backpropagation.

In this tutorial, we are going to cover all prerequisites needed for gaining a deep understanding of spatial transformers. In this first post, we will introduce the concepts of forward and reverse mapping. In the next post we will delve into the details of bilinear interpolation. In the third post, we will introduce all building blocks a spatial transformer module is made of. Finally, in the fourth and last post, we will derive all backpropagation equations from scratch.

Input Data

Spatial transformers are most commonly used on image data. A digital image consists of a finite number of tiny squares called pixels (pixel is short for picture element) organized into rows and columns. Each pixel value represents information, such as intensity or color.

Binary image of letter “T” (Image by author)

We use a coordinate system with the 𝑦-axis oriented downwards, as is common convention in computer vision.
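As a minimal sketch of this convention, a small binary image can be stored as a NumPy array and indexed as `[y, x]`, with row 0 at the top. The 5×5 letter "T" below is a hypothetical stand-in for the figure, not the author's exact image.

```python
import numpy as np

# Hypothetical 5x5 binary image of the letter "T"
# (1 = foreground, 0 = background); not the exact image from the figure.
image = np.array([
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
], dtype=np.uint8)

height, width = image.shape  # number of rows (y) and columns (x)

# Index order is [y, x]: y counts rows downwards, x counts columns to the right.
y, x = 1, 3
print(image[y, x])  # pixel at the right end of the horizontal bar
```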

The main characteristic of image data is the spatial relation between pixels. The spatial arrangement of pixels carries crucial information about the image content. Without the rest of the pixels, a single pixel has little meaning.

For reasons of clarity we will often use the “line plot” below to visualize image data in this tutorial. In this plot, spatial location is shown on the 𝑥-axis and 𝑦-axis with intensity values along the 𝑧-axis.

Line plot of the same image (Image by author)

The line plot clearly illustrates the discrete nature of digital images, with pixel values only defined on the equally spaced grid and undefined outside.

Whenever only spatial information is of importance, for example when we derive the gradients of the spatial transformer, we will use the following top-down view:

Top-down view of a line plot (Image by author)

One thing to keep in mind: pixel values can be discrete or continuous. For example, in an 8-bit grayscale image, each pixel has a discrete value between 0 and 255, where 0 stands for black and 255 stands for white. Feature maps generated by convolutional layers, on the other hand, have continuous pixel values.
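The distinction can be illustrated with a short sketch. The rescaling to the range [0, 1] is an assumption made here for illustration (a common preprocessing step), not something this tutorial prescribes.

```python
import numpy as np

# Discrete pixel values: an 8-bit grayscale row (0 = black, 255 = white).
gray_u8 = np.array([[0, 128, 255]], dtype=np.uint8)

# Continuous pixel values: the same row as floating-point numbers.
# Dividing by 255 to land in [0, 1] is an illustrative choice.
gray_f32 = gray_u8.astype(np.float32) / 255.0

print(gray_u8.dtype, gray_f32.dtype)
```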

Spatial Transformations

A spatial transformation moves each point (𝑦, 𝑥) of the input image to a new location (𝑣, 𝑢) in the output image, while preserving, to some extent, the spatial relationships between neighboring pixels:

Transformation concepts (Image by author)

The basic spatial transformations are scaling, rotation, translation and shear. Other important types of transformations are projections and mappings.
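One common way to write the four basic transformations is as 3×3 matrices acting on homogeneous coordinates (𝑦, 𝑥, 1). The sketch below assumes that representation; the helper names `scaling`, `translation`, `rotation` and `shear` are hypothetical and chosen only for this example.

```python
import numpy as np

# Each helper returns a 3x3 matrix acting on a column vector (y, x, 1).
# Homogeneous coordinates are an assumption of this sketch.

def scaling(sy, sx):
    return np.array([[sy, 0, 0], [0, sx, 0], [0, 0, 1]], dtype=float)

def translation(ty, tx):
    return np.array([[1, 0, ty], [0, 1, tx], [0, 0, 1]], dtype=float)

def rotation(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]], dtype=float)

def shear(k):
    # Shifts x in proportion to y: (y, x) -> (y, x + k * y).
    return np.array([[1, 0, 0], [k, 1, 0], [0, 0, 1]], dtype=float)

point = np.array([2.0, 3.0, 1.0])  # (y, x, 1)
print(scaling(2, 2) @ point)       # both coordinates doubled
print(translation(1, -1) @ point)  # shifted by (+1, -1)
print(shear(0.5) @ point)          # x increased by 0.5 * y
```

Because each transformation is a matrix, several of them can be chained by matrix multiplication, for example `translation(1, 0) @ rotation(0.1)`.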

The forward transformation 𝑇{…} maps a location in input space to a location in output space:

(𝑣, 𝑢) = 𝑇{(𝑦, 𝑥)}

The inverse transformation 𝑇⁻¹{…} maps a location in output space back to a location in input space:

(𝑦, 𝑥) = 𝑇⁻¹{(𝑣, 𝑢)}
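The relationship between the two directions can be sketched with a concrete example. The matrix `T` below (scale by 2, then shift by 1 along both axes) is a hypothetical transformation chosen for illustration, and the functions `forward` and `inverse` are names introduced only for this sketch.

```python
import numpy as np

# Hypothetical transformation in homogeneous coordinates (y, x, 1):
# scale both axes by 2, then shift by 1.
T = np.array([[2.0, 0.0, 1.0],
              [0.0, 2.0, 1.0],
              [0.0, 0.0, 1.0]])
T_inv = np.linalg.inv(T)

def forward(y, x):
    """Input location (y, x) -> output location (v, u)."""
    v, u, _ = T @ np.array([y, x, 1.0])
    return v, u

def inverse(v, u):
    """Output location (v, u) -> input location (y, x)."""
    y, x, _ = T_inv @ np.array([v, u, 1.0])
    return y, x

v, u = forward(2.0, 3.0)
y, x = inverse(v, u)  # recovers the starting point up to rounding error
print((v, u), (y, x))
```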

Forward Mapping

The most straightforward way to implement spatial transformations is to iterate over each pixel of the input image, compute its new location in the output image using 𝑇{…}, and copy the pixel value to the new location:

Forward mapping (Image by author)

Most of the time, the new locations (𝑣, 𝑢) will not fall on grid points of the output image, that is, they are not integer values. We solve this by rounding 𝑣 and 𝑢 to the nearest integers and using these as output coordinates.

Forward mapping has two main disadvantages: overlaps and holes. As we see in the animation above, some output pixels receive more than one input image pixel, whereas other output pixels do not receive any input image pixels at all.
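Both problems can be reproduced with a small sketch of forward mapping. It assumes the transformation is given as a 3×3 matrix on homogeneous coordinates (𝑦, 𝑥, 1); the function `forward_map` and its hit counter are hypothetical helpers written for this example.

```python
import numpy as np

def forward_map(image, T, out_shape):
    """Copy every input pixel to its rounded location T{(y, x)}.

    Also returns how many input pixels landed on each output pixel.
    """
    out = np.zeros(out_shape, dtype=image.dtype)
    hits = np.zeros(out_shape, dtype=int)
    height, width = image.shape
    for y in range(height):
        for x in range(width):
            v, u, _ = T @ np.array([y, x, 1.0])
            vi, ui = int(np.rint(v)), int(np.rint(u))  # nearest grid point
            if 0 <= vi < out_shape[0] and 0 <= ui < out_shape[1]:
                out[vi, ui] = image[y, x]
                hits[vi, ui] += 1
    return out, hits

image = np.ones((4, 4))

# Upscaling by 2 leaves holes: only 16 of the 64 output pixels are written.
_, hits_up = forward_map(image, np.diag([2.0, 2.0, 1.0]), (8, 8))
print("holes:", (hits_up == 0).sum())

# Downscaling by 0.4 causes overlaps: several inputs land on the same output.
_, hits_down = forward_map(image, np.diag([0.4, 0.4, 1.0]), (2, 2))
print("overlapping outputs:", (hits_down > 1).sum())
```

In the overlap case the last pixel written simply overwrites the earlier ones, so the result depends on the iteration order.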

Due to the disadvantages of the forward mapping method, in practice a different technique, called reverse mapping, is used.

Reverse Mapping

Reverse mapping iterates over each pixel on the grid of the output image, uses the inverse transformation 𝑇⁻¹{…} to determine its corresponding position in the input image, samples the pixel value at this position, and uses that value as the output pixel:

Reverse mapping (Image by author)

This method completely avoids problems with overlaps and holes. Reverse mapping is also the method used in spatial transformers.

As we have seen in the animation above, most of the time the determined positions in the input image (𝑦, 𝑥) don’t lie on the grid of the input image. In order to get an approximate value for the input image at these undefined non-integer positions, we must interpolate the pixel value. An interpolation technique called bilinear interpolation will be introduced in the next post.
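The procedure can be sketched as follows. Since bilinear interpolation is only covered in the next post, this sketch substitutes nearest-neighbor sampling (rounding to the closest input pixel) as a simpler stand-in. The function `reverse_map`, its `fill` parameter, and the matrix representation of 𝑇 are assumptions of this example.

```python
import numpy as np

def reverse_map(image, T, out_shape, fill=0):
    """For every output pixel (v, u), sample the input at T^-1{(v, u)}.

    Uses nearest-neighbor sampling as a stand-in for bilinear interpolation.
    Output pixels whose source falls outside the input keep the value `fill`.
    """
    T_inv = np.linalg.inv(T)
    height, width = image.shape
    out = np.full(out_shape, fill, dtype=image.dtype)
    for v in range(out_shape[0]):
        for u in range(out_shape[1]):
            y, x, _ = T_inv @ np.array([v, u, 1.0])
            yi, xi = int(np.rint(y)), int(np.rint(x))  # closest input pixel
            if 0 <= yi < height and 0 <= xi < width:
                out[v, u] = image[yi, xi]
    return out

# Input with strictly positive values, so an unwritten output pixel would show as 0.
image = np.arange(1, 17, dtype=float).reshape(4, 4)
upscale = np.diag([2.0, 2.0, 1.0])  # same upscaling that produces holes in forward mapping

out = reverse_map(image, upscale, (7, 7))
print((out > 0).all())  # every output pixel received a value
```

Each output pixel is visited exactly once, which is why neither holes nor overlaps can occur inside the region covered by the input.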

Multiple Channels

So far, we have demonstrated the principles of forward mapping and reverse mapping on inputs with a single channel 𝐶=1, as encountered in e.g. grayscale images. However, most of the time inputs will have more than one channel 𝐶 > 1, such as RGB images which have three channels or feature maps in deep learning architectures which can have an arbitrary number of channels.

The extension is simple: for multi-channel inputs, the same mapping is applied to each channel of the input, so every channel is transformed in an identical way. This preserves spatial consistency between channels. Note that spatial transformations do not change the number of channels 𝐶, which is the same in input and output maps.
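A minimal sketch of the multi-channel case is shown below. It assumes a channels-last layout (𝐻, 𝑊, 𝐶), a transformation given as a 3×3 matrix, and nearest-neighbor sampling; the function name `reverse_map_multichannel` is hypothetical. The sampling positions are computed once and reused for all channels.

```python
import numpy as np

def reverse_map_multichannel(image, T, out_shape):
    """Reverse mapping for an (H, W, C) input with nearest-neighbor sampling.

    The sampling positions are shared by all C channels.
    """
    height, width, channels = image.shape
    out_h, out_w = out_shape
    T_inv = np.linalg.inv(T)

    # Homogeneous coordinates (v, u, 1) of every output pixel, shape (3, N).
    v, u = np.meshgrid(np.arange(out_h), np.arange(out_w), indexing="ij")
    coords = np.stack([v.ravel(), u.ravel(), np.ones(v.size)])

    y, x, _ = T_inv @ coords
    yi = np.rint(y).astype(int)
    xi = np.rint(x).astype(int)
    valid = (yi >= 0) & (yi < height) & (xi >= 0) & (xi < width)

    out = np.zeros((out_h * out_w, channels), dtype=image.dtype)
    out[valid] = image[yi[valid], xi[valid]]  # copies all channels at once
    return out.reshape(out_h, out_w, channels)

rgb = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
shift_down = np.array([[1.0, 0.0, 1.0],   # move content one pixel down
                       [0.0, 1.0, 0.0],
                       [0.0, 0.0, 1.0]])

out = reverse_map_multichannel(rgb, shift_down, (4, 4))
print(out.shape)  # the number of channels is unchanged
```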

The next part of this series will be published on September 13th.



