Paper Reading — Object Detection in 20 Years: A Survey (Part 2)




Fast R-CNN (2015)

Original paper: Fast R-CNN

Overview

We propose a new training algorithm that fixes the disadvantages of R-CNN and SPPnet while improving their speed and accuracy. The Fast R-CNN method has several advantages:

  • Higher detection quality (mAP) than R-CNN and SPPnet.
  • Training is single-stage, using a multi-task loss. In R-CNN and SPPnet, by contrast, training is a multi-stage pipeline that involves extracting features, fine-tuning a network with log loss, training SVMs, and finally fitting bounding-box regressors.
  • Training can update all network layers. In SPPnet, by contrast, the fine-tuning algorithm cannot update the convolutional layers that precede the spatial pyramid pooling, which limits the accuracy of deep networks.
  • No disk storage is required for feature caching. For both R-CNN and SPPnet, features are extracted and written to disk during the SVM and bounding-box regressor training stages, and these features require hundreds of gigabytes of storage.

Although Fast R-CNN successfully integrates the advantages of R-CNN and SPPnet, its detection speed is still limited by proposal detection. A question then naturally arises: “can we generate object proposals with a CNN model?”

Fast R-CNN architecture

Figure 1. Fast R-CNN architecture.

An input image and multiple regions of interest (RoIs) are input into a fully convolutional network. Each RoI is pooled into a fixed-size feature map and then mapped to a feature vector by fully connected layers (FCs). The network has two output vectors per RoI: softmax probabilities and per-class bounding-box regression offsets. The architecture is trained end-to-end with a multi-task loss.

The RoI pooling layer

The RoI pooling layer uses max pooling to convert the features inside any valid region of interest into a small feature map with a fixed spatial extent of H×W (e.g., 7 × 7), where H and W are layer hyper-parameters that are independent of any particular RoI.

Each RoI is defined by a four-tuple (r, c, h, w) that specifies its top-left corner (r, c) and its height and width (h, w). RoI max pooling works by dividing the h×w RoI window into an H×W grid of sub-windows of approximate size h/H × w/W and then max-pooling the values in each sub-window into the corresponding output grid cell. Pooling is applied independently to each feature map channel, as in standard max pooling.

The RoI layer is simply the special case of the spatial pyramid pooling layer used in SPPnets in which there is only one pyramid level.
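
A minimal NumPy sketch of this pooling rule, assuming a (channels, height, width) feature-map layout and an RoI already projected onto the feature map (the function and argument names are illustrative, not from the paper):

```python
import numpy as np

def roi_max_pool(feature_map, roi, H=7, W=7):
    """Max-pool one RoI into a fixed H x W grid.

    feature_map: array of shape (C, fh, fw), conv features of the image
    roi: (r, c, h, w) = top-left row/col and height/width on the feature map
    Returns an array of shape (C, H, W).
    """
    r, c, h, w = roi
    out = np.zeros((feature_map.shape[0], H, W))
    for i in range(H):
        # sub-window rows for output cell i (approximately h/H tall)
        r0 = r + int(np.floor(i * h / H))
        r1 = r + int(np.ceil((i + 1) * h / H))
        for j in range(W):
            # sub-window cols for output cell j (approximately w/W wide)
            c0 = c + int(np.floor(j * w / W))
            c1 = c + int(np.ceil((j + 1) * w / W))
            # max over the sub-window, independently per channel
            out[:, i, j] = feature_map[:, r0:r1, c0:c1].max(axis=(1, 2))
    return out
```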

Initializing from pre-trained networks

We experiment with three pre-trained ImageNet networks. When a pre-trained network initializes a Fast R-CNN network, it undergoes three transformations.

  • First, the last max-pooling layer is replaced by an RoI pooling layer that is configured by setting H and W to be compatible with the net’s first fully connected layer (e.g., H = W = 7 for VGG16).
  • Second, the network’s last fully connected layer and softmax are replaced with two sibling layers: a fully connected layer with a softmax over K+1 categories, and category-specific bounding-box regressors (see the sketch after this list).
  • Third, the network is modified to take two data inputs: a list of images and a list of RoIs in those images.
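
A hypothetical PyTorch sketch of the two sibling output layers; the class name, feature dimension, and the choice of 4 offsets per object class (background gets no box) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class FastRCNNHeads(nn.Module):
    """Two sibling output layers applied to the per-RoI feature vector."""
    def __init__(self, feat_dim=4096, num_classes=20):
        super().__init__()
        K = num_classes
        # softmax classifier over K object classes + 1 background class
        self.cls_score = nn.Linear(feat_dim, K + 1)
        # category-specific box regressors: 4 offsets per object class
        self.bbox_pred = nn.Linear(feat_dim, 4 * K)

    def forward(self, fc7):
        # fc7: (num_rois, feat_dim) feature vectors, one per RoI
        return self.cls_score(fc7), self.bbox_pred(fc7)
```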

Multi-task loss

Fast R-CNN uses a streamlined training process with one fine-tuning stage that jointly optimizes a softmax classifier and bounding-box regressors, rather than training a softmax classifier, SVMs, and regressors in three separate stages. Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression:

L(p, u, t^u, v) = L_cls(p, u) + λ [u ≥ 1] L_loc(t^u, v)  (Eq. 1)

  • The first task loss, L_cls(p, u) = −log p_u, is log loss for the true class u. Here p = (p_0, …, p_K) is a discrete probability distribution (per RoI) over K + 1 categories.
  • The second task loss, L_loc, is defined over a tuple of true bounding-box regression targets for class u, v = (v_x, v_y, v_w, v_h), and a predicted tuple t^u = (t^u_x, t^u_y, t^u_w, t^u_h) for class u. The Iverson bracket indicator function [u ≥ 1] evaluates to 1 when u ≥ 1 and 0 otherwise. By convention, the catch-all background class is labeled u=0. Thus, for background RoIs, L_loc is ignored.

For bounding-box regression, we use a robust smooth L1 loss, which is less sensitive to outliers than the L2 loss used in R-CNN and SPPnet: smooth_L1(x) = 0.5x² if |x| < 1 and |x| − 0.5 otherwise. When the regression targets are unbounded, training with an L2 loss can require careful tuning of learning rates in order to prevent exploding gradients.

The hyper-parameter λ in Eq. 1 controls the balance between the two task losses. We normalize the ground-truth regression targets v_i to have zero mean and unit variance. All experiments use λ=1.
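
A minimal PyTorch sketch of Eq. 1, assuming class scores of shape (R, K+1) and per-class box offsets of shape (R, 4K); the function and variable names and the foreground indexing convention are mine, not prescribed by the paper:

```python
import torch
import torch.nn.functional as F

def multi_task_loss(cls_scores, bbox_pred, labels, bbox_targets, lam=1.0):
    """L = L_cls + lam * [u >= 1] * L_loc, averaged over the mini-batch RoIs.

    cls_scores: (R, K+1) raw class scores; bbox_pred: (R, 4K) per-class offsets;
    labels: (R,) ground-truth class u in {0..K}; bbox_targets: (R, 4) targets v.
    """
    # L_cls: log loss for the true class (softmax + negative log-likelihood)
    loss_cls = F.cross_entropy(cls_scores, labels)

    # L_loc: smooth L1 on the offsets of the true class, foreground RoIs only
    fg = labels > 0                          # [u >= 1] Iverson bracket
    if fg.any():
        idx = labels[fg] - 1                 # which 4-offset block to pick
        rows = torch.arange(bbox_pred.size(0))[fg]
        pred = bbox_pred.view(-1, bbox_pred.size(1) // 4, 4)[rows, idx]
        loss_loc = F.smooth_l1_loss(pred, bbox_targets[fg], reduction='mean')
    else:
        loss_loc = bbox_pred.sum() * 0.0     # no foreground RoIs in this batch
    return loss_cls + lam * loss_loc
```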

Mini-batch sampling

We propose a more efficient training method that takes advantage of feature sharing during training. In Fast R-CNN training, stochastic gradient descent (SGD) mini-batches are sampled hierarchically, first by sampling N images and then by sampling R/N RoIs from each image. Critically, RoIs from the same image share computation and memory in the forward and backward passes. Making N small decreases mini-batch computation. For example, when using N=2 and R=128, the proposed training scheme is roughly 64× faster than sampling one RoI from 128 different images (i.e., the R-CNN and SPPnet strategy).

One concern over this strategy is that it may cause slow training convergence, because RoIs from the same image are correlated. This concern does not appear to be a practical issue, and we achieve good results with N=2 and R=128 using fewer SGD iterations than R-CNN.

In detail, each SGD mini-batch is constructed from N=2 images, chosen uniformly at random. We use mini-batches of size R=128, sampling 64 RoIs from each image. We take 25% of the RoIs from object proposals that have intersection over union (IoU) overlap with a ground-truth bounding box of at least 0.5. These RoIs comprise the examples labeled with a foreground object class, i.e. u ≥ 1. The remaining RoIs are sampled from object proposals that have a maximum IoU with ground truth in the interval [0.1, 0.5). These are the background examples and are labeled with u=0. The lower threshold of 0.1 appears to act as a heuristic for hard example mining. During training, images are horizontally flipped with a probability of 0.5. No other data augmentation is used.
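
A short sketch of this per-image sampling rule, assuming each proposal’s maximum IoU with the ground truth has already been computed and that at least one background proposal exists (function and parameter names are illustrative):

```python
import numpy as np

def sample_rois(proposal_ious, fg_frac=0.25, rois_per_image=64, rng=None):
    """Sample foreground/background RoI indices for one image.

    proposal_ious: (P,) max IoU of each object proposal with any ground-truth box.
    Returns (foreground indices, background indices).
    """
    rng = rng or np.random.default_rng()
    fg_inds = np.flatnonzero(proposal_ious >= 0.5)            # labeled u >= 1
    bg_inds = np.flatnonzero((proposal_ious >= 0.1) &
                             (proposal_ious < 0.5))           # labeled u = 0
    n_fg = min(int(fg_frac * rois_per_image), len(fg_inds))   # 25% foreground
    n_bg = rois_per_image - n_fg
    fg = rng.choice(fg_inds, size=n_fg, replace=False)
    bg = rng.choice(bg_inds, size=n_bg, replace=len(bg_inds) < n_bg)
    return fg, bg
```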

Back-propagation through RoI pooling layers

For clarity, we assume only one image per mini-batch (N=1). Let x_i ∈ ℝ be the i-th activation input into the RoI pooling layer and let y_rj be the layer’s j-th output from the r-th RoI. The RoI pooling layer computes y_rj = x_{i*(r, j)}, where i*(r, j) = argmax_{i′ ∈ R(r, j)} x_{i′} and R(r, j) is the index set of inputs in the sub-window over which the output unit y_rj max-pools. A single x_i may be assigned to several different outputs y_rj. The RoI pooling layer’s backwards function computes the partial derivative of the loss function with respect to each input variable x_i by following the argmax switches:

∂L/∂x_i = Σ_r Σ_j [i = i*(r, j)] ∂L/∂y_rj

In words, for each mini-batch RoI r and each pooling output unit y_rj, the partial derivative ∂L/∂y_rj is accumulated into ∂L/∂x_i if i is the argmax selected for y_rj by max pooling. In back-propagation, the partial derivatives ∂L/∂y_rj are already computed by the backwards function of the layer on top of the RoI pooling layer.
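
A NumPy sketch of this backward rule, assuming the argmax indices i*(r, j) were recorded during the forward pass (the array layout is an assumption for illustration):

```python
import numpy as np

def roi_pool_backward(grad_output, argmax_indices, num_inputs):
    """Route each dL/dy_rj to the input that won the max in the forward pass.

    grad_output: (num_rois, H*W) upstream gradients dL/dy_rj
    argmax_indices: (num_rois, H*W) flat index i*(r, j) chosen in the forward pass
    Returns grad_input of shape (num_inputs,) holding dL/dx_i.
    """
    grad_input = np.zeros(num_inputs)
    for r in range(grad_output.shape[0]):
        for j in range(grad_output.shape[1]):
            # dL/dx_i += [i == i*(r, j)] * dL/dy_rj (accumulate when an input
            # wins the max for several outputs)
            grad_input[argmax_indices[r, j]] += grad_output[r, j]
    return grad_input
```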

Scale invariance

We explore two ways of achieving scale-invariant object detection.

  • “Brute force” learning. Each image is processed at a pre-defined pixel size during both training and testing. The network must directly learn scale-invariant object detection from the training data.
  • Using image pyramids. This method provides approximate scale-invariance to the network through an image pyramid. At test-time, the image pyramid is used to approximately scale-normalize each object proposal. During multi-scale training, we randomly sample a pyramid scale each time an image is sampled, as a form of data augmentation.

Test time object detection

At test-time, R is typically around 2000. When using an image pyramid, each RoI is assigned to the scale such that the scaled RoI is closest to 224×224 pixels in area. For each test RoI r, the forward pass outputs a class posterior probability distribution p and a set of predicted bounding-box offsets relative to r (each of the K classes gets its own refined bounding-box prediction). We assign a detection confidence to r for each object class k using the estimated probability Pr(class = k | r) ≜ p_k.

We then perform non-maximum suppression independently for each class using the algorithm and settings from R-CNN.
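
For reference, a standard greedy per-class NMS in NumPy; the 0.3 IoU threshold is the setting commonly attributed to R-CNN and is an assumption here, not taken from this text:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    """Greedy non-maximum suppression for one class.

    boxes: (n, 4) array of (x1, y1, x2, y2); scores: (n,) confidences for
    this class. Returns the indices of the boxes that are kept.
    """
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # IoU of the top-scoring box with every remaining box
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop remaining boxes that overlap the kept box too much
        order = order[1:][iou <= iou_thresh]
    return keep
```

At test time this would be run once per object class k on the boxes scored with p_k.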

Truncated SVD for faster detection

Figure 2. Timing for VGG16 before and after truncated SVD.

For whole-image classification, the time spent computing the fully connected layers is small compared to the conv layers. In contrast, for detection the number of RoIs to process is large, and nearly half of the forward-pass time is spent computing the fully connected layers (before SVD, fully connected layers fc6 and fc7 take 45% of the time). Large fully connected layers are easily accelerated by compressing them with truncated SVD. In this technique, a layer parameterized by the weight matrix W (u×v) is approximately factorized as

W ≈ U Σ_t V^T

using SVD. In this factorization, U is a u×t matrix comprising the first t left-singular vectors of W, Σ_t is a t×t diagonal matrix containing the top t singular values of W, and V is a v×t matrix comprising the first t right-singular vectors of W. Truncated SVD reduces the parameter count from uv to t(u + v), which can be significant if t is much smaller than min(u, v).

To compress a network, the single fully connected layer corresponding to W is replaced by two fully connected layers, without a non-linearity between them. The first of these layers uses the weight matrix Σ_t V^T (and no biases) and the second uses U (with the original biases associated with W). This simple compression method gives good speedups when the number of RoIs is large.
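
A NumPy sketch of this compression step for a single fully connected layer (the helper name and return convention are mine):

```python
import numpy as np

def compress_fc_layer(W, b, t):
    """Replace one fc layer (y = W x + b, W of shape u x v) by two layers
    using a rank-t truncated SVD: W ~= U_t diag(s_t) V_t^T.

    Returns (W1, W2): the first new layer uses W1 = diag(s_t) V_t^T with no
    bias, the second uses W2 = U_t with the original bias b.
    """
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    W1 = np.diag(s[:t]) @ Vt[:t]        # shape (t, v): t*v parameters
    W2 = U[:, :t]                       # shape (u, t): t*u parameters
    # W2 @ W1 approximates W; parameter count drops from u*v to t*(u + v)
    return W1, W2
```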
