Object Detection Explained: Single Shot MultiBox Detector


Hard concepts in simple language.

Object detection combines two tasks: classification and localization. Last time I covered the R-CNN series of object detectors, which work in two stages: a region proposal stage (the Region Proposal Network in Faster R-CNN) followed by the classification and box refinement heads. Now we are moving on to one-stage object detectors. In this article, I would like to introduce the Single Shot MultiBox Detector (SSD).






Bounding Box Regression

As in Faster R-CNN, the authors regress to offsets for the center (cx, cy) of the default bounding box (d) and for its width (w) and height (h). Thus, the formula looks as follows:
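For reference, the regression targets in the SSD paper (Eq. 2) can be written as below. Here g is a ground truth box, d is a default box, l is the predicted offset, and x_ij^k is 1 when default box i is matched to ground truth box j of class k and 0 otherwise. This is a transcription of the paper's notation, so please check it against the original if you rely on it.

```latex
L_{loc}(x, l, g) = \sum_{i \in Pos}^{N} \sum_{m \in \{cx, cy, w, h\}}
    x_{ij}^{k} \, \mathrm{smooth}_{L1}\!\left(l_i^{m} - \hat{g}_j^{m}\right)

\begin{aligned}
\hat{g}_j^{cx} &= \frac{g_j^{cx} - d_i^{cx}}{d_i^{w}}, &
\hat{g}_j^{cy} &= \frac{g_j^{cy} - d_i^{cy}}{d_i^{h}}, \\
\hat{g}_j^{w}  &= \log\frac{g_j^{w}}{d_i^{w}}, &
\hat{g}_j^{h}  &= \log\frac{g_j^{h}}{d_i^{h}}
\end{aligned}
```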

Bounding box regression. Paper: https://arxiv.org/pdf/1512.02325.pdf
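The offset encoding described above can be sketched in plain Python. This is a minimal illustration, not the paper's code: the function names `encode` and `decode` are my own, boxes are assumed to be (cx, cy, w, h) tuples, and the variance scaling that many implementations add on top is left out.

```python
import math


def encode(gt, prior):
    """Turn a ground truth box into regression targets relative to a prior.

    Both boxes are (cx, cy, w, h) tuples in the same units.
    """
    g_cx, g_cy, g_w, g_h = gt
    d_cx, d_cy, d_w, d_h = prior
    return (
        (g_cx - d_cx) / d_w,   # center shift, in units of the prior width
        (g_cy - d_cy) / d_h,   # center shift, in units of the prior height
        math.log(g_w / d_w),   # log of the width ratio
        math.log(g_h / d_h),   # log of the height ratio
    )


def decode(offsets, prior):
    """Invert encode(): turn predicted offsets back into a (cx, cy, w, h) box."""
    t_cx, t_cy, t_w, t_h = offsets
    d_cx, d_cy, d_w, d_h = prior
    return (
        t_cx * d_w + d_cx,
        t_cy * d_h + d_cy,
        math.exp(t_w) * d_w,
        math.exp(t_h) * d_h,
    )
```

Decoding the encoded targets should give back the original ground truth box, which is a handy sanity check when you write your own version.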


SSD Architecture. Paper: https://arxiv.org/pdf/1512.02325.pdf

The figure above shows the SSD architecture with VGG-16 as the backbone. I am going to explain the architecture by breaking it down into three parts: the base network (backbone), the auxiliary convolutions and the prediction convolutions. I am also going to provide pieces of code for your convenience.

Base Network

I would like to highlight that the following examples assume an input image of 300 by 300 pixels, as in the original paper.

We use the simple and well-known VGG-16 network to extract the conv4_3 and conv7 feature maps. Their dimensions are (N, 512, 38, 38) and (N, 1024, 19, 19) respectively, where N is the batch size. I hope this part is straightforward and clear enough to move on to the auxiliary convolutions.

Auxiliary Convolutions

Auxiliary convolutions give us additional features on top of the base VGG-16 network. These layers decrease in size progressively, which allows detections to be predicted at multiple scales. Their input is the conv7 feature map obtained from the VGG-16 network. As we apply the convolutions and ReLU activation functions, we keep the intermediate feature maps conv8_2, conv9_2, conv10_2 and conv11_2. Please take your time to follow the dimensions of the feature maps 🙂
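To see where the feature map sizes come from, here is a small sketch that only does the size arithmetic, with no actual convolutions. The layer settings (ceil-mode pooling before conv4_3, a 3x3 stride-1 pool5, a dilated conv6, and the strides and paddings of the auxiliary layers) are assumptions taken from common SSD300 implementations, and the helper names are hypothetical.

```python
import math


def out_size(n, kernel, stride=1, padding=0, dilation=1, ceil_mode=False):
    """Spatial size after a convolution or pooling layer."""
    span = n + 2 * padding - dilation * (kernel - 1) - 1
    steps = span / stride
    return (math.ceil(steps) if ceil_mode else math.floor(steps)) + 1


def ssd300_feature_sizes(image_size=300):
    """Follow the spatial size through VGG-16 and the auxiliary layers."""
    sizes = {}
    n = image_size
    # The 3x3 convolutions in VGG-16 use padding 1 and keep the size,
    # so only the pooling layers are tracked here.
    n = out_size(n, 2, stride=2)                  # pool1
    n = out_size(n, 2, stride=2)                  # pool2
    n = out_size(n, 2, stride=2, ceil_mode=True)  # pool3 (assumed ceil mode)
    sizes["conv4_3"] = n
    n = out_size(n, 2, stride=2)                  # pool4
    n = out_size(n, 3, stride=1, padding=1)       # pool5 (assumed 3x3, stride 1)
    n = out_size(n, 3, padding=6, dilation=6)     # conv6 (assumed dilated)
    sizes["conv7"] = n                            # conv7 is 1x1, size unchanged
    # Auxiliary layers: each *_1 layer is a 1x1 convolution (size unchanged),
    # each *_2 layer is a 3x3 convolution that shrinks the map.
    n = out_size(n, 3, stride=2, padding=1)
    sizes["conv8_2"] = n
    n = out_size(n, 3, stride=2, padding=1)
    sizes["conv9_2"] = n
    n = out_size(n, 3)
    sizes["conv10_2"] = n
    n = out_size(n, 3)
    sizes["conv11_2"] = n
    return sizes
```

Calling `ssd300_feature_sizes()` walks a 300 by 300 input down through the six feature maps used for prediction.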

Choosing default bounding boxes

This might sound scary, but do not worry, it is still simple to grasp. Default bounding boxes are chosen manually. Each feature map layer is assigned a scale value: conv4_3 detects objects at the smallest scale of 0.2 (or sometimes 0.1), and the scale then increases linearly up to 0.9 for conv11_2 (obtained from the auxiliary convolutions). There is also a fixed number of prior boxes per position in each feature map. Layers making 4 predictions per position use three aspect ratios, 1, 2 and 0.5, plus one extra box with aspect ratio 1 at the additional scale sqrt(s_k * s_(k+1)), where s_k is the scale of the kth feature map. Layers making 6 predictions add the aspect ratios 3 and 1/3. For a box with scale s_k and aspect ratio a_r, the width and height are w = s_k * sqrt(a_r) and h = s_k / sqrt(a_r).

Summed over all six feature maps, this gives 8732 prior boxes, one for each of the 8732 predictions that SSD makes.
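The prior box generation can be sketched as follows. This is an illustration under assumptions rather than the original code: the scales use the 0.1 variant for conv4_3, the extra scale for the last feature map is set to 1.0, and the names `FEATURE_MAPS` and `create_prior_boxes` are hypothetical. Boxes are (cx, cy, w, h) in fractional image coordinates and are not clipped to the image.

```python
import math

# (name, feature map size, scale, aspect ratios); values are assumptions
FEATURE_MAPS = [
    ("conv4_3", 38, 0.1, (1.0, 2.0, 0.5)),
    ("conv7", 19, 0.2, (1.0, 2.0, 3.0, 0.5, 1 / 3)),
    ("conv8_2", 10, 0.375, (1.0, 2.0, 3.0, 0.5, 1 / 3)),
    ("conv9_2", 5, 0.55, (1.0, 2.0, 3.0, 0.5, 1 / 3)),
    ("conv10_2", 3, 0.725, (1.0, 2.0, 0.5)),
    ("conv11_2", 1, 0.9, (1.0, 2.0, 0.5)),
]


def create_prior_boxes(feature_maps=FEATURE_MAPS):
    """Return a list of (cx, cy, w, h) prior boxes for all feature maps."""
    priors = []
    for k, (_name, size, scale, ratios) in enumerate(feature_maps):
        # Extra box of aspect ratio 1 at the geometric mean of this scale
        # and the next one; the last map has no next scale, so 1.0 is assumed.
        if k + 1 < len(feature_maps):
            extra_scale = math.sqrt(scale * feature_maps[k + 1][2])
        else:
            extra_scale = 1.0
        for i in range(size):
            for j in range(size):
                # Center of the cell (i, j) in fractional coordinates
                cx = (j + 0.5) / size
                cy = (i + 0.5) / size
                for ratio in ratios:
                    w = scale * math.sqrt(ratio)
                    h = scale / math.sqrt(ratio)
                    priors.append((cx, cy, w, h))
                priors.append((cx, cy, extra_scale, extra_scale))
    return priors
```

The count works out as 38*38*4 + 19*19*6 + 10*10*6 + 5*5*6 + 3*3*4 + 1*1*4 = 8732, regardless of the exact scale values chosen.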

Prediction Convolutions

This might look complex, but the idea is simple: we take all the feature maps from the base VGG-16 network and the auxiliary convolutions, and apply convolution layers to each of them to predict classes and bounding boxes. Take your time to understand it, and make sure you are following along by paying attention to the dimensions of the feature maps.
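As a rough sketch of the bookkeeping involved, the snippet below lists, for each feature map, the input and output channels of a localization convolution and a classification convolution, along with the number of predictions it contributes. The channel counts of the auxiliary layers, the default of 21 classes (20 object classes plus background, as in Pascal VOC) and the function name are assumptions for illustration.

```python
def prediction_head_shapes(n_classes=21, batch_size=1):
    """Describe the prediction convolutions without building any layers."""
    feature_maps = [
        # (name, input channels, spatial size, boxes per position)
        ("conv4_3", 512, 38, 4),
        ("conv7", 1024, 19, 6),
        ("conv8_2", 512, 10, 6),
        ("conv9_2", 256, 5, 6),
        ("conv10_2", 256, 3, 4),
        ("conv11_2", 256, 1, 4),
    ]
    heads = {}
    total = 0
    for name, channels, size, n_boxes in feature_maps:
        n_predictions = size * size * n_boxes
        heads[name] = {
            # (in_channels, out_channels): 4 offsets per box
            "loc_conv": (channels, n_boxes * 4),
            # (in_channels, out_channels): one score per class per box
            "cls_conv": (channels, n_boxes * n_classes),
            "n_predictions": n_predictions,
        }
        total += n_predictions
    # After reshaping and concatenating the outputs of all six feature maps
    locs_shape = (batch_size, total, 4)
    scores_shape = (batch_size, total, n_classes)
    return heads, locs_shape, scores_shape
```

The concatenated outputs line up one-to-one with the prior boxes, so each prior gets 4 offsets and one score per class.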

Wrap Up

Now let’s put it all together and look at the final architecture.

Notice that the lower-level features (conv4_3_feats) have considerably larger magnitudes than the other feature maps, so we L2-normalize them across channels at each location and then rescale them. The rescale factor is initially set to 20, but is learned for each channel during backpropagation.
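The normalization step can be illustrated for a single spatial location. This is a minimal sketch in plain Python: the function name and the `eps` term for numerical safety are my own additions, and in a real network the rescale factors would be learnable parameters rather than a fixed list.

```python
import math


def l2_normalize_and_rescale(channel_values, rescale_factors, eps=1e-10):
    """Normalize the channel vector at one location, then rescale per channel.

    channel_values: feature values across channels at one (row, col) position.
    rescale_factors: one factor per channel (initialized to 20 in SSD).
    """
    norm = math.sqrt(sum(v * v for v in channel_values)) + eps
    return [v / norm * s for v, s in zip(channel_values, rescale_factors)]


# Example: a 2-channel "feature map" location with L2 norm 5
values = [3.0, 4.0]
factors = [20.0, 20.0]
rescaled = l2_normalize_and_rescale(values, factors)
```

With all factors equal to 20, the rescaled vector keeps its direction and ends up with an L2 norm of about 20.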


Loss. Paper: https://arxiv.org/pdf/1512.02325.pdf

The loss should look familiar from the previous articles on the R-CNN series. The localization loss is the Smooth L1 loss, whereas the classification loss is the well-known cross-entropy loss.
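Here is a small sketch of how the two terms combine, following the form L = (L_conf + alpha * L_loc) / N from the paper, where N is the number of matched (positive) priors and the loss is 0 when N is 0. The function names are hypothetical, label 0 is assumed to mean background, and the inputs are assumed to contain only the priors that take part in the loss (positives plus the selected negatives).

```python
import math


def smooth_l1(x):
    """Quadratic near zero, linear for large errors."""
    x = abs(x)
    return 0.5 * x * x if x < 1.0 else x - 0.5


def cross_entropy(class_scores, target):
    """Softmax cross-entropy for one prior, computed in a stable way."""
    m = max(class_scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in class_scores))
    return log_sum - class_scores[target]


def multibox_loss(loc_preds, loc_targets, class_scores, labels, alpha=1.0):
    """loc_preds, loc_targets: lists of 4-tuples of offsets, one per prior.
    class_scores: list of per-class score lists, one per prior.
    labels: class index per prior, with 0 assumed to be background.
    """
    positives = [i for i, label in enumerate(labels) if label != 0]
    if not positives:
        return 0.0
    # Localization loss only over positive priors
    loc_loss = sum(
        smooth_l1(p - t)
        for i in positives
        for p, t in zip(loc_preds[i], loc_targets[i])
    )
    # Confidence loss over every prior passed in
    conf_loss = sum(cross_entropy(s, y) for s, y in zip(class_scores, labels))
    return (conf_loss + alpha * loc_loss) / len(positives)
```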

Matching strategy

During training, we need to determine which of the generated prior boxes correspond to our ground truth boxes, so that they can be included in the loss calculation. First, we match each ground truth box to the prior box with the highest Jaccard overlap (intersection over union). Additionally, we match prior boxes to any ground truth box with which they have an overlap of at least 0.5, which allows the network to predict high scores for multiple overlapping boxes.
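The two matching rules can be sketched like this. It is a simplified illustration, not the reference implementation: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, the function names are hypothetical, and -1 is used to mark a prior as background.

```python
def jaccard(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_priors(gt_boxes, priors, threshold=0.5):
    """Return, for each prior, the index of its ground truth box or -1."""
    matches = [-1] * len(priors)
    if not gt_boxes:
        return matches
    overlaps = [[jaccard(g, p) for p in priors] for g in gt_boxes]
    # Rule 2: a prior is matched to its best ground truth box
    # if the overlap reaches the threshold.
    for p in range(len(priors)):
        best_gt = max(range(len(gt_boxes)), key=lambda g: overlaps[g][p])
        if overlaps[best_gt][p] >= threshold:
            matches[p] = best_gt
    # Rule 1: every ground truth box keeps its best prior, even if the
    # overlap is below the threshold. Applied last so it takes priority.
    for g in range(len(gt_boxes)):
        best_prior = max(range(len(priors)), key=lambda p: overlaps[g][p])
        matches[best_prior] = g
    return matches
```

Applying the first rule last guarantees that no ground truth box is left without a prior, which is the point of having both rules.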

Hard negative mining

After the matching step, most of the prior (default) boxes are negatives. To avoid a heavy imbalance between positive and negative samples, we sort the negatives by their confidence loss and keep only the hardest ones, so that the ratio of negatives to positives is at most 3:1. This leads to faster optimization and more stable training. Once again, the localization loss is computed only over positive (non-background) priors.
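A minimal sketch of the selection step is shown below, assuming we already have a confidence loss value per prior and a flag telling which priors are positive. The function name and argument names are hypothetical.

```python
def hard_negative_mining(conf_losses, is_positive, neg_pos_ratio=3):
    """Return the indices of the negatives that are kept for the loss.

    conf_losses: confidence loss per prior.
    is_positive: True for priors matched to a ground truth box.
    """
    n_positives = sum(is_positive)
    n_negatives = neg_pos_ratio * n_positives
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Hardest negatives first: the ones the network gets most wrong
    negatives.sort(key=lambda i: conf_losses[i], reverse=True)
    return sorted(negatives[:n_negatives])


# Example: one positive prior, so at most three negatives are kept
losses = [0.1, 2.0, 0.5, 0.05, 1.5, 0.3, 0.9]
positive_flags = [True, False, False, False, False, False, False]
kept = hard_negative_mining(losses, positive_flags)
```

In this example the three negatives with the largest losses (indices 1, 4 and 6) are kept, and the easy negatives are ignored.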

Some Last Words

I hope I managed to make SSD easy to understand and grasp. I tried to use code so that you are able to visualize the process. Take your time to understand it. It would be even better if you tried to use it on your own. Next time I will be writing about the YOLO series of object detectors.

Original Paper: https://arxiv.org/pdf/1512.02325.pdf

