Object Detection Explained: YOLO v1.






Hard concepts in simple language.

Object detection consists of two separate tasks: classification and localization. This time you are reading about another family of object detectors, YOLO, which stands for You Only Look Once. Like SSD, YOLO is a single neural network that predicts bounding boxes and class probabilities directly from the input image. There are many articles that briefly explain every version of YOLO; however, I wanted to provide more technical detail, so I am breaking the series down into several articles.

Previous:

RCNN

Fast RCNN

FPN

Faster RCNN

SSD

Unified Detection

paper: https://arxiv.org/pdf/1506.02640.pdf

The YOLO system divides the input image into an S x S grid. If the center of an object falls into a grid cell, that cell is responsible for detecting it. Each grid cell predicts B bounding boxes and their confidence scores. The confidence score represents how likely a bounding box is to contain an object (of any category). Thus, each bounding box carries 5 predictions: x, y, w, h and a confidence score, where (x, y) is relative to the grid cell and (w, h) are relative to the whole image.
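For instance, the cell assignment and the cell-relative (x, y) targets can be sketched like this (a hypothetical helper for illustration, not code from the paper):

```python
# Sketch of YOLO-style cell assignment (assumed helper, not from the paper).
# All coordinates are normalized to [0, 1] relative to the image.

def assign_cell(x_center, y_center, S=7):
    """Return the (row, col) of the responsible cell and the cell-relative offsets."""
    col = int(x_center * S)
    row = int(y_center * S)
    # (x, y) targets are offsets from the cell's top-left corner, in cell units
    x_rel = x_center * S - col
    y_rel = y_center * S - row
    return row, col, x_rel, y_rel

# An object centered at (0.52, 0.30) lands in row 2, column 3 of the 7x7 grid
print(assign_cell(0.52, 0.30))
```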

The authors define the confidence score as Pr(Object) * IOU with the ground truth. So, if a cell does not contain an object, the confidence score should be zero; otherwise, it should equal the IOU between the predicted box and the ground truth box. Additionally, YOLO predicts C class probabilities per grid cell (regardless of the number of bounding boxes). Therefore, the network outputs S * S * (B * 5 + C) predictions.

For evaluating YOLO on Pascal VOC, the parameters were set as follows: S = 7, B = 2 and C = 20, meaning there are 7 * 7 * (2 * 5 + 20) predictions per image.
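The output size under these settings works out as a quick sanity check (my own arithmetic, not code from the paper):

```python
# Pascal VOC settings: grid size, boxes per cell, number of classes
S, B, C = 7, 2, 20

# Each cell predicts B boxes of (x, y, w, h, confidence) plus C class probabilities
per_cell = B * 5 + C      # 30 values per grid cell
total = S * S * per_cell  # 1470 predictions per image
print(per_cell, total)
```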

Network Design

The network consists of 24 convolutional layers and 2 fully connected layers. Also, a Leaky ReLU is used as an activation function.

Architecture:
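Below is a condensed PyTorch sketch of the network in the common config-driven style (a re-implementation under my own assumptions, not the authors' original code; tuple entries are hypothetical (kernel, filters, stride, padding) blocks, "M" is a 2x2 max-pool, and a list repeats its blocks the given number of times):

```python
import torch
import torch.nn as nn

# 24 convolutional layers total: 1 + 1 + 4 + (2*4 + 2) + (2*2 + 2) + 2
architecture_cfg = [
    (7, 64, 2, 3), "M",
    (3, 192, 1, 1), "M",
    (1, 128, 1, 0), (3, 256, 1, 1), (1, 256, 1, 0), (3, 512, 1, 1), "M",
    [(1, 256, 1, 0), (3, 512, 1, 1), 4], (1, 512, 1, 0), (3, 1024, 1, 1), "M",
    [(1, 512, 1, 0), (3, 1024, 1, 1), 2], (3, 1024, 1, 1), (3, 1024, 2, 1),
    (3, 1024, 1, 1), (3, 1024, 1, 1),
]

class YoloV1(nn.Module):
    def __init__(self, in_channels=3, S=7, B=2, C=20):
        super().__init__()
        layers, ch = [], in_channels
        for item in architecture_cfg:
            if item == "M":
                layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
                continue
            # A tuple is a single conv block; a list repeats its blocks n times
            blocks = [item] if isinstance(item, tuple) else list(item[:-1]) * item[-1]
            for k, filters, s, p in blocks:
                layers += [nn.Conv2d(ch, filters, k, s, p), nn.LeakyReLU(0.1)]
                ch = filters
        self.darknet = nn.Sequential(*layers)
        self.fcs = self._create_fcs(S, B, C)

    def _create_fcs(self, S, B, C):
        # The paper uses a 4096-unit hidden layer; 496 keeps this sketch light
        return nn.Sequential(
            nn.Flatten(),
            nn.Linear(1024 * S * S, 496),
            nn.LeakyReLU(0.1),
            nn.Linear(496, S * S * (C + B * 5)),
        )

    def forward(self, x):
        return self.fcs(self.darknet(x))
```

A forward pass on a 448x448 image yields a tensor of 1470 values, which is reshaped to 7 x 7 x 30 for training and inference.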

The code above demonstrates a PyTorch implementation of the architecture. A brief description of the CNN can be read from architecture_cfg, which lays out the 24 convolutional blocks interleaved with max-pooling layers. Notice that some pairs of blocks are repeated four or two times. Finally, the _create_fcs method builds the 2 fully connected layers with the Leaky ReLU activation function. As stated, the architecture outputs S * S * (C + B * 5) predictions.

Training

Loss function. Paper: https://arxiv.org/pdf/1506.02640.pdf

The figure above shows the loss function proposed by the authors. As seen, there are two hyperparameters, lambda_coord and lambda_noobj, set to 5 and 0.5 respectively. With lambda_noobj, the authors counteract the large number of cells that contain no object, which would otherwise overpower the gradient and make the model unstable. Also, the square roots of the width and height are used, so that small deviations in large boxes matter less than the same deviations in small boxes.
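As a rough illustration of the square-root trick (toy numbers of my own, not from the paper's experiments), the same absolute width error is penalized more for a small box than for a large one:

```python
import math

lambda_coord = 5  # coordinate-loss weight from the paper

def width_term(w_pred, w_true):
    # The loss compares square roots of widths/heights instead of raw values
    return lambda_coord * (math.sqrt(w_pred) - math.sqrt(w_true)) ** 2

small_box = width_term(0.10, 0.05)  # small box, width off by 0.05
large_box = width_term(0.90, 0.85)  # large box, width off by the same 0.05
print(small_box > large_box)  # the small box is penalized more
```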

Some Last Words

YOLO v1 is a simple and fast object detector suited to applications that rely on real-time predictions. However, the architecture has several limitations. For example, each grid cell can only predict one class, and the model has trouble generalizing to objects in unusual configurations. We will therefore walk through the whole YOLO series and see how it improved. Take your time to understand the code; I left comments so it is easier to follow along. Also, take a look at the original paper and make sure you understand it clearly. Thank you!

Original Paper: https://arxiv.org/pdf/1506.02640.pdf
