The Enigma of Real-time Object Detection and its practical solution

Ever wondered how self-driving cars work? How do they see the road the way a human driver does?

What is Object Detection?

Object Detection is the technique of identifying and localising objects in a given image.

Sounds fancy, doesn’t it? You need to know about the basics first:

  • Object Classification
  • Object Localisation

What is Object Classification?

Assigning a given image to one of a set of predefined categories is called Object Classification.

Given an image, telling if it’s a car, bike or pedestrian is an apt example of Object Classification.

This image will be classified as a Car

For binary classification:

In binary classification, we have two classes, positive and negative.

Output ŷ = probability that the given image belongs to the positive class. If ŷ ≥ a given threshold value, we conclude that the image contains the object; otherwise, we conclude that it does not.

(Uses Logistic output unit, range = [0,1])
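The thresholding step can be sketched as follows. This is a minimal illustration, assuming the network produces a raw score z that a logistic (sigmoid) unit squashes into a probability; the function names and the default threshold of 0.5 are illustrative choices, not taken from the article.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic unit: maps any real score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def classify_binary(z: float, threshold: float = 0.5) -> bool:
    """Return True (positive class) when y_hat >= threshold, else False."""
    y_hat = sigmoid(z)
    return y_hat >= threshold
```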

For Multi-class classification:

In multi-class classification, we have more than 2 classes to classify.

Output ŷ = an array containing the probabilities of all the classes. If any class has a probability ≥ a given threshold value, we classify the image as that class. If multiple classes exceed the threshold, the class with the maximum probability is chosen.

(Uses Softmax output unit, range = [0,1])
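A minimal sketch of this decision rule, assuming the network emits one raw score per class; the function names, the label list and the default threshold of 0.5 are illustrative assumptions.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify_multiclass(scores, labels, threshold=0.5):
    """Return the most probable label if it clears the threshold, else None."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best] if probs[best] >= threshold else None
```

Picking the maximum first and then checking it against the threshold covers both cases in the text: if several classes clear the threshold, the most probable one wins.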

Loss function is defined as:

(m = number of training instances, y = actual label for the image, ŷ = predicted label for the image)
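The symbols above fit the cross-entropy loss that is commonly paired with logistic and softmax outputs. Assuming that is the loss meant here (the function name and the one-hot label encoding are illustrative assumptions), a minimal sketch:

```python
import math

def cross_entropy_loss(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over m training instances.

    y_true: list of one-hot label lists (assumed encoding).
    y_pred: list of predicted probability lists, same shape as y_true.
    """
    m = len(y_true)
    total = 0.0
    for y, y_hat in zip(y_true, y_pred):
        # Only the true class contributes because y is one-hot.
        # eps guards against log(0).
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(y, y_hat))
    return total / m
```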

What is Object Localisation?

You have just finished categorising your image, but your geek brain, being what it is, is not satisfied: it also wants the location of the object in the image.

Well, there’s a solution for that. Object Localisation draws a ‘Bounding Box’ around the identified object in the image.

Object Localisation is very similar to Object Classification. It differs only in the output layer, which also outputs the location and dimensions of the bounding box.

There’s a bounding box around the car

Output ŷ = an array containing the probability that any of the classes is present in the image, the Softmax predictions for the individual classes, the coordinates of the centre of the bounding box and the dimensions of the bounding box. (Note: measured in pixels, the centre coordinates and dimensions of the bounding box can be greater than 1, which an output unit with range [0,1] cannot produce. So instead of absolute measurements, we use measurements relative to the image size.)

For example, the height of the bounding box can be defined as X × (height of the given image), where X is the value output by the network.
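Converting such relative values back to pixels is a matter of multiplying by the image size. A minimal sketch, assuming the centre and size are all expressed as fractions of the image width and height; the function name and the corner-coordinate return format are illustrative choices.

```python
def box_to_pixels(bx, by, bh, bw, img_h, img_w):
    """Convert a relative (centre, size) box to pixel corner coordinates."""
    cx, cy = bx * img_w, by * img_h   # centre in pixels
    h, w = bh * img_h, bw * img_w     # height and width in pixels
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)  # x1, y1, x2, y2
```

For a 400 x 200 image (width x height), a box centred at (0.5, 0.5) with relative height 0.5 and relative width 0.25 is 100 pixels tall and 100 pixels wide.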

Consider an example where you have 3 classes: car, bike and pedestrian. The output array then has 8 elements:

ŷ = [P0, Pc, Pb, Pp, Bx, By, Bh, Bw]

(P0 = Probability that the image belongs to one of the given classes, Pc = Probability of the image being of class car, Pb = Probability of the image being of class bike, Pp = Probability of the image being of class pedestrian, Bx = X-coordinate of the centre of the bounding box, By = Y-coordinate of the centre of the bounding box, Bh = Height of the bounding box, Bw = Width of the bounding box)
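Reading a prediction out of this array can be sketched as below. This assumes the eight values are laid out in the order listed above and that a P0 below 0.5 means "no object"; both the layout and the threshold are assumptions made for illustration.

```python
CLASSES = ["car", "bike", "pedestrian"]

def decode_output(y_hat, threshold=0.5):
    """Split [P0, Pc, Pb, Pp, Bx, By, Bh, Bw] into a label and a relative box."""
    p0, class_probs, box = y_hat[0], y_hat[1:4], y_hat[4:8]
    if p0 < threshold:
        return None  # no object: the remaining values are ignored
    best = max(range(len(CLASSES)), key=lambda i: class_probs[i])
    return CLASSES[best], tuple(box)
```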

The loss function for each training instance is defined as:

(n = size of output array)

The cumulative loss is:

(m = number of training instances)
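One common choice for this loss is a squared error over the output array that counts only the first element (P0) when no object is present. Assuming that form, and assuming the cumulative loss is the mean over instances (the article states neither explicitly), a sketch:

```python
def instance_loss(y, y_hat):
    """Squared-error loss for one training instance (assumed form).

    If no object is present (y[0] == 0), only the P0 term is penalised,
    because the class and box values are meaningless in that case.
    """
    if y[0] == 0:
        return (y_hat[0] - y[0]) ** 2
    return sum((p - t) ** 2 for t, p in zip(y, y_hat))  # all n elements

def cumulative_loss(ys, y_hats):
    """Average the per-instance loss over the m training instances."""
    m = len(ys)
    return sum(instance_loss(y, y_hat) for y, y_hat in zip(ys, y_hats)) / m
```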

So, what’s the difference between Object Localisation and Object Detection?

Object Localisation is limited to a single object instance in an image, whereas Object Detection supports multiple objects in a single image. Generally speaking, Object Detection is just Object Localisation on steroids.

Multiple bounding boxes in an image

Implementation of Object Detection

We can implement Object Detection using the good old Sliding Window approach. A window of a defined size is placed over the image, and the area of the image enclosed by the window is passed through an Object Localisation network. The window is then moved iteratively across all parts of the image. Sometimes, multiple windows of different dimensions are used. If the network detects an object in any window, it draws a bounding box around it.
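The loop described above can be sketched as follows. This is a naive illustration, assuming the image is a nested list of pixel rows and that `localiser` is a hypothetical callable standing in for the Object Localisation network, returning a box or None.

```python
def sliding_windows(img_h, img_w, win, stride):
    """Yield the (top, left) corner of every win x win window on the image."""
    for top in range(0, img_h - win + 1, stride):
        for left in range(0, img_w - win + 1, stride):
            yield top, left

def detect(image, win, stride, localiser):
    """Run the localiser on each crop; collect results from positive windows."""
    boxes = []
    for top, left in sliding_windows(len(image), len(image[0]), win, stride):
        crop = [row[left:left + win] for row in image[top:top + win]]
        result = localiser(crop)  # hypothetical network call: box or None
        if result is not None:
            boxes.append((top, left, result))
    return boxes
```

Every window costs one full pass through the network, which is why this approach slows down quickly as the image grows or the stride shrinks.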

That is the implementation of Object Detection… wait. Read the title of the article again. This is the real world, and our good old Sliding Window is pretty slow. So what do we do now?

Well, there’s a solution to accelerate it by using the Convolutional approach to the Sliding Window.

Before proceeding with that, you should know how to implement Fully-Connected layers as Convolutional layers.

Convolutional implementation of Fully-Connected layers

The figure above shows the architecture of a Convolutional neural network with a Fully-Connected output layer containing 4 nodes.

Its Convolutional equivalent:

This implementation cuts down the computational work when the network is applied to larger images, because a Convolutional layer reuses the same parameters at every spatial position.
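The equivalence rests on one observation: a Fully-Connected node applied to a flattened H x W x C feature map computes the same weighted sum as a single H x W x C filter applied to that map. A small pure-Python sketch of this, with illustrative function names and a row-major flattening order assumed:

```python
def fully_connected(feature_map, weights):
    """Flatten an H x W x C feature map and apply one weight vector per node."""
    flat = [v for row in feature_map for pixel in row for v in pixel]
    return [sum(w * x for w, x in zip(node_w, flat)) for node_w in weights]

def reshape_to_filter(node_w, h, w, c):
    """Reshape a flat weight vector into an H x W x C filter (row-major)."""
    return [[[node_w[(i * w + j) * c + k] for k in range(c)]
             for j in range(w)] for i in range(h)]

def conv_full_size(feature_map, filters):
    """Apply filters as large as the input, giving one value per filter."""
    h, w, c = len(feature_map), len(feature_map[0]), len(feature_map[0][0])
    return [
        sum(f[i][j][k] * feature_map[i][j][k]
            for i in range(h) for j in range(w) for k in range(c))
        for f in filters
    ]
```

On an input of exactly the filter size the two give identical outputs. On a larger input, the convolution simply produces one such output per window position.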

Convolutional approach to Sliding Window

Suppose you define a 14 x 14 window as shown above and run this window over the whole image iteratively. This gives us the naive implementation of the Sliding Window approach.

Instead of iteratively picking out every 14 x 14 region, we can apply the same network to the whole image. It gives us a 2 x 2 x 4 output, which is the output of 4 windows at once. This cuts down the iterations needed, i.e. we get the output in a single pass through the network.

In another example, applying the window network to the whole 28 x 28 image gives us an 8 x 8 x 4 output, i.e. the output of 64 windows in a single go.
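The 8 x 8 figure is consistent with neighbouring windows being 2 pixels apart. That effective stride is inferred from the numbers above rather than stated in the article, so treat it as an assumption in this sketch of the arithmetic:

```python
def num_windows(image_size, window_size, stride):
    """Number of window positions along one side of a square image."""
    return (image_size - window_size) // stride + 1

side = num_windows(28, 14, 2)   # 8 positions per side
total = side * side             # 64 windows, each with 4 outputs
```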

This approach is pretty fast and apt for the real world too, but it has its problems.

What if the defined size of the window doesn’t fit the object? Using windows of multiple sizes may help here, but it would jeopardize our goal of being fast enough for a real-world implementation.

The You Only Look Once (YOLO) algorithm and how it works

To overcome the shortcomings of the Convolutional Sliding Window approach, YOLO divides the given image into a grid with a fixed number of cells.

Let’s say, for ease, we take a 3 x 3 grid. (Note: a finer grid generally lets the network localise objects more precisely.)

The Object Localisation network runs on every grid cell individually. Each cell for which the network predicts positive outputs a bounding box around the predicted object.
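In YOLO, the cell responsible for an object is the one that contains the centre of that object. A minimal sketch of that assignment, assuming centre coordinates expressed relative to the image in the range [0, 1]; the function name and the zero-based (row, col) convention are illustrative.

```python
def grid_cell(bx, by, grid=3):
    """Return the (row, col) of the grid cell containing a relative centre."""
    col = min(int(bx * grid), grid - 1)  # clamp so bx == 1.0 stays in range
    row = min(int(by * grid), grid - 1)
    return row, col
```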

As you can see in the above image, the network predicted two objects, one in cell 2 and another in cell 8. The algorithm can also produce multiple bounding boxes for a single object. We have a solution for that too.

Non-Max Suppression

Initially, the bounding boxes with a confidence ≤ 0.5 are eliminated. For the remaining boxes, IoU (Intersection over Union) is calculated separately for each class. To calculate IoU, we take the bounding box with maximum confidence for the given class and compare it with any other box of that class: we calculate the area of Intersection between the two boxes and the area of their Union, then divide the Intersection by the Union. If the score is above the given threshold value, the box with less confidence is eliminated; if the score is below or equal to the threshold, both boxes remain.

This leaves a single box for each object: the one with maximum confidence.
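The procedure can be sketched as follows for the boxes of one class. The box format (x1, y1, x2, y2), the function names and the default IoU threshold of 0.5 are assumptions made for illustration; the 0.5 confidence cut-off comes from the text above.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_max_suppression(dets, conf_thresh=0.5, iou_thresh=0.5):
    """dets: list of (box, confidence) for ONE class. Returns kept detections."""
    # Drop boxes with confidence <= conf_thresh, then sort best-first.
    remaining = sorted((d for d in dets if d[1] > conf_thresh),
                       key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Keep only boxes whose overlap with the best box is <= iou_thresh.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_thresh]
    return kept
```

Repeating the comparison with the next most confident survivor, rather than stopping after the first, is what handles several distinct objects of the same class in one image.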

How good is YOLO?

YOLO is a very efficient algorithm and a practical real-time implementation of Object Detection. It is used in state-of-the-art Object Detection systems and has proved its worth.

There is also an advanced method, Region-based Convolutional neural networks (R-CNN), which I didn’t cover in this article. You can read about R-CNN in the YOLO research paper[2].


The ability to detect objects in real time has proved very valuable to industry and is being applied to all kinds of tasks, such as self-driving cars, geo-satellite scanning and aids for people with weak eyesight.


