Comprehend Dropout: Deep Learning by doing toy examples

Dropout is one of the main regularization techniques in deep neural networks. This story helps you deeply understand what Dropout is and how it works.

Fully Connected network (Created by Author)

In Deep Learning, especially in Object Detection, overfitting can easily happen. Overfitting means the model is so complex that it fits the training set very well but fails on the test set; a failing detector may, for example, flag mere noise in a test image as an object.

In Object Detection, it is common to train with a pretrained backbone or to continue training from a pretrained model. In such setups, the validation loss often starts rising above the training loss after a few epochs. In those cases, adding a Dropout layer is helpful.

In Pytorch, we can add a Dropout layer simply by:

from torch import nn
dropout = nn.Dropout(p=0.2)

But what happens under the hood?
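Before working through the math, a quick experiment shows the two behaviors of `nn.Dropout`: during training it zeroes each element with probability `p` and scales the survivors by `1/(1-p)` (so the expected activation is unchanged), while in evaluation mode it is the identity. The tensor values below are just a toy input:

```python
import torch
from torch import nn

torch.manual_seed(0)          # reproducible mask
dropout = nn.Dropout(p=0.2)

x = torch.ones(8)

# Training mode: each entry is zeroed with probability p = 0.2;
# surviving entries are scaled by 1/(1 - 0.2) = 1.25.
dropout.train()
print(dropout(x))             # every entry is either 0.0 or 1.25

# Evaluation mode: dropout is a no-op, the input passes through unchanged.
dropout.eval()
print(dropout(x))             # all entries remain 1.0
```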

The Dropout Regularization Scheme

The Dropout technique creates a sub-network of the original network by randomly selecting some neurons in the hidden layers. The selection is done by resampling the nodes (only the nodes in the hidden layers) and defining binary masks over them.

Dropout is not applied to bias nodes! Dropout is a regularization technique whose idea is to reduce overfitting caused by the weights. Since bias nodes don't receive any input, dropping them out would not help improve the predictions.
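The point above can be sketched in code. The layer below is hypothetical (the weights, bias, and input are made-up values): the mask is sampled over the hidden nodes only, so the bias still contributes to every node that survives, and survivors are rescaled by 1/(1-p) (so-called inverted dropout):

```python
import torch

torch.manual_seed(0)
p = 0.2

# Hypothetical hidden layer with 4 nodes and 3 inputs (illustrative values).
W = torch.randn(4, 3)   # weights -- what regularization targets
b = torch.randn(4)      # bias -- never masked by dropout
x = torch.ones(3)

h = torch.relu(W @ x + b)        # hidden activations, bias included

# Mask the hidden *nodes*, not the bias: a node is kept with
# probability 1 - p and its activation is scaled by 1/(1 - p).
mask = (torch.rand(4) > p).float() / (1 - p)
h_dropped = h * mask
```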

Toy Example

Consider the following fully connected network, where the activation function is ReLU. (To see how the network relates to matrices, watch the GIF above.)

Dropout toy example

In this example, the goal is to predict for x = (1, 1) using the following dropout masks.

Dropout mask

To compute the prediction, we should calculate the values in the hidden layer.

The first hidden layer:

where g⁽¹⁾ is the ReLU activation function, then

Note: The first dropout layer, μ⁰, is one for all nodes. Therefore, it doesn’t have any effect on the result.

The second hidden layer:

where g⁽²⁾ is the ReLU activation function, then

The third hidden layer:

where g⁽³⁾ is the ReLU activation function, then

The output layer:


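The layer-by-layer computation above can be sketched as a short script. The article's actual weight matrices and masks are shown only in the figures, so the numbers below are illustrative stand-ins; the structure, however, matches the scheme: the input mask μ⁰ is all ones, each hidden layer applies ReLU and is then multiplied by its mask, and the output layer combines the last masked activations:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Illustrative stand-ins for the figures' values (not the article's numbers).
x  = np.array([1.0, 1.0])

W1 = np.array([[1.0,  1.0],
               [0.5,  0.5]])
W2 = np.array([[ 1.0, 1.0],
               [-1.0, 2.0]])
W3 = np.array([[2.0, -0.5],
               [0.5,  1.0]])
w_out = np.array([1.0, 1.0])

mu0 = np.array([1.0, 1.0])   # input mask: all ones, no effect
mu1 = np.array([1.0, 0.0])   # drops the 2nd node of hidden layer 1
mu2 = np.array([1.0, 0.0])   # drops the 2nd node of hidden layer 2
mu3 = np.array([1.0, 1.0])   # keeps hidden layer 3 intact

h1 = relu(W1 @ (mu0 * x)) * mu1   # first hidden layer, then mask
h2 = relu(W2 @ h1) * mu2          # second hidden layer, then mask
h3 = relu(W3 @ h2) * mu3          # third hidden layer, then mask
y  = w_out @ h3                   # output layer

print(y)  # 5.0 with these stand-in values
```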
Compute the prediction if the dropout masks are:

The final answer is:


This story helps you understand the dropout technique. I plan to add more toy examples like this in Machine Learning and Deep Learning, so stay tuned if you'd like to read more 😊.

