Techniques for handling underfitting and overfitting in Machine Learning


Photo by Pietro Jeng on Unsplash

I’ll be talking about various techniques that can be used to handle overfitting and underfitting in this article. I’ll briefly discuss underfitting and overfitting, followed by the discussion about the techniques for handling them.


In one of my earlier articles, I talked about the bias-variance trade-off. We discussed how bias and variance relate to model complexity and what underfitting and overfitting look like. I would encourage you to read that article if you are unfamiliar with these terms.

For a quick recap let us look at the following figure.

Source: Underfitting, Optimal-fitting and Overfitting for linear regression [1]

Underfitting happens when the model has a very high bias and is unable to capture the complex patterns in the data. This leads to higher training and validation errors since the model is not complex enough to model the underlying data. In the above example, we see that the data has a second-order relation but the model is linear, so it cannot capture that relation.

Overfitting is the opposite in the sense that the model is too complex (or of too high an order) and captures even the noise in the data. Therefore, in this case one would observe a very low training error. However, the model would fail to generalise to both the validation and test sets.

We want to find the optimal fitting situation, where both the training and validation errors are low and the gap between them is small. Such a model should generalize better than in the other two cases.
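The three situations can be reproduced on toy data. The following is a minimal sketch, assuming synthetic second-order data and polynomial models of degree 1, 2 and 10 fitted with NumPy; the data generator, the noise level and the chosen degrees are illustrative assumptions, not taken from the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a second-order relation plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = 1.5 * x**2 + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(20)
x_val, y_val = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of the polynomial with the given coefficients.
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_val, y_val))
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.4f}  "
          f"val MSE={errors[degree][1]:.4f}")
```

The degree-1 model is expected to show high error on both sets (underfitting). The training error can only go down as the degree grows, so comparing it against the validation error is what reveals whether the extra capacity is fitting signal or noise.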

How to handle underfitting

  1. In this situation, the best strategy is to increase the model complexity by either increasing the number of parameters of your deep learning model or the order of your model. Underfitting is due to the model being simpler than needed. It fails to capture the patterns in the data. Increasing the model complexity will lead to improvement in training performance. If we use a large enough model it can even achieve a training error of zero i.e. the model will memorize the data and suffer from over-fitting. The goal is to hit the optimal sweet spot.
  2. Try to train the model for more epochs. Ensure that the loss is decreasing gradually over the course of the training. Otherwise, it is highly likely that there is some kind of bug or problem in the training code/logic itself.
  3. If you aren’t shuffling the data after every epoch, it can harm the model performance. Ensuring that you are shuffling the data is a good check to perform at this point.
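The shuffling check in the last point can be sketched as follows. This is a minimal illustration assuming the data fits in NumPy arrays; `iterate_minibatches` is a hypothetical helper name, not part of any library.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    # Draw a fresh random order on every call, i.e. once per epoch,
    # and index X and y with the same order so they stay aligned.
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], y[idx]

rng = np.random.default_rng(0)
X = np.arange(10, dtype=float).reshape(-1, 1)
y = np.arange(10)

epoch_orders = []
for epoch in range(3):
    seen = []
    for X_batch, y_batch in iterate_minibatches(X, y, batch_size=4, rng=rng):
        # A real training step would go here.
        seen.extend(int(label) for label in y_batch)
    epoch_orders.append(seen)
    print(f"epoch {epoch}: {seen}")
```

Every sample is still visited exactly once per epoch; only the order and the batch composition change between epochs.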

How to handle overfitting

In contrast to underfitting, there are several techniques available for handling overfitting. Let us look at them one by one.

1. Get more training data: Although getting more data may not always be feasible, getting more representative data is extremely helpful. Having a larger diverse dataset usually aids in the model performance. You can get a better model that may generalize better. This means that the performance of the model on unseen data (true test set) will be better.

2. Augmentation: If you can’t get more data, you can try augmentation to add variation to your data. Augmentation means artificially modifying your existing data by means of transforms that resemble the variation you might expect in the real data. For image data, comprehensive augmentation libraries provide a large number of augmentation methods and let you compose powerful sequences of augmentations quickly and efficiently. For further reading, Olga Chernytska has a detailed article on image augmentation, and Valentina Alto nicely explains how image augmentation is done in Keras.
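To illustrate the idea without depending on a particular library, here is a minimal sketch in plain NumPy that combines a random horizontal flip with a small random shift. The function name `augment`, the flip probability and the padding size are assumptions chosen for illustration; a dedicated augmentation library would offer many more transforms.

```python
import numpy as np

def augment(image, rng, pad=2):
    """Randomly flip and shift an image of shape (H, W, C)."""
    out = image
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Random shift: reflect-pad the image, then crop a window of the
    # original size at a random offset.
    padded = np.pad(out, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    h, w = image.shape[:2]
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
augmented = augment(image, rng)
print(augmented.shape)
```

The transform is applied on the fly during training, so the model sees a slightly different version of each image in every epoch while the labels stay unchanged.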

3. Early stopping [2,3]: Early stopping is a form of regularization to avoid overfitting when training a learner with an iterative method, such as gradient descent [2]. While training neural networks, we iteratively use gradients from the training data to make the model better approximate the underlying real-world function. Early stopping allows you to stop at or near the optimal fitting point, thereby preventing overfitting to the training set and reducing generalization error. To decide when to stop, we can monitor metrics such as the loss, test_accuracy, or val_accuracy, and stop the training once certain conditions are met, for example when the validation loss has not improved for several epochs.
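One common stopping condition is a patience rule: stop when the monitored validation loss has not improved for a given number of epochs. The sketch below assumes the validation losses are already available as a list; the function name `early_stopping_epoch` and the `patience` and `min_delta` parameters are illustrative choices, not a specific framework's API.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # Improvement: remember it and reset the counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    # Never triggered: training ran through all epochs.
    return len(val_losses) - 1, best_epoch

val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.60]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)  # prints "6 3"
```

In practice one would also save the model weights at `best_epoch` and restore them after stopping, since the last epochs before the stop are already past the best point.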

4. Regularization L1, L2: Regularization is an additional term that is added to the loss function to impose a penalty on large network parameter weights to reduce overfitting. L1 and L2 regularization are two widely used techniques. Although both penalize large weights, they achieve the regularization differently.
L1 Regularization: L1 regularization adds a scaled version of the L1 norm of the weight parameters to the loss function. The equation for L1 regularization is:

$$L_{reg} = E(W) + \lambda \lVert W \rVert_1$$

where Lreg = regularized loss, E(W) = error term, λ is a hyperparameter, ||W||₁ is the L1 norm of the weights
Now, even if the error term were zero, as long as the weights are non-zero we would still have a positive Lreg value. Since the objective of the optimization problem is to minimise Lreg, setting weights to zero leads to a lower loss, and more zeros in the weights means more sparsity. There are geometric interpretations that show why L1 regularization favours sparse solutions. You can watch or read the following video and articles:

  1. Sparsity and the L1 Norm
  2. Regularization for Sparsity: L₁ Regularization
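To make the L1 penalty concrete, the following minimal sketch evaluates the regularized loss for two weight vectors while assuming the error term is zero in both cases. The weight values and λ = 0.1 are made up for illustration.

```python
import numpy as np

def l1_regularized_loss(error_term, weights, lam):
    # L_reg = E(W) + lambda * ||W||_1
    return error_term + lam * np.sum(np.abs(weights))

lam = 0.1
weights = np.array([0.5, -2.0, 0.0, 1.5])
sparser_weights = np.array([0.0, -2.0, 0.0, 1.5])  # smallest weight set to zero

loss_dense = l1_regularized_loss(0.0, weights, lam)
loss_sparse = l1_regularized_loss(0.0, sparser_weights, lam)
print(loss_dense, loss_sparse)
```

With the error term held equal, the vector with one more zero has the lower regularized loss, which is the pressure towards sparsity described above. In a real model, zeroing a weight also changes the error term, so the optimizer trades the two off.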

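L2 Regularization: We add the squared L2 norm of the weights to the cost/loss/objective function. The equation for L2 regularization is as follows:

$$L_{reg} = E(W) + \frac{\lambda}{2} \lVert W \rVert_2^2$$

where Lreg = regularized loss, E(W) = error term, λ is a hyperparameter known as regularization rate, ||W||₂ is the L2 norm of the weights. The factor 1/2 is a common convention that makes the derivative of the penalty equal to λW.
The derivative of this equation leads to the following weight update equation during the optimization:

$$W \leftarrow (1 - \eta\lambda)\, W - \eta \frac{\partial E(W)}{\partial W}$$

where η is the learning rate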

We see that the old weights are scaled by (1-ηλ), i.e. they are decayed with every gradient update. L2 regularization, therefore, leads to smaller weights. It is also sometimes referred to as weight decay because of this. For a detailed explanation, I would strongly recommend you read this article from the Google machine learning crash course: Regularization for Simplicity: L₂ Regularization
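The equivalence between an L2 penalty and weight decay can be checked numerically. This sketch assumes plain gradient descent and a penalty of (λ/2)·||W||₂², the convention under which the decay factor is exactly (1-ηλ); the function names and the numbers are illustrative.

```python
import numpy as np

def l2_regularized_step(W, grad_E, lr, lam):
    # Gradient of E(W) + (lam / 2) * ||W||_2^2 is grad_E + lam * W.
    return W - lr * (grad_E + lam * W)

def weight_decay_step(W, grad_E, lr, lam):
    # Same update written as "decay the old weights, then step on E".
    return (1.0 - lr * lam) * W - lr * grad_E

W = np.array([1.0, -2.0, 0.5])
grad_E = np.array([0.1, 0.0, -0.3])  # hypothetical gradient of the error term
lr, lam = 0.1, 0.01

step_a = l2_regularized_step(W, grad_E, lr, lam)
step_b = weight_decay_step(W, grad_E, lr, lam)
print(np.allclose(step_a, step_b))  # prints "True"
```

This identity holds for plain gradient descent. With adaptive optimizers the two formulations are in general no longer the same update.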

5. Dropout [4]: The main idea of this technique is to randomly drop units from the neural network during training. It was presented in the following paper: Dropout: A Simple Way to Prevent Neural Networks from Overfitting (2014) by Srivastava et al. During training, the random dropout samples form a large number of different “thinned” networks. The dropout is achieved by constructing a mask matrix whose entries are drawn from a Bernoulli distribution with probability p of being 1 (i.e. the dropout probability is 1-p) and then doing an element-wise multiplication with the outputs of the hidden layers. The following figure shows dropout during the training phase.

Source: [4]

Dropout discourages neurons from relying too much on other neurons, so each one has to learn something meaningful by itself. Dropout can be applied after convolutional, pooling, or fully-connected layers.

One more thing to keep in mind is that, since each neuron is active only with probability p during training rather than all the time, the weights need to be scaled by p at inference time. You can read more about the scaling requirement in the paper or in this article: CS231n Convolutional Neural Networks for Visual Recognition

Source: [4]
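The masking and the inference-time scaling can be sketched in NumPy as follows. Here `p_keep` plays the role of p, the probability of keeping a unit. Scaling a layer's outputs by p is used as a stand-in for scaling its outgoing weights by p, which gives the same result for the next layer. The function names are illustrative.

```python
import numpy as np

def dropout_train(h, p_keep, rng):
    # Bernoulli mask: 1 keeps a unit, 0 drops it.
    mask = rng.binomial(1, p_keep, size=h.shape)
    return h * mask, mask

def dropout_inference(h, p_keep):
    # No units are dropped; outputs are scaled to match the training-time
    # expectation.
    return h * p_keep

rng = np.random.default_rng(0)
h = np.ones(100_000)  # hypothetical hidden-layer outputs
p_keep = 0.8

h_train, mask = dropout_train(h, p_keep, rng)
h_test = dropout_inference(h, p_keep)
print(h_train.mean(), h_test.mean())
```

The two means should be close, which is the point of the scaling: the next layer sees inputs of the same expected magnitude in training and at inference. Many frameworks instead divide by p during training (inverted dropout) so that inference needs no scaling.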

6. DropConnect: This technique is like taking dropout to the next level. Instead of randomly dropping nodes, we randomly drop weights. So instead of turning off all the connections of a node, we cut off certain random connections. This means that a fully connected layer with DropConnect becomes a sparsely connected layer in which connections are chosen at random during the training stage [5]. For a DropConnect layer the output is given as:

$$r = a\big((M \odot W)\, v\big)$$

where r is the output of the layer, v is the input to the layer, W are the weight parameters, a is the activation function, ⊙ denotes element-wise multiplication, and M is the mask matrix that cuts off random connections, with entries drawn from a Bernoulli distribution with probability p. Each element of the mask M is drawn independently for each example during training. As a result of these random connection drops, we get dynamic sparsity in our network weights, leading to a reduction in over-fitting.
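A forward pass of such a layer during training can be sketched in NumPy. This is a minimal illustration assuming a tanh activation and one independent mask per example; `p_keep` is taken to be the probability that a connection is kept, and the function name and shapes are assumptions for the example.

```python
import numpy as np

def dropconnect_forward(v, W, p_keep, rng, activation=np.tanh):
    """v: inputs of shape (batch, d_in); W: weights of shape (d_out, d_in)."""
    # One Bernoulli mask per example, with the same shape as W.
    M = rng.binomial(1, p_keep, size=(v.shape[0],) + W.shape)
    masked_W = M * W  # shape (batch, d_out, d_in)
    # Per-example matrix-vector product with the masked weights.
    pre_activation = np.einsum("boi,bi->bo", masked_W, v)
    return activation(pre_activation), M

rng = np.random.default_rng(0)
v = rng.normal(size=(4, 3))
W = rng.normal(size=(2, 3))

r, M = dropconnect_forward(v, W, p_keep=0.5, rng=rng)
r_full, _ = dropconnect_forward(v, W, p_keep=1.0, rng=rng)
print(r.shape, M.shape)
```

With `p_keep=1.0` no connection is dropped and the layer reduces to an ordinary fully connected layer, which is a convenient sanity check.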


[1] Pinterest Image

[2] Early stopping


[4] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1 (January 2014), 1929–1958.

[5] Papers with Code — DropConnect Explained

