High Variance to High Bias via “Perfection”

The data science community has access to many platforms that host predictive modeling problems, which has made it easier for beginners to become proficient in the field. This article is not about those platforms. It is about something that helps us end the journey with an “optimal” model. Here, “optimal” means that the accuracy of the model is close to the base accuracy.

The most common problems we face while training a model are overfitting and underfitting the data. We have some power to control them, but in deep learning the workings of these controls are unfamiliar to many of us. This article is therefore not just about the methods for dealing with these problems. It is about why the techniques are so powerful and what they do behind the scenes.

Just for a smooth start, let us define some terms used for a predictive modeling problem for classifying dogs from a set of images.

Different terms used for types of performance of a model (Published by author)

We have assumed the base error to be zero because humans can identify dogs with an error of 0%, or very close to it. To follow this article, it is necessary to be familiar with these terms (prerequisites!).

To understand these terms in depth, let us look at the decision boundary, or function, mapped by each model, because to solve any problem we first need to understand it.

We make one small assumption: there are only two features, so the data points can be plotted in a 2-D plane.

Decision boundary for three different models describing different problems with data fitting (Published by author)

Optimal Model

This model fits the given data points accurately while still tolerating the noise in the data. The function mapped by this model is neither very complex nor very simple. Let us assume it is comparable to a parabolic function, where the train and test set errors are both close to the base error.


Underfitted Model

When both the train set error and the test set error are far from the base error, the model is said to have underfitted the data. The function mapped by this model is quite simple, like a linear function, and hence less complex than a parabolic function.


Overfitted Model

When the test set error is far from the base error while the train set error is not, the model is said to have overfitted the data. This model has mapped a function that is much more complex than a parabola.
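The three cases can be illustrated with a small regression analogue. The sketch below is not the dog classifier from the figures: it assumes hypothetical 1-D data generated from a parabola plus noise, and fits polynomials of degree 1, 2, and 9 as stand-ins for a too-simple, a comparable, and a too-complex function.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical data: a parabola plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x**2 + rng.normal(scale=0.1, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def mse(coeffs, x, y):
    # Mean squared error of a fitted polynomial on (x, y).
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

errors = {}
for degree in (1, 2, 9):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train),
                      mse(coeffs, x_test, y_test))
    print(f"degree {degree}: train={errors[degree][0]:.4f} "
          f"test={errors[degree][1]:.4f}")
```

The linear fit should show a large error on both sets, which is the underfitting pattern. The degree-9 fit can only lower the train error further, so the number to watch is the gap between its train and test errors.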

These observations about the mapped functions are vital for understanding the problems with fitting the data. In a deep neural network, we have several hyperparameters to tune. Some of them, such as the number of hidden layers and the number of hidden units in a hidden layer, determine the complexity of the function that the model maps onto the given data points.

Let us assume that we have trained a very deep neural network, that it overfits the data, and that we are using the L2 regularization technique to reach an optimal fit.

L2 Regularization

The cost function is defined as follows:

The cost function defined in gradient descent for a binary classification problem (Published by author)

where J is the cost function, ŷ (y with a hat) is the predicted value, and y is the actual value of the target feature for the ith training example. In L2 regularization, we add a regularization term to the cost function. So, J becomes:

The cost function with a regularization term (Published by author)

where lambda (λ) is called the regularization parameter, and is therefore another hyperparameter, and the term in the summation is the square of the L2 norm of the weight matrix (for a matrix, this is also known as the Frobenius norm). As a recap, the definition of the L2 norm is given in the image above.
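As a rough sketch of the two cost functions, the code below assumes the common convention of binary cross-entropy averaged over m examples plus a penalty of (λ / 2m) times the sum of squared weights over all layers. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    # Binary cross-entropy averaged over the m training examples.
    m = y.shape[0]
    eps = 1e-12  # avoids log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    losses = y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)
    return float(-np.sum(losses) / m)

def l2_regularized_cost(y_hat, y, weights, lam):
    # Adds (lam / 2m) * sum of squared entries of every weight matrix.
    m = y.shape[0]
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy_cost(y_hat, y) + float(penalty)

# Toy values (assumptions, not taken from the article).
y = np.array([1.0, 0.0, 1.0, 1.0])
y_hat = np.array([0.9, 0.2, 0.7, 0.6])
weights = [np.array([[1.0, -2.0], [0.5, 0.0]]), np.array([[3.0, -1.0]])]

base = cross_entropy_cost(y_hat, y)
reg = l2_regularized_cost(y_hat, y, weights, lam=0.1)
print(base, reg)
```

With lam set to 0 the two costs coincide, and larger weights or a larger lam push the regularized cost up, which is what drives the optimizer toward smaller weights.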

To understand the effect of this regularization term, let us calculate the gradients, which we calculate during the process of backpropagation.

Gradients of weights calculated in backpropagation step (Published by author)

The “backprop term” corresponds to the derivative of the unregularized cost function with respect to the weight matrix. We can assume that it remains the same for a particular iteration of gradient descent. One implication of adding a regularization term is that a new term involving lambda and m appears in the definition of dW. Now, if we update the parameters, we have the following implications:

Updating the weights (Published by author)
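Written out, and assuming the usual convention in which the penalty is (λ / 2m) times the squared norm of W, the update takes the form below. The learning rate α is not named in the text and is introduced here only for illustration, so this should be read as a sketch of the standard derivation.

```latex
\begin{aligned}
dW &= (\text{backprop term}) + \frac{\lambda}{m}\, W \\
W  &:= W - \alpha\, dW
    = \left(1 - \frac{\alpha \lambda}{m}\right) W - \alpha\,(\text{backprop term})
\end{aligned}
```

The factor (1 − αλ/m) is slightly below 1 for small positive α and λ, so every update first shrinks W and then applies the usual gradient step.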

It can be inferred straight away that when we apply L2 regularization, we are shrinking the elements of W. Most methods that deal with problems like high variance reduce the weights in some way. That is why this process is also called “weight decay”.
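The shrinking effect can be checked numerically. The sketch below assumes a gradient of the form dW = backprop term + (λ / m) W and a learning rate alpha (both the function name and the values are hypothetical). Setting the backprop term to zero isolates the effect of the regularization term.

```python
import numpy as np

def l2_update(W, backprop_term, alpha, lam, m):
    # Gradient with the extra (lam / m) * W term from L2 regularization.
    dW = backprop_term + (lam / m) * W
    return W - alpha * dW

W = np.array([[2.0, -4.0], [1.0, 0.5]])
zero_grad = np.zeros_like(W)  # isolate the regularization effect
alpha, lam, m = 0.1, 5.0, 10

W_next = l2_update(W, zero_grad, alpha, lam, m)
decay = 1 - alpha * lam / m  # multiplicative shrink factor per step
print(decay)
print(W_next)
```

Every entry of W is multiplied by the same factor below 1, so repeated updates pull the weights toward zero without ever making them exactly zero.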

If we have a very deep neural network with lots of hidden layers and hidden units, then it will tend to overfit the data. Once we apply L2 regularization, indirectly we are reducing the weights of some hidden units to nearly zero but not exactly zero.

Now suppose we have a dense neural net as follows, and we apply L2 regularization, nearly shutting off some hidden units.

Neurons shutting down after adding a regularization term in the cost function (Published by author)

Completely shutting off some neurons is an extreme case, but it helps us understand the implications of L2 regularization. The remaining neural network can now be combined as follows:

Multiple sequential neural units can be replaced by a single neural unit. (Published by author)

This is possible when each neuron behaves approximately linearly in its inputs (for example, with small weights a tanh activation stays in its nearly linear region). If we feed one linear function into another linear function, the resulting function is also linear. At the end of the day, we are just altering the constants of the linear function.

This extreme case implies that, by applying L2 regularization, we went from a very complex function (generated by a dense neural net) to a much simpler linear function. In other words, we moved from a model that overfits the data to a model that underfits it, or from a model with high variance to a model with high bias.
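The collapse of stacked linear units into a single one can be verified directly. The sketch below assumes two layers with purely linear (identity) activations and randomly chosen, hypothetical weights. It then builds the single equivalent layer and compares the outputs.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two stacked layers with identity activations (an assumption).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=(4, 1))
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=(1, 1))

x = rng.normal(size=(3, 5))  # 3 features, 5 examples

two_layer = W2 @ (W1 @ x + b1) + b2

# Equivalent single layer: only the constants change.
W = W2 @ W1
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layer, one_layer))
```

However many linear layers are stacked, the same algebra applies, so the network can represent nothing more complex than a single linear function.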

Another observation is that the optimal solution for our problem lies somewhere along the path from overfitting to underfitting the data. This destination can be reached by tuning the hyperparameter lambda, although tuning hyperparameters takes patience.

High variance to high bias via ‘Perfection’ (Published by author)

There are other regularization techniques, such as inverted dropout (or simply dropout), which randomly switches off neural units. All these regularization techniques do the same job of reducing the complexity of the function mapped by the model.
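A minimal sketch of inverted dropout on one layer's activations is shown below. The function name, the keep_prob value, and the all-ones activations are assumptions made for illustration. Dividing by keep_prob is what makes it “inverted”: it keeps the expected value of the activations unchanged.

```python
import numpy as np

def inverted_dropout(a, keep_prob, rng):
    # Keep each unit with probability keep_prob, zero it otherwise.
    mask = rng.random(a.shape) < keep_prob
    # Rescale the survivors so the expected activation is unchanged.
    return (a * mask) / keep_prob, mask

rng = np.random.default_rng(0)
a = np.ones((1000, 50))  # hypothetical activations
a_drop, mask = inverted_dropout(a, keep_prob=0.8, rng=rng)

print(mask.mean())    # fraction of units kept
print(a_drop.mean())  # stays close to the original mean of 1
```

In practice a mask like this is applied only during training, and a fresh mask is drawn for every forward pass.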

A Review of the “Early Stopping” Method

“Early stopping” is another popular method for avoiding overfitting. In this method, we define a training set and a test (or development) set and observe how the cost function varies on the two sets with respect to the number of iterations.

As the number of iterations increases, the error on the train set keeps falling, but the behavior on the test set is notable. The test cost first dips and then starts to rise, and this rise indicates the start of overfitting.

Variation of cost of training and test sets with the number of iterations (Published by author)
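One simple way to act on such a curve is to remember the iteration with the lowest development-set cost and stop once the cost has failed to improve for a few iterations. The sketch below assumes a hypothetical list of recorded costs and a hypothetical patience parameter. It is one of several possible stopping rules.

```python
def early_stopping_index(dev_costs, patience=3):
    # Returns the iteration with the lowest dev cost seen before
    # `patience` consecutive iterations pass without improvement.
    best_cost = float("inf")
    best_iter = -1
    for i, cost in enumerate(dev_costs):
        if cost < best_cost:
            best_cost, best_iter = cost, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` iterations
    return best_iter, best_cost

# Hypothetical dev-set costs: they dip, then rise again.
dev_costs = [0.90, 0.70, 0.55, 0.48, 0.45, 0.47, 0.50, 0.56, 0.63]
best_iter, best_cost = early_stopping_index(dev_costs, patience=3)
print(best_iter, best_cost)
```

A real training loop would also save the model parameters at the best iteration so that they can be restored after stopping.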

This method may not be that accurate, because we have two different tasks: first, optimizing the cost function J, and second, preventing overfitting. These two tasks are best solved separately. “Early stopping” tackles both at the same time, and hence it does not optimize the cost function efficiently.

By using other techniques, such as regularization, you will be able to optimize your cost function with more confidence and accuracy.

A Common Solution

A common solution to both of the problems is to provide more training data. Sometimes, getting more training data can be costly. However, there are methods like data augmentation that generate additional training data.

For example, we can generate new images from pre-existing images by randomly zooming, cropping, or flipping the images.

Data Augmentation (original photo by Victor Grabarczyk on Unsplash)
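A rough NumPy sketch of two of these operations, a random horizontal flip followed by a random crop, is given below. The augment function, the crop size, and the synthetic 8×8 “image” are hypothetical stand-ins. A real pipeline would load actual photos and would typically use an image library.

```python
import numpy as np

def augment(image, rng, crop_size):
    # image has shape (height, width, channels).
    h, w = image.shape[:2]
    ch, cw = crop_size
    # Flip left-right with probability 0.5.
    if rng.random() < 0.5:
        image = image[:, ::-1]
    # Pick a random top-left corner so the crop stays inside the image.
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return image[top:top + ch, left:left + cw]

rng = np.random.default_rng(0)
image = np.arange(8 * 8 * 3).reshape(8, 8, 3)  # stand-in for a photo
batch = [augment(image, rng, crop_size=(6, 6)) for _ in range(4)]
print([b.shape for b in batch])
```

Each call produces a slightly different view of the same picture, which is how one labeled image can be turned into several training examples.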

Summing up

We have seen how we can reach an optimal model on the path from a high-variance model to a high-bias model.

However, we should note that L2 regularization decays the weight close to zero but not exactly zero, whereas, in dropout regularization, we randomly switch off some units.

There are various techniques other than regularization, such as normalization and gradient checking, which help you optimize your cost function and prevent overfitting, but all these methods are in the same box, i.e., they all reduce the complexity of the cost function or of the function mapped by the model on the data points. Once the complexity is reduced to a certain limit, an optimal solution can be found.

