All you need to know about Gradient Descent

Introduction

Greetings! In this blog, I will be talking about Gradient Descent, one of the fundamental topics that anyone studying machine learning must know. I will try to explain it in a simple way, covering the different types of Gradient Descent along with their mathematical equations. Let’s get started!

What is Gradient Descent?

Gradient Descent is a generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to update weight parameters iteratively to minimize a cost function.

Suppose you are lost in the mountains in a dense fog; you can only feel the slope of the ground below your feet. A good strategy to get to the bottom of the valley quickly is to go downhill in the direction of the steepest slope. This is exactly what Gradient Descent does: it measures the local gradient of the error function with regard to the parameter vector θ, and it goes in the direction of the descending gradient. Once the gradient is zero, you have reached a minimum!

Generally, we first initialize the θ values randomly and then take small steps to minimize the cost function (e.g. MSE or RMSE) until we hopefully reach the global minimum.

The learning rate is the most important hyperparameter in Gradient Descent. It determines the size of the steps. If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.

Learning rate too low.

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Learning rate too high.

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step.
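
To make this concrete, here is a tiny toy example (my own illustration, not from the original article): plain Gradient Descent on f(θ) = θ², whose gradient is 2θ, run for 10 steps with three different learning rates.

def gd_final_theta(eta, n_steps=10, theta0=1.0):
    theta = theta0
    for _ in range(n_steps):
        theta = theta - eta * 2 * theta  # step: θ ← θ − η · ∇f(θ)
    return theta

print(gd_final_theta(0.01))  # too low: ≈ 0.82, still far from the minimum at 0
print(gd_final_theta(0.3))   # good: ≈ 0.0001, essentially converged
print(gd_final_theta(1.1))   # too high: ≈ 6.19, |θ| grows at every step (divergence)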

Finally, not all cost functions look like a nice regular bowl. There may be holes, ridges, plateaus, and all sorts of irregular terrain, which makes it difficult to converge to the global minimum.

Challenges with Gradient Descent.

The above image shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.
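
Here is a small toy demonstration of the first challenge (an assumed example of my own, not from the original): the function f(θ) = θ⁴ − 3θ² + θ has a local minimum near θ ≈ 1.13 and a global minimum near θ ≈ −1.30, and where Gradient Descent ends up depends entirely on the initialization.

def grad_f(theta):
    return 4 * theta**3 - 6 * theta + 1  # derivative of θ⁴ − 3θ² + θ

for theta0 in (0.5, -0.5):  # two different initializations
    theta = theta0
    for _ in range(1000):
        theta = theta - 0.01 * grad_f(theta)  # plain Gradient Descent step
    print(f"start {theta0:+} -> converged to {theta:.2f}")
# start +0.5 -> converged to 1.13  (stuck in the local minimum)
# start -0.5 -> converged to -1.30 (found the global minimum)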

To implement Gradient Descent, we need to compute how much the cost function will change if we change θj just a little bit. This is called a partial derivative. The equation below is for a single parameter θj; we have to compute it for every parameter, and each computation sums over every row (instance) of the training set.

Partial derivative of the cost function (MSE) with respect to parameter θj:

$\frac{\partial}{\partial \theta_j}\,\mathrm{MSE}(\theta) = \frac{2}{m} \sum_{i=1}^{m} \left(\theta^{T} x^{(i)} - y^{(i)}\right) x_j^{(i)}$
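
As a rough sketch of what this formula computes (the loop version below is my own illustration; X_b stands for the training matrix with a bias column, as used in the code later on):

import numpy as np

# Hedged sketch (assumed helper, not from the original): the partial derivative
# of the MSE with respect to a single parameter theta_j, one instance at a time.
def mse_partial_derivative(X_b, y, theta, j):
    m = len(X_b)
    total = 0.0
    for i in range(m):
        error = X_b[i].dot(theta) - y[i]  # prediction error θᵀx⁽ⁱ⁾ − y⁽ⁱ⁾
        total += error * X_b[i, j]        # weighted by feature j of instance i
    return 2.0 / m * total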

Types of Gradient Descent

Batch Gradient Descent

In plain Gradient Descent, we calculate the partial derivative for each instance individually and take the average, as shown in the above equation. Batch Gradient Descent computes all of these partial derivatives in one go, i.e. we calculate a single gradient vector to update θ, as shown below:

$\nabla_{\theta}\,\mathrm{MSE}(\theta) = \frac{2}{m} X^{T} (X\theta - y)$

Note that each feature has its own θ value, and every θ value is updated by Gradient Descent. This formula involves calculations over the full training set X at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result, it is terribly slow on very large training sets.

After calculating the gradients, it’s time to update the θ values using the equation below.

Gradient Descent step: $\theta^{(\text{next step})} = \theta - \eta\,\nabla_{\theta}\,\mathrm{MSE}(\theta)$, where η is the learning rate.

Now, let’s implement Batch Gradient Descent using Python:

import numpy as np

# generating random linear data: y = 4 + 3x + Gaussian noise
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 (bias term) to each instance

eta = 0.1  # learning rate
n_iterations = 1000
m = 100  # size of dataset

theta = np.random.randn(2, 1)  # random initialization

for iteration in range(n_iterations):
    gradients = 2/m * X_b.T.dot(X_b.dot(theta) - y)  # gradient over the full batch
    theta = theta - eta * gradients  # Gradient Descent step

print(theta)
# array([[4.21509616],
#        [2.77011339]])

Stochastic Gradient Descent

Stochastic Gradient Descent (SGD) picks a random instance in the training set at every step and computes the gradients based only on that single instance. This overcomes Batch GD’s main issue: because Batch GD uses the whole training set to compute the gradients at every step, it becomes very slow when the training set is large.

The SGD algorithm is much less regular compared to Batch GD: instead of decreasing gently to the minimum, the cost function will bounce up and down, decreasing only on average. So with SGD we get good parameter values, but not optimal ones.

To get closer to the optimum and reduce the bouncing in SGD, we can gradually decrease the learning rate. The steps start out large, then get smaller and smaller, allowing the algorithm to settle at the global minimum. The function that determines the learning rate at each iteration is called the learning schedule.

Note that if we reduce the learning rate too quickly, the algorithm may get stuck in a local minimum or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, it may jump around the minimum for a long time and end up with a suboptimal solution if we halt training too early.

Let’s dive into the implementation of SGD:

n_epochs = 50
t0, t1 = 5, 50  # learning schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = np.random.randn(2, 1)  # random initialization

for epoch in range(n_epochs):
    for i in range(m):
        # pick a random instance
        random_index = np.random.randint(m)
        xi = X_b[random_index:random_index+1]
        yi = y[random_index:random_index+1]
        # calculate the gradient on that single instance
        gradients = 2 * xi.T.dot(xi.dot(theta) - yi)
        # get the learning rate from the schedule
        eta = learning_schedule(epoch * m + i)
        # update theta
        theta = theta - eta * gradients

print(theta)
# array([[4.21076011],
#        [2.74856079]])

Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is Mini-batch Gradient Descent. Mini-batch GD computes the gradients on small random sets of instances called mini-batches. Its main advantage over Stochastic GD is the performance boost from hardware-optimized matrix operations, especially on GPUs. On the other hand, Mini-batch GD may find it harder to escape from local minima, since its path is less erratic than SGD’s; see the sketch below.
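
Here is a minimal sketch of Mini-batch GD (the batch size of 20, and the reuse of X_b, y, m and learning_schedule from the earlier snippets, are my own assumptions):

n_epochs = 50
minibatch_size = 20  # assumed batch size for illustration

theta = np.random.randn(2, 1)  # random initialization

t = 0
for epoch in range(n_epochs):
    # shuffle the training set, then walk through it in mini-batches
    shuffled_indices = np.random.permutation(m)
    X_b_shuffled = X_b[shuffled_indices]
    y_shuffled = y[shuffled_indices]
    for i in range(0, m, minibatch_size):
        t += 1
        xi = X_b_shuffled[i:i+minibatch_size]
        yi = y_shuffled[i:i+minibatch_size]
        # gradient averaged over the mini-batch
        gradients = 2/minibatch_size * xi.T.dot(xi.dot(theta) - yi)
        eta = learning_schedule(t)
        theta = theta - eta * gradients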

The graph below shows the paths taken by the three GD algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a long time for each step, and that Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.

So that’s all about Gradient Descent. Please share your views and comment down below if I missed any important points. Thank you 🙂
