The Hitchhiker’s Guide to Optimization in Machine Learning

Original Source Here

Gradient Descent Algorithm

Gradient descent is one of the easiest to implement (and arguably one of the worst) optimization algorithms in machine learning. It is a first-order (i.e., gradient-based) optimization algorithm where we iteratively update the parameters of a differentiable cost function until its minimum is attained.

Before we understand how gradient descent works, first let us have a look at the generalized formula of GD:

Gradient descent (Image by author)

The basic idea here is to update the model parameters in the direction of the negative gradient of the loss function w.r.t. the parameter. To give you a better intuition, you can think of gradient descent as follows:

Let us imagine that our cost function is a mountainous region, and the aim of our optimization function, gradient descent, is to find the valley in this mountainous region (i.e., the minima of the cost function). Continuing with the analogy of the mountainous region, let’s assume that a blindfolded person is dropped somewhere on the mountain, and is asked to get to the base of the mountain.

Now, owing to the blindfold, naturally, the person won’t know the direction to the base of the mountain. Therefore, the best information available to the person regarding the direction to the base will be the direction of the steepest slope at every step they take. Hence, following the gradient descent approach, our blindfolded person will move in such a way that each step that they take will be in the direction of the steepest descent.

The descent down a mountain (Image by author)

Basically, the gradient descent algorithm follows a greedy search strategy, where there might be a better solution in some other direction, however, the parameters get updated only in the direction relative to the gradient (or slope).

This search strategy however has a major flaw associated with it! Since it’s a greedy algorithm, there are high chances that the algorithm will run into a local minimum instead of a global minimum, and hence we might never converge to the optimal parameter values for the least model cost.

Effect of Learning Rate (γ)

When it comes to parameter updation in the gradient descent algorithm, the value of the hyperparameter γ, which denotes the step size (or learning rate) plays a key role.

  • If γ is too low, the chance of reaching an optimal solution will increase. However, the convergence rate, i.e., how fast the minimum is reached, will be decreased drastically. This simply means that the number of epochs the model will have to be trained for will increase when the learning rate is too low.
γ is too small (Image by author)
  • If γ is too large, the probability of attaining the optima is reduced as the model parameters tend to oscillate around the minima. However, the convergence rate increases as a result of the larger step size. Another noticeable disadvantage of a high learning rate is a phenomenon called divergence, which simply means loss explosion.
γ is too large (Image by author)

A practical way to deal with this trade-off between a high learning rate and a low learning rate is to have a variable learning rate— A larger learning rate for the initial epochs, then a reduced learning rate for the later epochs as we proceed further down the model training process. This will have the advantage of both a high learning rate (faster convergence) and a low learning rate (higher probability of reaching the optima).

While gradient descent is easy to implement, its drawback of being prone to frequently run into local optimum is something too significant to overlook.

What good is an optimization algorithm if it can’t actually minimize the model cost?

Hence, researchers came up with the modified, more optimized version of gradient descent- Stochastic Gradient Descent.


Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot

%d bloggers like this: