Original Source Here

# Adagrad and Adadelta Optimizer: In-Depth Explanation

In my previous article “Optimizers in Machine Learning and Deep Learning”. I gave a brief introduction about Adagrad optimizers. In this article, I will try to give an in-depth explanation of the optimizer’s algorithm.

If you didn’t read my previous article. I recommend you to first go through my previous articles on optimizers mentioned below and then come back to this article for better understanding:

Optimizers in Machine Learning and Deep Learning

Gradient descent Algorithm: In-Depth explanation

Let’s start…

# Adagrad

**Adagrad stands for an adaptive gradient**. Adagrad is** **an effective algorithm for gradient-based optimization. It adapts the learning rate to the parameters, using low learning rates for parameters associated with frequently occurring features, and using high learning rates for parameters associated with rare features.

Therefore, it is well suited when dealing with sparse data..

But the same update rate may not fit all parameters. For example, some parameters may have reached a stage where only fine adjustment is required, but some parameters need to be adjusted significantly due to the small number of matching samples.

Adagrad raised this issue, an algorithm that offers different learning rates at different parameters between them. What this means is that for each parameter, as its total distance updated increases, its learning rate is also slow.

GloVe word embedding uses **Adagrad **where rare words need major updates and common words need a little update.

Adagrad removes the need to manually tune the learning rate.

There are mainly three problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate in the late training period is very small.
- it needs to be set by doing the initial global learning rate.

Let’s see how it works

In this algorithm, we try to change the learning rate (alpha) for each update. The learning rate changes during each update as it will decrease if the weight is significantly updated in the short term again and it will increase if the weight is not significantly updated.

First, each weight has its own cache value, which collects the squares of the gradients up to the current point.

The cache value will continue to increase as training continues. Now a new update formula can be provided as mentioned below:

The above formula is the same as the original gradient descent formula except that here the learning rate (alpha) constantly changes throughout the training process. The E in the denominator which is shown in the above formula is a very small value which helps us to ensure that the division by zero does not occur.

Essentially what’s happening here is that if a weight has been having **very huge updates**, its cache value is also going to **increase**. As a result, the learning rate will be lower and the size of the weight update will decrease over time.

On the other hand, if a weight has not been having any significant update, its cache value is going to be very less, and hence its learning rate will increase, forcing it to take bigger updates. This is the basic principle of the Adagrad optimizer.

However, the disadvantage of this algorithm is that even if there are previous weight gradients, the cache will always increase by a certain amount because the square cannot be negative. Therefore the learning rate of all weights will eventually drop to a very low level until the training is less intense.

Adagrad can be visualized as:

**Python code for Adagrad**

`def adagrad():`

weights, bais, eta = init_w, init_b, 0.1

v_w, v_b, eps = 0, 0, 1e-8

for i in range(max_epochs):

dw, db = 0, 0

for x,y in zip(X,Y):

dw += grad_w(weights, bais, x, y)

db += grad_b(weights, bais, x, y)

v_w = v_w + dw**2

v_b = v_b + db**2

weights_2 = weights - (eta / np.sqrt(v_w + eps)) * dw

bais_2 = bais - (eta / np.sqrt(v_b + eps)) * db

To overcome the problem of Adagrad there is many other optimizers algorithm available. One of them is **Adadelta**.

# Adadelta

In Adadelta, we do not need to set a default reading rate as we take the effective rate of past steps to the current gradient.

There are three major problems that arise with the Adagrad algorithm.

- The learning rate is monotonically decreasing.
- The learning rate during the late training period is very low.
- it needs to be set by doing the initial global learning rate.

Adadelta is an extension of Adagrad and also tries to reduce Adagrad’s rate of learning, excessively.

It does this by limiting the gradient window that has been exceeded to a certain size ** w**. Running average at time

**then depends on the previous average and the current gradient.**

*t*In Adadelta, we** **don’t have to set the default learning rate as we take the ratio of the running average of the previous time steps to the current gradient.

Let’s see and understand how its work

In the Adadelta optimizer algorithm, it will try not to accumulate all past squared gradients values. It instead tries to restrict the window of accumulated past gradients to some fixed size **(say w).**

Here, it Instead of inefficiently storing w previous squared gradients value, the sum of gradients is recursively defined as a decaying average of all past squared gradients.

The running average **E[g2]t** at time step** t** then depends (as a fraction** γ** similarly to the Momentum term) only on the previous average and the current gradient value:

Next, we set **γ **to a similar value as the momentum term, say around 0.8. To be more specific, lets now rewrite our vanilla SGD update as shown in the image below according to the parameter update vector Δθt:

The parameter update vector of Adagrad that we derived previously can also be written as shown below:

Now we can simply replace the diagonal matrix** Gt** with the decaying average over past squared gradients **E[g2]t **as shown below:

Now as the denominator is just the root mean squared (RMS) error of the gradient, we can replace it with the criterion short-hand as shown below:

**Note**: The units in this update (as well as in SGD, Momentum, or Adagrad) are incompatible, meaning that the update must have the same assumptions as of the parameter. To realize this, we have to first define another exponentially decaying average, this time not of squared gradients but of squared parameter updates. It is shown below:

The **root means a squared error** of parameter updates can be given by as follows:

Since **RMS[Δθ]t **is unknown, we approximate it with the RMS of parameter updates until the previous time step. Replacing the learning rate** η **in the previous update rule with **RMS[Δθ]t−1**. Finally, the Adadelta update rule can be given as shown below:

**Note**: With Adadelta, we do not even need to set a default learning rate, as it has been eliminated from the update rule.

**Python code to write your own function for Adadelta**

`# Adadalta`

def adadelta(params, sqrs, deltas, rho, batch_size):

eps_stable = 1e-5

for param, sqr, delta in zip(params, sqrs, deltas):

g = param.grad / batch_size

sqr[:] = rho * sqr + (1. - rho) * nd.square(g)

cur_delta = nd.sqrt(delta + eps_stable) / nd.sqrt(sqr + eps_stable) * g

delta[:] = rho * delta + (1. - rho) * cur_delta * cur_delta

# update weight

param[:] -= cur_delta

**Adadelta can be visualized as:**

I hope after reading this article, finally, you came to know **what is Adagrad and Adadelta optimizers algorithms and their importance**. In the next articles, I will come with a detailed explanation of some other types of optimizers.** **For more blogs/courses on data science, machine learning, artificial intelligence, and new technologies do visit us at **InsideAIML.**

Thanks for reading…

**Related courses for you:**

Machine Learning with Python & Statistics

**Related blogs for you :**

Adam Optimizer: In-depth explanation

AI/ML

Trending AI/ML Article Identified & Digested via Granola by Ramsey Elbasheer; a Machine-Driven RSS Bot