A 2021 Guide to Improving CNNs: Recent Optimizers




Unlike previous proposals, which mostly improve on the Adam family of methods, Lookahead [3] takes a completely different approach to accelerating optimization. Lookahead keeps two sets of weights: an inner optimizer first updates the “fast weights” k times, and then the “slow weights” are updated once in the direction of the final fast weights. This procedure reduces the variance of the slow weights at negligible extra computational cost.

To clarify, the slow weights move only a fraction α of the difference between their previous state φ_{t−1} and the final fast weights θ_{t,k}, that is, φ_t = φ_{t−1} + α(θ_{t,k} − φ_{t−1}). According to the paper, we can benefit from the stability of the method and use a larger learning rate.
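The update described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's reference implementation: it assumes plain SGD as the inner optimizer, and the names `lookahead_sgd`, `grad_fn`, `inner_lr`, and `outer_steps` are hypothetical, chosen only for this example.

```python
def lookahead_sgd(grad_fn, phi, k=5, alpha=0.5, inner_lr=0.1, outer_steps=20):
    """Lookahead wrapped around plain SGD (illustrative sketch).

    grad_fn maps a list of weights to a list of gradients (assumed interface).
    phi is the initial list of slow weights.
    """
    phi = list(phi)
    for _ in range(outer_steps):
        theta = list(phi)  # fast weights start from the current slow weights
        for _ in range(k):  # k updates of the inner optimizer (here: SGD)
            grads = grad_fn(theta)
            theta = [w - inner_lr * g for w, g in zip(theta, grads)]
        # slow weights move a fraction alpha toward the final fast weights
        phi = [p + alpha * (w - p) for p, w in zip(phi, theta)]
    return phi


# Example: minimize f(w) = sum(w_i^2), whose gradient is 2 * w.
def quadratic_grad(weights):
    return [2.0 * w for w in weights]


print(lookahead_sgd(quadratic_grad, [3.0, -2.0]))
```

With α = 1 the slow weights simply copy the fast weights and the method reduces to the inner optimizer; smaller values of α interpolate more conservatively.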


AdaBelief [5] was proposed to achieve three goals simultaneously: fast convergence as in adaptive methods, good generalization as in SGD, and training stability. AdaBelief adapts the step size according to the “belief” in the current gradient direction. Viewing the exponential moving average (EMA) of the gradient as a prediction of the gradient at the next time step, the optimizer takes a smaller step if the observed gradient deviates greatly from that prediction.

The EMA of gradients is the momentum term m_t of the Adam optimizer, and AdaBelief intuitively treats m_t as a prediction of the gradient g_t. When the predicted and observed values are close ((g_t − m_t)² is small), AdaBelief takes a larger step; otherwise, it takes a smaller step.
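To make the rule concrete, here is a minimal scalar sketch of one AdaBelief update. It assumes the update from the paper [5]: Adam's second-moment estimate is replaced by an EMA of (g_t − m_t)², with a small ε added inside that EMA and Adam-style bias correction. The function name `adabelief_step` and its return convention are hypothetical, and weight decay and other refinements are left out.

```python
import math


def adabelief_step(grad, m, s, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One scalar AdaBelief update (sketch). Apply it as: theta -= step.

    m is the EMA of gradients, s is the EMA of squared prediction errors,
    and t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad                   # prediction of the gradient
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps  # how wrong the prediction was
    m_hat = m / (1 - beta1 ** t)                         # bias correction, as in Adam
    s_hat = s / (1 - beta2 ** t)
    step = lr * m_hat / (math.sqrt(s_hat) + eps)
    return step, m, s


# A constant gradient becomes easier to predict over time,
# so the "belief" increases and the step grows.
m, s, steps = 0.0, 0.0, []
for t in range(1, 51):
    step, m, s = adabelief_step(2.0, m, s, t)
    steps.append(step)
print(steps[0], steps[-1])
```

With a constant gradient, Adam's normalized step would stay close to the learning rate, while here the step grows as m_t approaches g_t, which is the "larger step when the prediction is good" behavior described above.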

According to the paper, AdaBelief has the following properties:

  • Fast convergence as in adaptive methods.
  • Good or better generalization as in the SGD family.
  • Training stability in complex settings such as GANs.
  • No additional hyperparameters beyond those of Adam.

More optimizers

The effectiveness of an optimizer can depend heavily on the training configuration and hyperparameters, so the best optimizer for a real-world task may only be found through a tedious search. Such effort might or might not be worth spending, since the benefit of newer optimizers is not generally recognized and remains unclear. However, according to the experiments presented in the papers above, optimizer selection seems to have a large effect on generalization performance.

The following repository provides PyTorch implementations of all the optimizers introduced in this post, and more.

Learning Rate Schedule

A learning rate schedule is a method for modifying the learning rate during training to improve performance. We often decay the learning rate by a fixed ratio after a fixed number of iterations (step decay), or use some function of the current iteration that outputs the learning rate.
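As a point of reference for the schedules discussed next, step decay can be written as a one-line function. This is only a sketch: the function name `step_decay` and the default values (multiply by 0.1 every 30 epochs) are illustrative assumptions, not settings taken from any of the cited papers.

```python
def step_decay(base_lr, epoch, drop=0.1, epochs_per_drop=30):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return base_lr * drop ** (epoch // epochs_per_drop)


for epoch in (0, 29, 30, 60):
    print(epoch, step_decay(0.1, epoch))
```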

Surprisingly, advanced learning rate schedules, especially the cosine learning rate decay discussed next, can have a large effect on the performance of neural networks.

Image from [10]


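In short, SGDR [6] decays the learning rate using cosine annealing: η_t = η_min + ½(η_max − η_min)(1 + cos(π·T_cur / T_i)), where T_cur is the number of epochs since the last restart and T_i is the length of the current cycle.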

In addition to cosine annealing, the paper simulates a warm restart every T_i epochs, where the period T_i is gradually increased during training. This is to denoise the incoming information, since gradients and loss values can vary widely from one batch of the data to another.
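The schedule can be sketched as a small function of the epoch index. This is a hedged illustration of cosine annealing with warm restarts, not the authors' code: it assumes whole-epoch resolution and a cycle length that is multiplied by a constant factor after each restart, and the names `sgdr_lr`, `t_0`, and `t_mult` are chosen for this example.

```python
import math


def sgdr_lr(epoch, eta_min=0.0, eta_max=0.1, t_0=10, t_mult=2):
    """Learning rate at `epoch` under cosine annealing with warm restarts.

    t_0 is the length of the first cycle; each later cycle is t_mult times
    longer than the previous one (assumed growth rule for this sketch).
    """
    t_i, t_cur = t_0, epoch
    while t_cur >= t_i:  # skip over completed cycles
        t_cur -= t_i
        t_i *= t_mult
    cosine = math.cos(math.pi * t_cur / t_i)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine)


# The rate decays within a cycle and jumps back to eta_max at each restart.
for epoch in (0, 5, 9, 10, 20, 29, 30):
    print(epoch, round(sgdr_lr(epoch), 4))
```

In PyTorch, `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts` provides a comparable schedule with `T_0`, `T_mult`, and `eta_min` arguments, so a hand-written function like this is mainly useful for understanding or plotting the curve.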

This post describes the intuition behind such learning rate restarts. In short, the surge in the learning rate could theoretically push the model out of bad local minima. The resulting learning rate schedule is plotted in the figure below.

The results of the experiments are shown in the figure below. Step learning rate decay (blue, red) was unstable and slow, and even had worse final performance than the SGDR learning rate schedule. Warm restarts did provide a significant speed boost: the dark green and purple schedules, which restart very frequently, showed very fast initial progress.

Most significantly, the SGDR paper [6] reports reaching comparable or better results in roughly 2× to 4× fewer epochs than the step-decay schedules it was compared against.


In this post, we reviewed the methods and background of some modern optimizers. We observed that the selection of optimizers has a significant impact on training speed and final performance. We saw that the novel AdaBelief outperformed other optimizers in its presented experiments.

However, I am unsure whether this superior performance generalizes to arbitrary training settings without further hyperparameter tuning. Because classical optimizers such as Adam and SGD remain prevalent despite the potential accuracy and speed gains of modern optimizers, there is still uncertainty about how broadly these newer optimizers apply.

One thing to note is that most of these optimizers were built on the claim that SGD generalizes better than adaptive methods such as Adam. However, it is unclear whether that claim is true, since there are opposing results [8]. The generalization performance of SGD versus Adam was explored in my previous post.


[1] Loshchilov, I., & Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.

[2] Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(7).

[3] Zhang, M. R., Lucas, J., Hinton, G., & Ba, J. (2019). Lookahead optimizer: k steps forward, 1 step back. arXiv preprint arXiv:1907.08610.

[4] Luo, L., Xiong, Y., Liu, Y., & Sun, X. (2019). Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843.

[5] Zhuang, J., Tang, T., Ding, Y., Tatikonda, S., Dvornek, N., Papademetris, X., & Duncan, J. S. (2020). AdaBelief optimizer: Adapting stepsizes by the belief in observed gradients. arXiv preprint arXiv:2010.07468.

[6] Loshchilov, I., & Hutter, F. (2016). SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

[7] Keskar, N. S., & Socher, R. (2017). Improving generalization performance by switching from Adam to SGD. arXiv preprint arXiv:1712.07628.

[8] Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., & Dahl, G. E. (2019). On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446.

[9] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., & Recht, B. (2017). The marginal value of adaptive gradient methods in machine learning. arXiv preprint arXiv:1705.08292.

[10] Bello, I., Fedus, W., Du, X., Cubuk, E. D., Srinivas, A., Lin, T. Y., … & Zoph, B. (2021). Revisiting ResNets: Improved training and scaling strategies. arXiv preprint arXiv:2103.07579.

[11] You, Y., Li, J., Reddi, S., Hseu, J., Kumar, S., Bhojanapalli, S., … & Hsieh, C. J. (2019). Large batch optimization for deep learning: Training BERT in 76 minutes. arXiv preprint arXiv:1904.00962.

[12] Hardt, M., Recht, B., & Singer, Y. (2016, June). Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning (pp. 1225–1234). PMLR.

[13] Reddi, S., Zaheer, M., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. In Proceedings of the 32nd Conference on Neural Information Processing Systems (NIPS 2018).


