Momentum and Adaptive Learning based Gradient Descent Optimizers: Adagrad and Adam

In my previous article on Gradient Descent Optimizers, we discussed three types of Gradient Descent algorithms:


1. Batch Gradient Descent
2. Stochastic Gradient Descent
3. Mini Batch Gradient Descent


In this article, we will look at some more advanced versions of Gradient Descent, which can be categorized as:


1. Momentum based (Nesterov Momentum)
2. Based on adaptive learning rate (Adagrad, Adadelta, RMSprop)
3. Combination of momentum and adaptive learning rate (Adam)


Let's first understand a bit about momentum.


Momentum


Momentum helps accelerate SGD in the relevant direction, so it is a good idea to track a momentum term for every parameter. It has the following advantages:


1. Avoids local minima: Because momentum builds up speed and thereby increases the step size, the optimizer is less likely to get trapped in a local minimum.


2. Faster convergence: Momentum speeds up convergence because the accumulated velocity increases the step size.
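
As a rough illustration, here is a minimal NumPy sketch of the classic momentum update; the function name, the decay factor mu = 0.9, and the toy quadratic loss are illustrative choices of mine, not something prescribed by the article.

import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, mu=0.9):
    # velocity accumulates past gradients, so steps grow along
    # directions where the gradient keeps pointing the same way
    velocity = mu * velocity - lr * grads
    return params + velocity, velocity

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is simply w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = sgd_momentum_step(w, grads=w, velocity=v)
print(w)  # moves toward the minimum at [0, 0]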


Now, let's look at some flavors of SGD.


1. Nesterov Momentum


It uses the current momentum to approximate the next position of the parameters, and then calculates the gradient at that approximated position instead of at the current one. This look-ahead prevents the optimizer from moving too fast and makes it more responsive, which significantly improves the performance of SGD.
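
To make the look-ahead idea concrete, here is a minimal sketch assuming the gradient can be evaluated at an arbitrary point through a user-supplied grad_fn; the hyperparameter values are illustrative.

import numpy as np

def nesterov_step(params, velocity, grad_fn, lr=0.01, mu=0.9):
    # evaluate the gradient at the approximated next (look-ahead)
    # position instead of at the current position
    lookahead = params + mu * velocity
    velocity = mu * velocity - lr * grad_fn(lookahead)
    return params + velocity, velocity

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = nesterov_step(w, v, grad_fn=lambda p: p)
print(w)  # moves toward the minimum at [0, 0]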


2. Adagrad


It mainly focuses on an adaptive learning rate rather than momentum.


In standard SGD, the learning rate is constant, which means we move at the same speed regardless of the slope. That is impractical in real problems.


What if we know we should slow down or speed up? What if we know we should accelerate in one direction and decelerate in another? That is not possible with standard SGD.


Adagrad keeps adjusting the learning rate instead of using a constant one. It accumulates the sum of the squares of all past gradients and uses that sum to normalize the learning rate, so the effective learning rate becomes smaller or larger depending on how the past gradients behaved.


It adapts the learning rate to the parameters, performing smaller updates (i.e. low learning rates) for parameters associated with frequently occurring features, and larger updates (i.e. high learning rates) for parameters associated with infrequent features. For this reason, it is well-suited for dealing with sparse data.
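
Below is a minimal NumPy sketch of the Adagrad update described above; eps is the usual small constant for numerical stability, and the hyperparameter values are illustrative choices of mine.

import numpy as np

def adagrad_step(params, grads, grad_sq_sum, lr=0.1, eps=1e-8):
    # accumulate squared gradients per parameter; dividing by their
    # square root shrinks the step for frequently updated parameters
    grad_sq_sum = grad_sq_sum + grads ** 2
    params = params - lr * grads / (np.sqrt(grad_sq_sum) + eps)
    return params, grad_sq_sum

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
s = np.zeros_like(w)
for _ in range(500):
    w, s = adagrad_step(w, grads=w, grad_sq_sum=s)
print(w)  # steps shrink over time as squared gradients accumulate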


2A. AdaDelta and RMSprop


AdaDelta and RMSprop are extensions of Adagrad.


As discussed in the Adagrad section, Adagrad accumulates the sum of the squares of all past gradients and uses that sum to normalize the learning rate. This creates a problem: the effective learning rate keeps decreasing, and at some point learning almost stops.


To handle this issue, AdaDelta and RMSprop decay the accumulated squared gradients so that only a portion of the past gradients is considered. Instead of summing all past squared gradients, they keep an exponentially decaying moving average.
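
Here is a minimal sketch of the RMSprop form of that moving average (AdaDelta additionally rescales the step by a moving average of past updates, which is omitted here); rho = 0.9 is the commonly used decay rate, and the rest is illustrative.

import numpy as np

def rmsprop_step(params, grads, avg_sq, lr=0.01, rho=0.9, eps=1e-8):
    # exponentially decaying average of squared gradients: old gradients
    # fade away, so the learning rate does not shrink to zero as in Adagrad
    avg_sq = rho * avg_sq + (1 - rho) * grads ** 2
    params = params - lr * grads / (np.sqrt(avg_sq) + eps)
    return params, avg_sq

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
a = np.zeros_like(w)
for _ in range(500):
    w, a = rmsprop_step(w, grads=w, avg_sq=a)
print(w)  # moves toward the minimum at [0, 0]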


3. Adam


Adam is one of the most effective Gradient Descent optimizers and is widely used. It combines the strengths of both momentum and adaptive learning rates; in other words, Adam is essentially RMSprop (or AdaDelta) with momentum. It maintains a moving average of the gradients (the momentum term) and also normalizes the learning rate using a moving average of the squared gradients.
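
Below is a minimal sketch of the Adam update with the standard bias correction; the defaults lr = 0.001, beta1 = 0.9, beta2 = 0.999 are the commonly cited values, and the toy loop uses a larger learning rate only so the example converges quickly.

import numpy as np

def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grads        # moving average of gradients (momentum term)
    v = beta2 * v + (1 - beta2) * grads ** 2   # moving average of squared gradients (adaptive term)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the first moment
    v_hat = v / (1 - beta2 ** t)               # bias correction for the second moment
    params = params - lr * m_hat / (np.sqrt(v_hat) + eps)
    return params, m, v

# Toy usage on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 301):
    w, m, v = adam_step(w, grads=w, m=m, v=v, t=t, lr=0.1)
print(w)  # moves toward the minimum at [0, 0]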


Conclusion: Most of the above Gradient Descent methods are already implemented in popular Deep Learning frameworks such as TensorFlow, Keras, and Caffe. Adam is currently the default recommendation for most problems, as it combines both momentum and adaptive learning rates.


For more details on the above algorithms, I strongly recommend this and this article.
