Gradient descent is the most widely used algorithm for optimizing neural networks in deep learning. There are many flavors of gradient descent, so let's discuss a few of them.
Batch Gradient Descent (BGD)
In Batch Gradient Descent, we process the entire training dataset in one iteration. We calculate the error and gradient for each observation in the training dataset, but only update the model's weights once, at the end, after all the training observations have been evaluated.
One cycle through the entire training dataset is called a training epoch. Therefore, it is often said that batch gradient descent performs model updates at the end of each training epoch.
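To make the "one update per epoch" idea concrete, here is a minimal sketch in plain Python. It assumes a toy model y = w*x + b trained with mean squared error; the function name, learning rate, epoch count and data are all made up for illustration and are not from any particular library.

```python
def batch_gradient_descent(xs, ys, lr=0.05, epochs=500):
    """Fit y = w*x + b with batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        # Accumulate the gradient over every observation; w and b stay fixed here.
        for x, y in zip(xs, ys):
            error = (w * x + b) - y
            grad_w += 2 * error * x / n
            grad_b += 2 * error / n
        # Exactly one model update per epoch, after the whole dataset was seen.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = batch_gradient_descent(xs, ys)
print(round(w, 3), round(b, 3))
```

Note that the inner loop only accumulates gradients; the weights change once per pass over the data.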
Advantages
1. It makes fewer updates to the model, since it computes the gradients for the whole dataset and then updates once per epoch.
2. It generally leads to a more stable error gradient and hence more stable convergence.
Disadvantages
1. The stable error gradient can sometimes cause premature convergence to a local minimum.
2. Training usually becomes very slow on large datasets, because every single update requires a full pass over the data.
3. It requires a lot of memory during gradient computation, as the gradient is computed over the entire dataset.
Stochastic Gradient Descent (SGD)
In Stochastic Gradient Descent, we process a single observation (instead of the entire dataset) from the training dataset in each iteration. We calculate the error and gradient for that observation and update the model immediately, repeating this for each observation in the training dataset.
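The per-observation update can be sketched as follows. This is only an illustration under assumptions: the same toy model y = w*x + b with squared error, a hypothetical function name, and hand-picked values for the learning rate, epoch count and seed. Shuffling the observations each epoch is a common choice, not something the description above requires.

```python
import random


def stochastic_gradient_descent(xs, ys, lr=0.02, epochs=300, seed=0):
    """Fit y = w*x + b, updating the model after every single observation."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # visit the observations in a new random order
        for i in indices:
            error = (w * xs[i] + b) - ys[i]
            # The gradient comes from one observation and is applied at once.
            w -= lr * 2 * error * xs[i]
            b -= lr * 2 * error
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = stochastic_gradient_descent(xs, ys)
print(round(w, 2), round(b, 2))
```

With n observations, this loop performs n model updates per epoch instead of one.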
Advantages
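1. Each gradient computation is cheaper than in batch gradient descent, because it uses only one observation.
2. The model starts improving from the very first observation, so learning can be much faster than with batch gradient descent.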
Disadvantages
1. It updates the model for every observation. Performing so many separate updates is computationally expensive, and a full pass over a large dataset can take significantly longer than with the other variants.
2. The frequent updates can also result in a noisy gradient signal, which makes the error jump around instead of decreasing smoothly.
Mini Batch Gradient Descent (MBGD)
In Mini Batch Gradient Descent, we process a small subset of the training dataset in each iteration, so it is a compromise between BGD and SGD. It balances the robustness of stochastic gradient descent against the efficiency of batch gradient descent. Because it combines the strengths of both, it is the most widely used form of gradient descent in deep learning.
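Here is a minimal sketch of the mini-batch variant, again assuming the toy model y = w*x + b with mean squared error. The function name, the default batch size of 2, the learning rate and the epoch count are illustrative assumptions, chosen only so the example runs on a five-point dataset.

```python
import random


def mini_batch_gradient_descent(xs, ys, batch_size=2, lr=0.02, epochs=1000, seed=0):
    """Fit y = w*x + b, updating the model once per mini-batch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        # Walk through the shuffled data in chunks of batch_size.
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]  # last batch may be smaller
            grad_w, grad_b = 0.0, 0.0
            for i in batch:
                error = (w * xs[i] + b) - ys[i]
                # Average the gradient over the observations in this batch.
                grad_w += 2 * error * xs[i] / len(batch)
                grad_b += 2 * error / len(batch)
            # One model update per mini-batch.
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = mini_batch_gradient_descent(xs, ys)
print(round(w, 2), round(b, 2))
```

Setting batch_size to 1 or to len(xs) in this sketch gives the stochastic and batch behaviour respectively.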
Advantages
1. The model is updated more often than in batch gradient descent, which allows for more robust convergence and helps avoid local minima.
2. Each update is faster than in BGD, as it uses a small subset of the training dataset rather than the entire dataset.
3. It gives a more accurate, less noisy gradient estimate than SGD, which uses only one data point in each iteration.
Disadvantages
1. Batch size is an important hyper-parameter, and its best value varies with the dataset, so choosing it is a crucial step in MBGD. Some papers suggest that a batch size of 32 is a good default to start from.
Summary
Batch gradient descent, mini-batch gradient descent and stochastic gradient descent differ only in the batch size m used for each update, relative to a training set of size n.
For stochastic gradient descent, m=1.
For batch gradient descent, m = n.
For mini-batch gradient descent, m = b where 1 < b < n.
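One practical consequence of the choice of m is the number of model updates per epoch. The small sketch below computes it, assuming the last, possibly smaller, batch is kept rather than dropped. The helper name and the values n = 1000 and b = 32 are only for illustration.

```python
import math


def updates_per_epoch(n, m):
    """Number of model updates in one epoch for n observations and batch size m."""
    return math.ceil(n / m)


n = 1000
print(updates_per_epoch(n, 1))    # stochastic: 1000 updates
print(updates_per_epoch(n, 32))   # mini-batch: 32 updates
print(updates_per_epoch(n, n))    # batch: 1 update
```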
Which one to use?
I prefer mini-batch gradient descent, as it combines the advantages of batch gradient descent and stochastic gradient descent and is suitable for datasets of any size.
Note: There are some advanced versions of Gradient Descent like NAG (Nesterov Accelerated Gradient), AdaGrad, AdaDelta, RMSprop and Adam (Adaptive Moment Estimation). I have written a separate article on these algorithms.