Beyond Vanilla: Powerful Optimisers to Enhance Gradient Descent

Fine-Tune Your Gradient Descent Training with Powerful Optimisation Techniques

Rahul Gite
Level Up Coding


Gradient descent trains your neural network by minimising your defined cost function, updating the weights in the direction opposite to the gradient.


With deep learning and neural networks, we often have to experiment with different hyperparameter values to find the optimal model for our use case.

Algorithms that improve the speed of training can help a lot in that process, which makes them important to learn.

These algorithms are based on the concept of weighted moving averages, so let's discuss that first.

Weighted Moving Averages

In a weighted moving average, we define a tuneable parameter β and calculate the average as:

v_t = β·v_{t-1} + (1 − β)·a_t

where v_t represents the average at the current element, v_{t-1} represents the average up to the previous element, and a_t is the current element.

Image taken from DeepLearning.AI

The effect of a high β can be seen in the image above. The green curve represents the moving average with a higher β than the red one.

As seen, increasing β smooths the curve, but it also shifts it slightly to the right.

Weighted moving averages are used in multiple areas, machine learning (duh!) and finance to name a few.
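As a quick illustration, here is a minimal NumPy sketch of the weighted moving average (the function name ewma and the noisy sine data are just an example, not from the article):

```python
import numpy as np

def ewma(values, beta=0.9):
    """Exponentially weighted moving average: v_t = beta * v_{t-1} + (1 - beta) * a_t."""
    v = 0.0
    averages = []
    for a in values:
        v = beta * v + (1 - beta) * a
        averages.append(v)
    return np.array(averages)

# A higher beta gives a smoother curve that lags slightly behind the raw data.
noisy = np.sin(np.linspace(0, 10, 200)) + np.random.normal(0, 0.3, 200)
smooth = ewma(noisy, beta=0.9)
```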

Now let's learn about the different optimisations that improve gradient descent!

Momentum

With the original gradient descent, the formula to update the weights W and biases b is:

W = W − α·dW
b = b − α·db

where α is the learning rate and dW and db are the derivatives of the loss w.r.t. W and b respectively.
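In code, one vanilla update step looks something like this (the function name sgd_step is mine, for illustration):

```python
def sgd_step(W, b, dW, db, alpha=0.01):
    """One vanilla gradient descent step: move W and b against their gradients."""
    W = W - alpha * dW
    b = b - alpha * db
    return W, b
```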


With momentum, this update formula changes to:

v_dW = β·v_dW + (1 − β)·dW
v_db = β·v_db + (1 − β)·db
W = W − α·v_dW
b = b − α·v_db

As we can see above, we use the concept of moving averages to compute the amount by which the weights are updated.
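A small sketch of one momentum step (the helper name momentum_step and its arguments are my own, meant only to mirror the formulas above):

```python
def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """Momentum update: smooth the gradients with a weighted moving average
    and step in the direction of the smoothed gradient."""
    v_dW = beta * v_dW + (1 - beta) * dW   # moving average of dW
    v_db = beta * v_db + (1 - beta) * db   # moving average of db
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db
```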

RMSProp

RMSProp (Root Mean Squared Propagation) is another algorithm that helps speed up gradient descent.

It also uses the concept of weighted moving averages to update the weights and biases in the neural network model.

With RMSProp, the formulas become:

s_dW = β·s_dW + (1 − β)·dW²
s_db = β·s_db + (1 − β)·db²
W = W − α·dW / (√s_dW + ϵ)
b = b − α·db / (√s_db + ϵ)

where dW² and db² are computed element-wise before taking the moving averages.

We also add a small amount ϵ to the square root so that the denominator does not get too close to zero.
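As a rough NumPy sketch (names such as rmsprop_step, s_dW and s_db are illustrative, not from the article):

```python
import numpy as np

def rmsprop_step(W, b, dW, db, s_dW, s_db, alpha=0.001, beta=0.9, eps=1e-8):
    """RMSProp update: keep a moving average of the element-wise squared
    gradients and scale each step by the inverse of its square root."""
    s_dW = beta * s_dW + (1 - beta) * dW ** 2   # element-wise square
    s_db = beta * s_db + (1 - beta) * db ** 2
    W = W - alpha * dW / (np.sqrt(s_dW) + eps)
    b = b - alpha * db / (np.sqrt(s_db) + eps)
    return W, b, s_dW, s_db
```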

Adam

We have seen that Momentum and RMSProp both use the concept of weighted moving averages.

But what happens when we combine the two? We get the Adam (Adaptive Moment Estimation) algorithm.


With Adam, the weight update formula becomes:

v_dW = β_1·v_dW + (1 − β_1)·dW,  v_db = β_1·v_db + (1 − β_1)·db
s_dW = β_2·s_dW + (1 − β_2)·dW²,  s_db = β_2·s_db + (1 − β_2)·db²
v_dW_corrected = v_dW / (1 − β_1^t),  v_db_corrected = v_db / (1 − β_1^t)
s_dW_corrected = s_dW / (1 − β_2^t),  s_db_corrected = s_db / (1 − β_2^t)
W = W − α·v_dW_corrected / (√s_dW_corrected + ϵ)
b = b − α·v_db_corrected / (√s_db_corrected + ϵ)

The corrected terms perform the bias correction: the power t is the number of times this update step has been run, and dividing by (1 − β^t) takes care of the initial deviation of the moving averages, which start from zero.
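A minimal NumPy sketch of a single Adam step for one parameter tensor (the function and variable names are my own, chosen to mirror the equations above):

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter tensor W.
    v: moving average of the gradients (momentum term)
    s: moving average of the squared gradients (RMSProp term)
    t: 1-based count of how many times this step has run (for bias correction)."""
    v = beta1 * v + (1 - beta1) * dW
    s = beta2 * s + (1 - beta2) * dW ** 2
    v_hat = v / (1 - beta1 ** t)          # bias-corrected momentum term
    s_hat = s / (1 - beta2 ** t)          # bias-corrected RMSProp term
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s
```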

Why do these algorithms work?

To understand this, suppose we draw a contour of our loss function with the minimum in the centre and start gradient descent from a point, as shown below.

Loss function contour | Image by the Author

Now, with regular gradient descent, each step does not point directly towards the minimum; part of it moves in the perpendicular direction. This shows up as a kind of "vertical oscillation".

By employing moving averages, we essentially dampen the vertical oscillations, because their positive and negative components cancel out and their average tends towards zero over the iterations.

This also keeps the “horizontal component” (movement towards minimum) large enough to perform gradient descent effectively.

In this article we learned about the different optimisation algorithms used along with gradient descent to improve its speed. All these algorithms are implemented in popular ML libraries like TensorFlow.
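For example, in TensorFlow/Keras these optimisers are available off the shelf; the hyperparameter values below are simply the common defaults, not recommendations:

```python
import tensorflow as tf

# Gradient descent with momentum, RMSProp and Adam as Keras optimisers.
sgd_momentum = tf.keras.optimizers.SGD(learning_rate=0.01, momentum=0.9)
rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
adam = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)

# Swapping one for another is a one-line change, e.g.:
# model.compile(optimizer=adam, loss="categorical_crossentropy", metrics=["accuracy"])
```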

If you like this content, please give a clap. I will be writing about different things I learn and posting regularly. You can even comment on what you would like to see in the coming weeks. Happy coding!

References:

  1. Improving Deep Neural Networks course on Coursera.
  2. Math equations embedded from https://math.embed.fun/



I love to write about anything new that I learn from my work and in general. Let's connect on LinkedIn: www.linkedin.com/in/rahul-gite-1a829b177