Vanilla gradient descent works but is slow, memory-hungry, and gets stuck. The variants below each address a specific weakness, and Adam — which combines several of them — is the de facto default in modern deep learning.
The cost problem: sampling variants
Vanilla gradient descent computes the loss over the entire training set at every iteration:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{N} \sum_{i=1}^{N} L_i(\theta)$$

When $N$ is millions (or billions), one step can take minutes of compute. The fix is to estimate the gradient from a subset.
| Variant | Samples per step | Character |
|---|---|---|
| Full-batch GD | All samples | Accurate gradient, expensive, smooth convergence path |
| Stochastic GD (SGD) | Single random sample (addresses full-batch GD's cost) | Cheap gradient, very noisy, bounces around the optimum |
| Mini-batch GD | Random subset of size $B$, with $1 < B < N$ | Best of both — reasonable cost, acceptable noise. The middle ground between full-batch and stochastic GD. |
Note the extreme cases recover the other two: $B = 1$ is SGD, $B = N$ is full-batch GD. Common batch sizes are 32, 64, 128.
Because mini-batch exists as a contrast, vanilla gradient descent is often called “full-batch gradient descent” — the name emphasises that it uses the full training set on every step. If you see “full-batch GD” in a paper or textbook, it means the same thing as plain gradient descent.
ASIDE — "Stochastic" in practice
Strictly, SGD picks one sample at a time. But “SGD” is often used loosely to mean mini-batch GD as well, especially in deep learning frameworks. When someone says “we trained with SGD”, they almost always mean mini-batch SGD with some batch size.
Epochs vs iterations
An iteration is one parameter update (one forward/backward pass on one batch). An epoch is one full pass through the training set. With 1,000 samples and batch size 50, one epoch = 20 iterations.
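The epoch/iteration bookkeeping can be made concrete with a minimal mini-batch SGD loop. This is my own toy illustration (a one-weight linear regression with made-up data), not from the note; the numbers match the example above: 1,000 samples, batch size 50.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data y = 3x + noise (illustrative, not from the note)
N, B = 1_000, 50                  # 1,000 samples, batch size 50
X = rng.normal(size=N)
y = 3.0 * X + 0.1 * rng.normal(size=N)

w, eta = 0.0, 0.1                 # single weight, fixed learning rate
for epoch in range(5):
    perm = rng.permutation(N)     # reshuffle each epoch
    for start in range(0, N, B):  # one epoch = N / B = 20 iterations
        idx = perm[start:start + B]
        grad = 2 * np.mean((w * X[idx] - y[idx]) * X[idx])
        w -= eta * grad           # one iteration = one parameter update

iters_per_epoch = N // B
print(iters_per_epoch, round(w, 2))  # 20 iterations per epoch; w recovers ~3
```

Each pass of the inner loop is one iteration; each pass of the outer loop is one epoch, so 5 epochs perform 100 updates in total.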
Convergence behaviour
Full-batch GD moves steadily along a smooth trajectory toward the minimum — each step is informed by the entire dataset. Mini-batch / stochastic GD moves noisily, oscillating around its path because each step sees only a rough estimate of the true gradient. The noise has an upside: it can jump out of shallow local minima that would trap vanilla GD.
The local minima problem: momentum
Vanilla GD’s update only uses the current gradient — it has no memory of where it was heading. Narrow valleys cause it to oscillate across the steep walls; shallow basins trap it.
Momentum accumulates a velocity vector that blends the current gradient with the previous direction:

$$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta)$$
$$\theta \leftarrow \theta - v_t$$

- $\gamma$ is the decay (typical value: 0.9). Past velocity gradually fades.
- $0 \le \gamma < 1$.
- At $\gamma = 0$, momentum reduces to vanilla GD.
- At $\gamma = 1$, there is no decay: past gradients accumulate indefinitely and behaviour becomes unstable.
Intuition: a ball rolling downhill carries momentum from previous steps, so small uphill bumps don’t stop it dead — it coasts over shallow local minima toward deeper ones. Steep consistent directions get amplified (past velocity reinforces current gradient); oscillating directions cancel out (signs flip between steps).
Another way to see it: a driver who is confident about the road ahead drives faster, while on an unfamiliar road they slow down. A consistent past direction justifies bigger steps, so the optimiser reaches a solution sooner.
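The momentum update is a few lines of code. A minimal sketch (my own toy example on a quadratic objective; the function and constants are assumptions, not from the note):

```python
def momentum_step(theta, v, grad, eta=0.1, gamma=0.9):
    """One momentum update: v <- gamma*v + eta*grad, then theta <- theta - v.
    gamma=0 recovers vanilla GD; values near 1 keep most past velocity."""
    v = gamma * v + eta * grad
    return theta - v, v

# Minimise f(theta) = theta**2 (gradient 2*theta): a toy quadratic, not from the note
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = momentum_step(theta, v, grad=2 * theta)
print(round(theta, 4))  # the ball has coasted to the minimum at 0
```

Note the two state variables per parameter: the parameter itself and its velocity — this is the "memory" that vanilla GD lacks.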
The learning-rate problem: adaptive methods
Tuning a single global is brittle — different parameters may benefit from different step sizes. Adaptive optimisers adjust a per-parameter effective learning rate based on the history of gradients.
AdaGrad, RMSProp (sketch)
- AdaGrad divides each parameter’s update by the square root of the accumulated squared gradients for that parameter. Directions that have moved a lot get slowed down; directions that haven’t get amplified. Downside: the accumulator grows forever, so learning eventually stalls.
- RMSProp replaces the cumulative sum with an exponentially-decayed moving average, which prevents the slowdown.
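The difference between the two accumulators can be seen numerically. A sketch under the assumption of a constant gradient of 1.0 (purely illustrative; the constants are my own, not from the note):

```python
import math

eta, eps, rho = 0.5, 1e-8, 0.9
grad = 1.0                               # constant gradient, purely illustrative

G, s = 0.0, 0.0                          # AdaGrad sum vs RMSProp moving average
for t in range(1000):
    G += grad ** 2                       # AdaGrad: squared gradients accumulate forever
    s = rho * s + (1 - rho) * grad ** 2  # RMSProp: exponentially decayed average
ada_step = eta * grad / (math.sqrt(G) + eps)
rms_step = eta * grad / (math.sqrt(s) + eps)

# AdaGrad's effective step has shrunk toward zero; RMSProp's has stabilised
print(round(ada_step, 4), round(rms_step, 4))  # → 0.0158 0.5
```

After 1,000 steps AdaGrad's effective step has collapsed by a factor of $\sqrt{1000} \approx 32$ and keeps shrinking, while RMSProp's has converged to a stable value — exactly the stall-vs-no-stall contrast described above.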
Adam (the default)
Adam (Adaptive Moment Estimation) combines momentum with RMSProp-style adaptive learning rates. With gradient $g_t$ at step $t$:

$$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$
$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t} \qquad \hat{v}_t = \frac{v_t}{1 - \beta_2^t} \qquad \theta \leftarrow \theta - \frac{\eta \, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Typical hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
Adam is the de facto standard optimiser in modern deep learning. It works well out of the box on a very wide range of problems.
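The full update fits in a short function. A minimal sketch using the typical hyperparameter values, run on a toy quadratic of my own choosing (not from the note):

```python
import math

def adam_step(theta, m, v, grad, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update with bias correction (t counts from 1)."""
    m = b1 * m + (1 - b1) * grad           # first moment: momentum
    v = b2 * v + (1 - b2) * grad ** 2      # second moment: RMSProp-style
    m_hat = m / (1 - b1 ** t)              # bias correction: without it,
    v_hat = v / (1 - b2 ** t)              #   early m and v are biased toward 0
    return theta - eta * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimise f(theta) = theta**2: a toy problem, not from the note
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 10_001):
    theta, m, v = adam_step(theta, m, v, grad=2 * theta, t=t)
print(round(theta, 3))  # settles near the minimum at 0
```

Note the three state variables per parameter (value, first moment, second moment): Adam's memory cost is the price of its robustness.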
Comparison summary
| Optimiser | Gradient info | Learning rate | Main strength | Main cost |
|---|---|---|---|---|
| Vanilla GD | Current only | Fixed global $\eta$ | Simple; predictable | Slow, stuck in local minima |
| Momentum | Current + exp-weighted past | Fixed global $\eta$ | Faster, escapes shallow minima | One extra hyperparameter ($\gamma$) |
| AdaGrad | Current + all past squared | Per-parameter, monotonically shrinking | Auto-tunes per parameter | Learning eventually dies |
| RMSProp | Current + exp-weighted past squared | Per-parameter, bounded | Fixes AdaGrad’s decay | No momentum |
| Adam | Both (momentum + RMSProp) | Per-parameter, adaptive | Best of both; robust default | Most hyperparameters to tune |
When to use what
- Small dataset, want maximum accuracy: full-batch GD.
- Large dataset, need speed per step: mini-batch SGD (batch size 32–128).
- Non-convex loss with local minima: momentum or Adam.
- Default modern choice: Adam with mini-batch.
Related
- gradient descent — the base algorithm these variants build on
- learning-rate — the step size $\eta$ these methods (sometimes) adapt
Active Recall
What is the difference between an epoch and an iteration when training with mini-batch gradient descent?
An iteration is one parameter update (one batch processed). An epoch is one complete pass through the training set — so if there are $N$ samples and batch size $B$, one epoch contains $N/B$ iterations.
Write the momentum update rule and say what the decay parameter does.
$v_t = \gamma v_{t-1} + \eta \, \nabla_\theta J(\theta)$, then $\theta \leftarrow \theta - v_t$. $\gamma$ controls how much of the previous velocity is carried forward. $\gamma = 0$ forgets all past direction (vanilla GD); $\gamma$ close to 1 keeps most of it, giving a heavier "ball rolling downhill" behaviour. Typical value $\gamma = 0.9$.
Why does stochastic gradient descent sometimes escape local minima that vanilla gradient descent gets stuck in?
SGD’s gradient estimate is noisy — it’s based on a single sample (or small batch), not the full dataset. The noise adds random perturbations to each step, which can push the iterate over the rim of a shallow local basin into a different (potentially deeper) one. Vanilla GD uses the exact full-data gradient, which points straight into the nearest minimum with no randomness to break out of it.
For a dataset of 10,000 samples and batch size 100, how many iterations are in one epoch?
- Each iteration consumes 100 samples; $10{,}000 / 100 = 100$ iterations are needed to see the full dataset once.
What two ideas does Adam combine, and what does each contribute?
Adam combines momentum (tracks an exponentially-decayed average of past gradients, giving smooth, accelerated motion in consistent directions) with RMSProp-style adaptive learning rates (tracks an exponentially-decayed average of squared gradients, so parameters with historically large gradients get smaller effective step sizes). The result is per-parameter step sizes that auto-tune and benefit from momentum — a strong default optimiser.