Vanilla gradient descent works but is slow, memory-hungry, and gets stuck. The variants below each address a specific weakness, and Adam — which combines several of them — is the de facto default in modern deep learning.

The cost problem: sampling variants

Vanilla gradient descent computes the loss over the entire training set at every iteration:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \frac{1}{n} \sum_{i=1}^{n} L_i(\theta)$$

When $n$ is millions (or billions), one step can be minutes of compute. The fix is to estimate the gradient from a subset.

| Variant | Samples per step | Character |
| --- | --- | --- |
| Full-batch GD | All $n$ samples | Accurate gradient, expensive, smooth convergence path |
| Stochastic GD (SGD) | Single random sample | Cheap gradient (addresses the cost issue of full-batch GD), very noisy, bounces around the optimum |
| Mini-batch GD | Random subset of size $m$, with $1 < m < n$ | Best of both: reasonable cost, acceptable noise; the middle ground between full-batch and stochastic GD |

Note the extreme cases recover the other two: $m = 1$ is SGD, $m = n$ is full-batch GD. Common batch sizes are 32, 64, 128.
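
A minimal NumPy sketch of the three gradient estimates, using a toy linear least-squares loss as a stand-in (the loss, data shapes, and learning rate here are illustrative assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n samples from a linear model y = X w_true + noise (illustrative).
n, d = 1_000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad(w, idx):
    """Gradient of the mean squared error over the samples in idx."""
    Xb, yb = X[idx], y[idx]
    return 2.0 / len(idx) * Xb.T @ (Xb @ w - yb)

w = np.zeros(d)
eta = 0.1

full_idx = np.arange(n)                             # full-batch GD: all n samples
single_idx = rng.integers(n, size=1)                # SGD: one random sample
batch_idx = rng.choice(n, size=64, replace=False)   # mini-batch GD: m = 64

for idx in (full_idx, single_idx, batch_idx):
    step = w - eta * grad(w, idx)   # same update rule, different gradient estimate
    print(len(idx), np.round(step[:3], 3))
```

The update rule never changes; only the subset used to estimate the gradient does, which is what drives the cost/noise trade-off in the table above.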

Because mini-batch exists as a contrast, vanilla gradient descent is often called “full-batch gradient descent” — the name emphasises that it uses the full training set on every step. If you see “full-batch GD” in a paper or textbook, it means the same thing as plain gradient descent.

ASIDE — "Stochastic" in practice

Strictly, SGD picks one sample at a time. But “SGD” is often used loosely to mean mini-batch GD as well, especially in deep learning frameworks. When someone says “we trained with SGD”, they almost always mean mini-batch SGD with some batch size.

Epochs vs iterations

An iteration is one parameter update (one forward/backward pass on one batch). An epoch is one full pass through the training set. With 1,000 samples and batch size 50, one epoch = 20 iterations.
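
The arithmetic from the example, as a quick sketch (the epoch count is an illustrative assumption):

```python
n_samples = 1_000
batch_size = 50

iterations_per_epoch = n_samples // batch_size   # 20 parameter updates per full pass
n_epochs = 3
total_iterations = n_epochs * iterations_per_epoch   # 60 updates in total
print(iterations_per_epoch, total_iterations)
```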

Convergence behaviour

Full-batch GD moves steadily along a smooth trajectory toward the minimum — each step is informed by the entire dataset. Mini-batch / stochastic GD moves noisily, oscillating around its path because each step sees only a rough estimate of the true gradient. The noise has an upside: it can jump out of shallow local minima that would trap vanilla GD.

The local minima problem: momentum

Vanilla GD’s update only uses the current gradient — it has no memory of where it was heading. Narrow valleys cause it to oscillate across the steep walls; shallow basins trap it.

Momentum accumulates a velocity vector that blends the current gradient with the previous direction:

$$v_t = \beta \, v_{t-1} + \eta \, \nabla_\theta L(\theta_{t-1}), \qquad \theta_t = \theta_{t-1} - v_t$$

  • $\beta$ is the decay (typical value: 0.9). Past velocity gradually fades.
  • $v_0 = 0$: the velocity starts at zero, so the first step is an ordinary gradient step.
  • At $\beta = 0$, momentum reduces to vanilla GD.
  • At $\beta = 1$, there is no decay: past gradients accumulate indefinitely and behaviour becomes unstable.

Intuition: a ball rolling downhill carries momentum from previous steps, so small uphill bumps don’t stop it dead — it coasts over shallow local minima toward deeper ones. Steep consistent directions get amplified (past velocity reinforces current gradient); oscillating directions cancel out (signs flip between steps).

Put another way: if you are confident about where you are heading, you can take bigger steps and reach the solution faster, much as you drive faster on a road you know well than on an unfamiliar one.
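
A minimal sketch of the momentum update on the same kind of toy least-squares problem (the loss, data, and hyperparameter values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem (illustrative).
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad(w):
    return 2.0 / n * X.T @ (X @ w - y)

eta, beta = 0.01, 0.9      # learning rate and momentum decay (typical values)
w = np.zeros(d)
v = np.zeros(d)            # velocity starts at zero

for _ in range(100):
    v = beta * v + eta * grad(w)   # blend previous direction with current gradient
    w = w - v                      # step along the accumulated velocity
```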

The learning-rate problem: adaptive methods

Tuning a single global $\eta$ is brittle — different parameters may benefit from different step sizes. Adaptive optimisers adjust a per-parameter effective learning rate based on the history of gradients.

AdaGrad, RMSProp (sketch)

  • AdaGrad divides each parameter’s update by the square root of the accumulated squared gradients for that parameter. Directions that have moved a lot get slowed down; directions that haven’t get amplified. Downside: the accumulator grows forever, so learning eventually stalls.
  • RMSProp replaces the cumulative sum with an exponentially-decayed moving average, which prevents the slowdown (both updates are sketched below).
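
A minimal sketch of both per-parameter scalings on the same illustrative least-squares problem (data, decay rate, and learning rate are assumptions, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad(w):
    return 2.0 / n * X.T @ (X @ w - y)

eta, eps = 0.1, 1e-8

# AdaGrad: divide by the root of the *cumulative* sum of squared gradients.
w, s = np.zeros(d), np.zeros(d)
for _ in range(100):
    g = grad(w)
    s += g ** 2                         # grows forever -> effective step shrinks toward zero
    w -= eta * g / (np.sqrt(s) + eps)

# RMSProp: same idea, but an exponentially-decayed average instead of a sum.
rho = 0.9
w, s = np.zeros(d), np.zeros(d)
for _ in range(100):
    g = grad(w)
    s = rho * s + (1 - rho) * g ** 2    # recent gradients dominate; no permanent slowdown
    w -= eta * g / (np.sqrt(s) + eps)
```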

Adam (the default)

Adam (Adaptive Moment Estimation) combines momentum with RMSProp-style adaptive learning rates:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2$$
$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \qquad \theta_t = \theta_{t-1} - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Here $g_t$ is the gradient at step $t$, $m_t$ is the momentum-style first moment, $v_t$ is the RMSProp-style second moment, and the hatted terms correct the bias from initialising both at zero.

Typical hyperparameters: $\eta = 0.001$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.
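
A minimal NumPy sketch of the Adam update, transcribing the formulas above (the toy loss and data are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

def grad(w):
    return 2.0 / n * X.T @ (X @ w - y)

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
w = np.zeros(d)
m = np.zeros(d)   # first moment (momentum-style)
v = np.zeros(d)   # second moment (RMSProp-style)

for t in range(1, 1001):
    g = grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)        # bias correction for the zero initialisation
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
```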

Adam is the de facto standard optimiser in modern deep learning. It works well out of the box on a very wide range of problems.

Comparison summary

| Optimiser | Gradient info | Learning rate | Main strength | Main cost |
| --- | --- | --- | --- | --- |
| Vanilla GD | Current only | Fixed global $\eta$ | Simple; predictable | Slow, stuck in local minima |
| Momentum | Current + exp-weighted past | Fixed global $\eta$ | Faster, escapes shallow minima | One extra hyperparameter ($\beta$) |
| AdaGrad | Current + all past squared | Per-parameter, monotonically shrinking | Auto-tunes per parameter | Learning eventually dies |
| RMSProp | Current + exp-weighted past squared | Per-parameter, bounded | Fixes AdaGrad's decay | No momentum |
| Adam | Both (momentum + RMSProp) | Per-parameter, adaptive | Best of both; robust default | Most hyperparameters to tune |

When to use what

  • Small dataset, want maximum accuracy: full-batch GD.
  • Large dataset, need speed per step: mini-batch SGD (batch size 32–128).
  • Non-convex loss with local minima: momentum or Adam.
  • Default modern choice: Adam with mini-batch.
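
As a usage sketch for the default choice, this is roughly what "Adam with mini-batch" looks like in PyTorch (the model, data, and batch size are illustrative assumptions; the optimiser defaults match the hyperparameters listed above):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data and model (illustrative).
X = torch.randn(1_000, 5)
y = X @ torch.randn(5, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)   # mini-batches

model = nn.Linear(5, 1)
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)   # defaults: betas=(0.9, 0.999), eps=1e-8
loss_fn = nn.MSELoss()

for epoch in range(3):            # epochs: full passes over the data
    for xb, yb in loader:         # iterations: one parameter update per batch
        optimiser.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimiser.step()
```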

Active Recall