The knob that decides how big a step gradient descent takes. It is not learned from data — you set it, tune it, and often regret it.

Definition

In the gradient descent update

$$\theta \leftarrow \theta - \eta \, \nabla L(\theta)$$

the learning rate $\eta$ scales the gradient before it is subtracted from the parameters. It controls how far each step moves in the downhill direction.

$\eta$ is a hyperparameter — not a learnable parameter. The model doesn’t adjust it during training. You choose it (or use a schedule), typically by trial and error.
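
To make the role of $\eta$ concrete, here is a minimal sketch in NumPy; the quadratic loss, the toy measurements, and the constants are illustrative assumptions, not from the notes.

```python
import numpy as np

# Toy quadratic loss L(theta) = mean((theta - y)^2); its minimiser is mean(y).
# The data, starting point, and eta below are illustrative assumptions.
y = np.array([18.0, 20.0])           # toy measurements
theta = 0.0                          # initial parameter value
eta = 0.1                            # the learning rate hyperparameter

for _ in range(50):
    grad = np.mean(2 * (theta - y))  # dL/dtheta
    theta = theta - eta * grad       # eta scales the gradient before subtraction

print(theta)                         # ~19.0, the mean of y
```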

Three regimes

| Regime | Symptom | What happens |
| --- | --- | --- |
| Too small | Loss decreases smoothly but very slowly | Training takes forever; may not converge within the iteration budget |
| Too large | Loss oscillates or diverges | Each step overshoots the minimum; can bounce further away every iteration |
| Just right | Loss decreases quickly and stabilises at a low value | Fast convergence to a good solution |

The “just right” value is found by trial and error. Typical starting points in practice: $\eta = 0.1$, $0.01$, or $0.001$ — not $1$, which is only used in toy examples because it makes the arithmetic clean.
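
The three regimes are easy to reproduce on the toy quadratic from the sketch above (an assumed stand-in; the exact divergence threshold depends on the problem’s curvature):

```python
def run(eta, steps=100, theta=0.0):
    """Gradient descent on L(theta) = mean((theta - y)^2) with y = [18, 20]."""
    for _ in range(steps):
        theta -= eta * (2 * theta - 38)    # gradient of the toy quadratic
    return theta

for eta in (0.001, 0.1, 1.1):              # too small / just right / too large
    print(f"eta={eta}: theta after 100 steps = {run(eta):.3f}")
```

With $\eta = 0.001$ the iterate is still far from 19 after 100 steps; with $\eta = 0.1$ it has converged; with $\eta = 1.1$ it has blown up to a huge value.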

Worked example: oscillation at a kink

In the week 2 1D example with $\eta = 1$ and two measurements, gradient descent goes $19.5 \to 18.5 \to 19.5 \to \cdots$ and never reaches the true optimum $\theta^\star = 19$. The step size of 1 is larger than the distance from 19.5 (or 18.5) to 19, so every step straddles the minimum.

Reducing $\eta$ below 1 lets the iterates inch in closer rather than leap across.
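
The oscillation is easy to reproduce with a simplified one-measurement stand-in for the week 2 loss, assuming the form $L(\theta) = |\theta - 19|$, whose subgradient is $\pm 1$ on either side of the kink:

```python
def subgrad(theta):
    return 1.0 if theta > 19 else -1.0    # subgradient of |theta - 19|

for eta in (1.0, 0.1):
    theta = 19.5
    for _ in range(6):
        theta -= eta * subgrad(theta)
    print(f"eta={eta}: theta after 6 steps = {theta:.2f}")
```

With $\eta = 1$ every step has length 1, so the iterate is back where it started after every two steps; with $\eta = 0.1$ it settles within 0.1 of the optimum.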

Why not just learn it?

You can’t easily learn $\eta$ via gradient descent itself — the gradient of the loss with respect to $\eta$ doesn’t give you a useful signal about whether the learning rate is well-tuned for the problem. In practice people use:

  • Manual tuning — try a few orders of magnitude and pick the best.
  • Learning rate schedules — start high, decay over time (e.g. step decay). See below.
  • Adaptive optimisers — Adam, RMSProp, and AdaGrad effectively assign a per-parameter learning rate that adapts based on the history of gradients (see gradient-descent-variants); a minimal sketch follows this list.
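
As a flavour of the adaptive approach, here is a minimal AdaGrad-style update (a sketch; the function and variable names are illustrative, not a library API):

```python
import numpy as np

def adagrad_step(theta, grad, accum, eta=0.1, eps=1e-8):
    """One AdaGrad update: the effective per-parameter learning rate is
    eta / sqrt(sum of squared past gradients)."""
    accum = accum + grad ** 2                          # per-parameter history
    theta = theta - eta / (np.sqrt(accum) + eps) * grad
    return theta, accum

# Usage: parameters with large gradients get smaller effective steps.
theta, accum = np.zeros(2), np.zeros(2)
theta, accum = adagrad_step(theta, np.array([10.0, 0.1]), accum)
print(theta)   # both coordinates move ~eta despite very different gradients
```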

Learning rate schedules

A constant $\eta$ is rarely optimal: early in training, when the parameters are far from the minimum, large steps make rapid progress; late in training, near the minimum, large steps overshoot and oscillate. A schedule changes $\eta$ over time, ideally large early and small late.

The gradient descent update with a schedule:

$$\theta_{t+1} = \theta_t - \eta_t \, \nabla L(\theta_t)$$

where $\eta_t$ is now a function of the iteration $t$ or the epoch.
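
In code the only change is that $\eta$ is looked up each iteration; `grad_fn` and `schedule` are hypothetical stand-ins here:

```python
def train(theta, grad_fn, schedule, steps=100):
    """Gradient descent where the learning rate depends on the step index."""
    for t in range(steps):
        eta_t = schedule(t)                    # eta is now a function of t
        theta = theta - eta_t * grad_fn(theta)
    return theta
```

Any of the schedules listed below can be passed in as `schedule`.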

Common schedules

  • Step decay. Reduce $\eta$ by a factor (typically 10) every fixed number of epochs: $\eta_0$ for epochs 1–30, then $\eta_0/10$ for epochs 30–60, then $\eta_0/100$, etc. Simple, common, surprisingly effective.
  • Exponential decay. $\eta_t = \eta_0 e^{-kt}$ with $k > 0$. Smooth, monotonic decrease.
  • Reduce-on-plateau. Watch the validation loss; whenever it stops improving for a fixed number of epochs (the patience), drop $\eta$ by a factor. The schedule adapts to what the network is actually doing instead of following a fixed timetable.
  • Cosine annealing. $\eta_t$ follows a half-cosine from $\eta_0$ down to (near) zero across the training run. Smooth and often slightly better than step decay in practice.
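
The three fixed schedules are one-liners; this is a sketch where $\eta_0$ and the constants are assumed values. Reduce-on-plateau is not a pure function of $t$, since it reacts to the validation loss; a sketch of it appears under the second tip below.

```python
import math

eta0 = 0.1                                   # assumed base learning rate

def step_decay(t, drop=10, every=30):
    """Divide eta0 by `drop` every `every` epochs."""
    return eta0 / drop ** (t // every)

def exp_decay(t, k=0.05):
    """eta_t = eta0 * exp(-k t): smooth, monotonic decrease."""
    return eta0 * math.exp(-k * t)

def cosine_annealing(t, T=100):
    """Half-cosine from eta0 at t=0 down to zero at t=T."""
    return eta0 * 0.5 * (1 + math.cos(math.pi * t / T))
```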

The intuition

Picture the loss landscape as a valley with the minimum somewhere in the middle. With a large $\eta$ you take big strides — useful when you’re far from the bottom and need to traverse the slopes quickly. As you approach the minimum, big strides keep overshooting; you need to tiptoe into the centre. The schedule encodes this physical intuition: stride first, tiptoe later.

TIP — The helicopter-on-helipad picture

Imagine landing a helicopter on a small helipad in a dark valley. Move too fast (high $\eta$) and you overshoot the pad on every approach, bouncing around without settling. Move too slow (low $\eta$) and you’re precise but take an eternity — or worse, you settle into a small pothole on the way down (a local minimum) because you don’t have the speed to climb out. A schedule gives you both: full speed across the open valley, throttle back as the pad gets close.

TIP — When loss plateaus, drop the learning rate

A common training-curve diagnostic: validation loss decreases for many epochs, then plateaus. If you keep training at the same $\eta$, nothing changes — the optimiser is bouncing around the minimum, never settling. Dropping $\eta$ by 10× often produces a sudden additional drop in loss, then another plateau. This is the basis of the reduce-on-plateau schedule described above.
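
A minimal version of that rule, with illustrative names and defaults (deep learning frameworks ship ready-made versions, e.g. PyTorch’s ReduceLROnPlateau):

```python
class ReduceOnPlateau:
    """Drop eta by `factor` when the validation loss stalls for `patience` epochs."""

    def __init__(self, eta, factor=10.0, patience=5):
        self.eta, self.factor, self.patience = eta, factor, patience
        self.best = float("inf")
        self.stale = 0                       # epochs since the last improvement

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.stale = val_loss, 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.eta /= self.factor      # the 10x drop from the tip
                self.stale = 0
        return self.eta
```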

Active Recall