The knob that decides how big a step gradient descent takes. It is not learned from data — you set it, tune it, and often regret it.
## Definition
In the gradient descent update

$$
\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)
$$

the learning rate $\eta$ scales the gradient before it is subtracted from the parameters. It controls how far each step moves in the downhill direction.
$\eta$ is a hyperparameter — not a learnable parameter. The model doesn’t adjust it during training. You choose it (or use a schedule), typically by trial and error.
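As a concrete sketch of the update rule (the quadratic loss and all the numbers here are illustrative, not from the note):

```python
def gradient_descent(grad, theta0, eta, steps):
    """Plain gradient descent: theta <- theta - eta * grad(theta).
    eta is fixed throughout -- we choose it; the model never learns it."""
    theta = theta0
    for _ in range(steps):
        theta = theta - eta * grad(theta)
    return theta

# Minimise f(theta) = (theta - 3)**2; its gradient is 2 * (theta - 3).
theta = gradient_descent(lambda t: 2 * (t - 3), theta0=0.0, eta=0.1, steps=100)
print(theta)  # very close to the minimum at 3
```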
## Three regimes
| Regime | Symptom | What happens |
|---|---|---|
| Too small | Loss decreases smoothly but very slowly | Training takes forever; may not converge within the iteration budget |
| Too large | Loss oscillates or diverges | Each step overshoots the minimum; can bounce further away every iteration |
| Just right | Loss decreases quickly and stabilises at a low value | Fast convergence to a good solution |
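The three regimes are easy to reproduce on a 1D toy problem. A sketch on $f(\theta) = \theta^2$ (the specific $\eta$ values and thresholds are chosen for this quadratic, not general recommendations):

```python
def run(eta, steps=50, theta0=1.0):
    """Gradient descent on f(theta) = theta**2 (gradient 2*theta);
    each step multiplies theta by (1 - 2*eta)."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * 2 * theta
    return abs(theta)

print(run(0.001))  # too small: after 50 steps, still close to the start of 1.0
print(run(0.4))    # just right: essentially at the minimum 0
print(run(1.1))    # too large: each step overshoots further -- divergence
```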
The “just right” value is found by trial and error. Typical starting points in practice: $\eta = 0.1$, $0.01$, or $0.001$ — not $\eta = 1$, which is only used in toy examples because it makes the arithmetic clean.
## Worked example: oscillation at a kink
In the week 2 1D example, with $\eta = 1$ gradient descent goes $19.5 \to 18.5 \to 19.5 \to \dots$ and never reaches the true optimum $\theta^* = 19$. The step size of 1 is larger than the distance from 19.5 (or 18.5) to 19, so every step straddles the minimum.
Reducing $\eta$ lets the iterates inch in closer rather than leap across.
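The oscillation can be reproduced with a sketch of the kinked loss. The exact week 2 setup isn't restated here; this assumes an absolute-error loss $|\theta - 19|$, whose (sub)gradient has magnitude 1 on both sides of the kink, so with $\eta = 1$ every step has length 1:

```python
def grad_abs(theta, y=19.0):
    """(Sub)gradient of |theta - y|: +1 above y, -1 below (and at) y."""
    return 1.0 if theta > y else -1.0

def iterate(theta, eta, steps=6):
    """Record the gradient-descent trajectory on the kinked loss."""
    traj = [theta]
    for _ in range(steps):
        theta -= eta * grad_abs(theta)
        traj.append(round(theta, 2))
    return traj

print(iterate(19.5, eta=1.0))  # bounces 19.5 <-> 18.5, straddling 19 forever
print(iterate(19.5, eta=0.1))  # inches down and stays within 0.1 of 19
```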
## Why not just learn it?
You can’t easily learn $\eta$ via gradient descent itself — the gradient of the loss with respect to $\eta$ doesn’t give you a useful signal about whether the learning rate is well-tuned for the problem. In practice people use:
- Manual tuning — try a few orders of magnitude and pick the best.
- Learning rate schedules — start high, decay over time (e.g. divide by 10 every 30 epochs). See below.
- Adaptive optimisers — Adam, RMSProp, and AdaGrad effectively assign a per-parameter learning rate that adapts based on the history of gradients (see gradient-descent-variants).
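The manual-tuning option can be mechanised as a coarse sweep over orders of magnitude. A sketch on a toy quadratic (the loss function and the candidate values are illustrative):

```python
def loss_after_training(eta, steps=100):
    """Run gradient descent on f(theta) = (theta - 3)**2, return final loss."""
    theta = 0.0
    for _ in range(steps):
        theta -= eta * 2 * (theta - 3)
        if abs(theta) > 1e6:        # clearly diverging: give up early
            return float("inf")
    return (theta - 3) ** 2

candidates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]
for eta in candidates:
    print(eta, loss_after_training(eta))
best = min(candidates, key=loss_after_training)
print("best:", best)  # 0.1 wins on this toy problem
```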
## Learning rate schedules
A constant $\eta$ is rarely optimal: early in training, when the parameters are far from the minimum, large steps make rapid progress; late in training, near the minimum, large steps overshoot and oscillate. A schedule changes $\eta$ over time, ideally large early and small late.
The gradient descent update with a schedule:

$$
\theta_{t+1} = \theta_t - \eta_t \, \nabla L(\theta_t)
$$

where $\eta_t$ is now a function of the iteration or epoch.
### Common schedules
- Step decay. Reduce $\eta$ by a factor (typically 10) every fixed number of epochs: $\eta_0$ for epochs 1–30, then $\eta_0/10$ for epochs 31–60, then $\eta_0/100$, etc. Simple, common, surprisingly effective.
- Exponential decay. $\eta_t = \eta_0 e^{-kt}$ with $k > 0$. Smooth, monotonic decrease.
- Reduce-on-plateau. Watch the validation loss; whenever it stops improving for a fixed number of epochs (the patience), drop $\eta$ by a factor. The schedule adapts to what the network is actually doing instead of following a fixed timetable.
- Cosine annealing. $\eta_t$ follows a half-cosine from $\eta_0$ down to (near) zero across the training run. Smooth and often slightly better than step decay in practice.
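The three fixed-timetable schedules above can be written as functions of the epoch. A sketch, where `ETA0` and the decay constants are illustrative choices, not canonical values:

```python
import math

ETA0 = 0.1  # illustrative initial learning rate

def step_decay(epoch, drop=10, every=30):
    """Divide the rate by `drop` every `every` epochs."""
    return ETA0 / drop ** (epoch // every)

def exponential_decay(epoch, k=0.05):
    """eta_t = eta_0 * exp(-k * t), k > 0: smooth monotonic decrease."""
    return ETA0 * math.exp(-k * epoch)

def cosine_annealing(epoch, total_epochs=90):
    """Half-cosine from eta_0 down to zero across the run."""
    return ETA0 * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))

for epoch in (0, 30, 60, 89):
    print(epoch, step_decay(epoch),
          round(exponential_decay(epoch), 5),
          round(cosine_annealing(epoch), 5))
```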
## The intuition
Picture the loss landscape as a valley with the minimum somewhere in the middle. With a large $\eta$ you take big strides — useful when you’re far from the bottom and need to traverse the slopes quickly. As you approach the minimum, big strides keep overshooting; you need to tiptoe into the centre. The schedule encodes this physical intuition: stride first, tiptoe later.
TIP — The helicopter-on-helipad picture
Imagine landing a helicopter on a small helipad in a dark valley. Move too fast (high $\eta$) and you overshoot the pad on every approach, bouncing around without settling. Move too slow (low $\eta$) and you’re precise but take an eternity — or worse, you settle into a small pothole on the way down (a local minimum) because you don’t have the speed to climb out. A schedule gives you both: full speed across the open valley, throttle back as the pad gets close.
TIP — When loss plateaus, drop the learning rate
A common training-curve diagnostic: validation loss decreases for many epochs, then plateaus. If you keep training at the same $\eta$, nothing changes — the optimiser is bouncing around the minimum, never settling. Dropping $\eta$ by 10× often produces a sudden additional drop in loss, then another plateau. This is reduce-on-plateau in action; it’s the basis of the schedule of the same name.
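The tip can be turned into a small controller, in the spirit of (but much simpler than) library implementations such as PyTorch's `ReduceLROnPlateau`; the `patience` and `factor` values here are illustrative:

```python
class ReduceOnPlateau:
    """Drop eta by `factor` once the monitored loss has failed to
    improve for `patience` consecutive epochs."""

    def __init__(self, eta=0.1, factor=10.0, patience=3):
        self.eta = eta
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:       # improvement: reset the patience window
            self.best = val_loss
            self.bad_epochs = 0
        else:                          # plateau: count a bad epoch
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.eta /= self.factor
                self.bad_epochs = 0
        return self.eta

sched = ReduceOnPlateau(eta=0.1, patience=3)
# Loss improves for three epochs, then goes flat: eta drops once the
# patience window expires.
for loss in [1.0, 0.5, 0.4, 0.4, 0.4, 0.4]:
    print(sched.step(loss))
```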
## Related
- gradient descent — the algorithm where $\eta$ lives
- gradient-descent-variants — variants that adapt the learning rate automatically
## Active Recall
What are the three qualitative regimes for $\eta$ and the symptom of each?
Too small: very slow convergence (loss decreases but barely). Too large: oscillation or divergence (loss bounces around or grows). Just right: fast smooth decrease to a low loss. Found by trial and error.
Why can’t you just learn $\eta$ the way you learn weights?
$\eta$ is a meta-level control on the learning process, not a parameter of the model. The gradient of the loss with respect to $\eta$ doesn’t cleanly tell you whether the step size is appropriate for the current curvature of the landscape. Instead we tune it manually, use a schedule, or let an adaptive optimiser (Adam, RMSProp) effectively adjust a per-parameter step size.
A training loss curve decreases for the first 5 epochs, then oscillates wildly without improving. What's a likely cause and a fix?
The learning rate is too large — once near the minimum, each step overshoots. A fix is a learning rate schedule: decay $\eta$ over time (e.g. divide by 10 every few epochs), which gives fast progress early and fine-grained convergence later.
A network's validation loss has been flat for 10 epochs. What does a "reduce-on-plateau" schedule do, and why does that often help?
When the validation loss stops improving for a fixed number of epochs (the patience window), reduce-on-plateau drops the learning rate by a factor (typically 10). The intuition: the optimiser is making large steps that overshoot the minimum, bouncing around without settling. A smaller step size lets it actually descend the remaining distance into the basin. After dropping $\eta$, you typically see another sharp loss decrease followed by a new plateau — repeat until further drops stop helping. This is more responsive than a fixed step-decay schedule because the schedule reacts to what the network is actually doing.