The workhorse algorithm of machine learning: start with random parameters, compute the slope of the loss at that point, take a small step downhill, repeat.

Motivation: measuring badness isn’t enough

Week 1 gave us a loss-function — a way to score how bad the current parameters are. But telling a model how badly it’s doing is not the same as telling it what to change to do better. We need a mechanism that turns the loss measurement into a prescription for updating each parameter. Gradient descent is that mechanism.

Formally, gradient descent is a local search technique for minimising a cost function. “Local” because at any point in parameter space it only looks at the slope right there and takes one small step — it has no global map of the landscape.

The loss is typically a sum or average over all training samples (e.g. L(θ) = (1/n) Σᵢ ℓᵢ(θ), one term ℓᵢ per sample). So each gradient step doesn’t just improve performance on one example — it pushes the parameters toward better performance across the whole training set at once. Every sample contributes to every update. (This is also why vanilla GD is expensive for large datasets; stochastic and mini-batch variants estimate the gradient from a subset instead — see gradient-descent-variants.)

The core equation

You should know this by heart:

  θ ← θ − α ∇L(θ)

  • θ — the current parameter vector (e.g. the weights and bias of the model)
  • ∇L(θ) — the gradient of the loss-function at θ
  • α — the learning-rate (a hyperparameter)
  • Minus sign — the gradient points uphill, so we negate it to walk downhill

The algorithm is:

  1. Initialise θ randomly.
  2. Compute ∇L(θ).
  3. Update θ ← θ − α ∇L(θ).
  4. Repeat until convergence or a max iteration count.
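The loop above, as a minimal Python sketch (the quadratic loss at the bottom is illustrative — any function returning ∇L(θ) can be plugged in as `grad`):

```python
import numpy as np

def gradient_descent(grad, theta0, alpha=0.1, max_iters=1000, tol=1e-8):
    """Vanilla gradient descent: theta <- theta - alpha * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iters):
        step = alpha * grad(theta)
        theta = theta - step
        if np.linalg.norm(step) < tol:   # stop once the steps become tiny
            break
    return theta

# Illustrative loss L(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3).
theta_star = gradient_descent(lambda t: 2 * (t - 3), theta0=[0.0, 10.0])
# theta_star ends up close to the minimiser (3, 3)
```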

Why it works: the slope tells you which way is down

The gradient is a vector of partial derivatives — one component per parameter:

  ∇L(θ) = ( ∂L/∂θ₁, ∂L/∂θ₂, …, ∂L/∂θₙ )

From calculus, the gradient points in the direction of steepest increase. Negating it gives the direction of steepest decrease — which is exactly what minimising the loss requires. The learning rate controls how far along that direction to step.

TIP — The blindfolded hiker analogy

Imagine someone blindfolded on a hillside who wants to reach the valley. They can’t see, but they can feel the slope under their feet. They step in whichever direction goes downhill, and repeat. Gradient descent is exactly this — the loss landscape is the hill, the gradient is the slope they feel, and α is the length of each step.

The analogy also explains why the loss must be smooth: a ball can only roll down a smooth hill. If the surface is kinked or discontinuous (as with the sign function), there’s no well-defined slope to follow — the ball gets stuck or stops at flat steps. Smoothness is what makes the landscape rollable.

The 1D picture

In one dimension the algorithm collapses to a simple rule. Compute the slope dL/dθ at your current point:

  • Slope positive → function is increasing → step left (decrease θ).
  • Slope negative → function is decreasing → step right (increase θ).

The update θ ← θ − α · dL/dθ encodes both cases in one expression: the minus sign automatically inverts the direction of the slope.

Worked example: 1D with absolute error

Take the loss-function L(θ) = Σᵢ |θ − xᵢ| over three measurements x₁, x₂, x₃. Starting from θ₀ = 24.5 with α = 1:

  t   θ_t    θ_{t+1}
  0   24.5   21.5
  1   21.5   20.5
  2   20.5   19.5
  3   19.5   18.5
  4   18.5   19.5

The true optimum is the median of the measurements (for absolute error). Notice the last two rows — the estimate gets stuck oscillating between 18.5 and 19.5. This is exactly the symptom of a learning rate that’s too large for the neighbourhood of the optimum.
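A sketch of the example (the measurement values are not fixed above, so the ones here — x = 14, 19, 22, median 19 — are hypothetical, chosen to reproduce the same trajectory):

```python
import numpy as np

xs = np.array([14.0, 19.0, 22.0])   # hypothetical measurements, median 19

def grad(theta):
    # d/dtheta of sum_i |theta - x_i| is sum_i sign(theta - x_i)
    return np.sum(np.sign(theta - xs))

theta, alpha = 24.5, 1.0
trajectory = [theta]
for _ in range(6):
    theta = theta - alpha * grad(theta)
    trajectory.append(float(theta))

print(trajectory)  # [24.5, 21.5, 20.5, 19.5, 18.5, 19.5, 18.5] — oscillates
```

The last entries bounce between 18.5 and 19.5 forever: the fixed-magnitude gradient of absolute error never lets the step size shrink.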

Why the gradient?

Of all the directions you could step in, why is the gradient the one to follow? Because multivariable calculus gives you a theorem for free: the gradient ∇L(θ) points in the direction of steepest ascent of L, and its length is the rate of change in that steepest direction. No other unit direction increases the function faster at that point. By symmetry, −∇L(θ) is the direction of steepest descent — the single best choice for minimisation.

Concretely, the gradient is just a vector of partial derivatives:

  ∇L(θ) = ( ∂L/∂θ₁, …, ∂L/∂θₙ )

Each partial derivative ∂L/∂θⱼ answers a simple question: how much does the loss change if I nudge θⱼ a tiny bit, holding everything else fixed? Stacking all these local sensitivities into one vector gives you the gradient.
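That "nudge one parameter, hold the rest fixed" reading can be made concrete with a finite-difference estimate (a sketch only — not how frameworks actually compute gradients; the two-parameter loss is illustrative):

```python
import numpy as np

def numerical_gradient(L, theta, eps=1e-6):
    """Estimate each dL/dtheta_j by nudging one parameter at a time."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for j in range(theta.size):
        bump = np.zeros_like(theta)
        bump[j] = eps
        # central difference: everything except theta_j is held fixed
        g[j] = (L(theta + bump) - L(theta - bump)) / (2 * eps)
    return g

# Example: L(a, b) = a^2 + 10*b^2 — the loss is far more sensitive to b.
g = numerical_gradient(lambda t: t[0]**2 + 10 * t[1]**2, [1.0, 1.0])
print(g)  # approximately [2., 20.]
```

The second component comes out ten times larger, which is exactly the "relative magnitude" signal discussed below.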

A non-spatial way to read the gradient

Picturing the gradient as an arrow in a high-dimensional parameter space is hard. A more useful reading — especially for networks with millions of parameters — is to think of ∇L(θ) as a list of per-parameter instructions. Each component ∂L/∂θⱼ tells you two things:

  1. Sign — should θⱼ be nudged up or down?
  2. Relative magnitude — how much does this particular parameter matter? A component with a large magnitude means that parameter is currently pulling a lot of weight on the loss; nudging it will change the loss a lot. A component close to zero means the parameter barely matters right now.

So ∇L(θ) is a prioritised to-do list for the parameters: tweak these weights in these directions, and give more push to the ones that will move the loss the most.

Bang for your buck. In a network with 13,002 weights and biases (say), the gradient encodes the relative importance of each one at the current point in training. Changing one weight might barely move the loss; changing another might move it a lot. The gradient tells you which is which, automatically.

TIP — Self-correcting step sizes near a smooth minimum

For smooth losses (e.g. squared error, cross-entropy), the gradient magnitude naturally shrinks as you approach the minimum — the slope flattens out. Because the update step is −α ∇L(θ), the step size shrinks with it, slowing the optimiser automatically and helping it settle without overshooting.

This fails for non-smooth losses like absolute error: the gradient magnitude stays fixed (it’s ±1 per sample on each side of a kink) right up to the minimum, so the step size doesn’t shrink and you oscillate — exactly what happens in the worked example above. Smooth, convex losses are forgiving; kinked losses are not.
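To see the contrast numerically, compare the step sizes α·|L′(θ)| for a smooth (squared) and a kinked (absolute) one-sample loss with the same minimum (illustrative numbers):

```python
# One-sample losses with minimum at theta = 19:
#   squared:  L(t) = (t - 19)^2,  L'(t) = 2*(t - 19)   -> gradient shrinks near 19
#   absolute: L(t) = |t - 19|,    L'(t) = sign(t - 19) -> gradient stays +/-1
alpha = 0.1

def run(grad, theta, n=5):
    """Run n gradient steps, recording the size of each step taken."""
    steps = []
    for _ in range(n):
        step = alpha * grad(theta)
        theta -= step
        steps.append(abs(step))
    return steps

sq_steps = run(lambda t: 2 * (t - 19), 24.5)
abs_steps = run(lambda t: (t > 19) - (t < 19), 24.5)

print(sq_steps)   # shrinking: ~1.1, ~0.88, ~0.70, ...
print(abs_steps)  # constant 0.1 every step -> never settles exactly
```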

Worked example: 2D regression on penguin counts

The 1D toy example uses a single parameter, but real models have many. Here’s the same algorithm applied to a two-parameter linear regression — a much closer analogue of how it’s used in practice.

Three penguin counts taken on an island: 332 in 2010, 353 in 2012, 419 in 2016. Re-express the year as “years since 2010”, so the data is (x, y) ∈ {(0, 332), (2, 353), (6, 419)}.

Fit a linear model ŷ = a·x + b with squared or absolute error loss; here we use absolute error. Initialise a = 10, b = 350, learning rate α = 0.1.

Forward (loss):

  Sample   x   y     ŷ = 10x + 350   |ŷ − y|
  1        0   332   350             18
  2        2   353   370             17
  3        6   419   410             9

Total loss L = 18 + 17 + 9 = 44.

Gradient. For absolute error ℓᵢ = |ŷᵢ − yᵢ|, ∂ℓᵢ/∂a = sign(ŷᵢ − yᵢ) · xᵢ and ∂ℓᵢ/∂b = sign(ŷᵢ − yᵢ). Sum over samples (the residual signs are +1, +1, −1):

  ∂L/∂a = (+1)(0) + (+1)(2) + (−1)(6) = −4
  ∂L/∂b = (+1) + (+1) + (−1) = +1

Update. With α = 0.1:

  a ← a − α · ∂L/∂a = 10 − 0.1·(−4) = 10.4
  b ← b − α · ∂L/∂b = 350 − 0.1·(+1) = 349.9

New loss. With a = 10.4, b = 349.9: predictions become 349.9, 370.7, 412.3; absolute errors 17.9, 17.7, 6.7; new total L = 42.3.

The loss dropped from 44 to 42.3 — one gradient step moved the parameters in a direction that improved the fit. Notice how each sample contributes a separate signed term to each gradient component, and the sample with the largest negative residual (sample 3, predicted too low) pulls a upward most strongly. That’s the per-sample structure of ∇L in action.
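The whole step can be checked numerically (same data; a = 10, b = 350 and α = 0.1 as in the walkthrough):

```python
import numpy as np

x = np.array([0.0, 2.0, 6.0])          # years since 2010
y = np.array([332.0, 353.0, 419.0])    # penguin counts

a, b, alpha = 10.0, 350.0, 0.1

def loss(a, b):
    return np.sum(np.abs(a * x + b - y))

# Per-sample gradient of |y_hat - y|: sign(y_hat - y) * x for a, sign(y_hat - y) for b
residual_sign = np.sign(a * x + b - y)
grad_a = np.sum(residual_sign * x)     # (+1)(0) + (+1)(2) + (-1)(6) = -4
grad_b = np.sum(residual_sign)         # (+1) + (+1) + (-1)        = +1

print(loss(a, b))                      # 44.0
a, b = a - alpha * grad_a, b - alpha * grad_b
print(a, b)                            # 10.4 349.9
print(loss(a, b))                      # 42.3 (approximately)
```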

Extending beyond 1D

The algorithm is dimension-agnostic. For linear regression with m inputs x₁, …, x_m, the parameter vector is θ = (w₁, …, w_m, b), and the gradient has m + 1 components. The update equation is the same — it just applies element-wise:

  θⱼ ← θⱼ − α · ∂L/∂θⱼ   for every j
This is why gradient descent scales to networks with billions of parameters: the algorithm itself doesn’t care whether you’re tuning one parameter or one billion.

ASIDE — Computing the gradient efficiently is a separate problem

The update equation assumes you can compute ∇L(θ). For a single neuron this is a direct chain-rule calculation. For a deep network with billions of parameters, doing it naively is intractable. The algorithm that makes this tractable — exploiting the network’s layered structure to compute all partial derivatives in a single backwards pass — is backpropagation, covered in week 3. Gradient descent answers “what do we do with ∇L(θ)?”; backpropagation answers “how do we compute ∇L(θ) in the first place?”.

The differentiability requirement

Gradient descent is powerless if ∇L(θ) = 0 everywhere, because then the update term vanishes and nothing changes. This is what breaks the vanilla perceptron classifier: the sign function is flat (derivative zero) almost everywhere, so ∇L(θ) collapses to the zero vector and learning cannot happen.

The rule: your loss function must be differentiable (or at least have a meaningful, mostly-nonzero gradient). For classification this motivates the switch from sign to the sigmoid function, and correspondingly from squared error to binary-cross-entropy.

Put plainly: the derivative is the algorithm’s sense of direction. No derivative → no direction → no learning. This is why modern networks use differentiable activations (sigmoid, tanh, ReLU) and differentiable losses (squared error, cross-entropy) — anything else leaves gradient descent navigating in the dark.
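A tiny numerical check of the point (a hypothetical single-input neuron with weight w, finite-difference gradients; all numbers illustrative):

```python
import math

def finite_diff(f, w, eps=1e-4):
    """Central-difference estimate of df/dw at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

x, target, w = 2.0, 1.0, -0.3

# Sign activation: flat almost everywhere -> zero gradient -> no learning signal
sign_loss = lambda w: (math.copysign(1.0, w * x) - target) ** 2
# Sigmoid activation: smooth -> nonzero gradient everywhere
sigmoid_loss = lambda w: (1 / (1 + math.exp(-w * x)) - target) ** 2

print(finite_diff(sign_loss, w))     # 0.0 — gradient descent is stuck
print(finite_diff(sigmoid_loss, w))  # negative — a descent step increases w
```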

Problems with vanilla gradient descent

  • Expensive. Computing ∇L(θ) sums over all n training samples every iteration — prohibitive when n is large. Mitigated by stochastic / mini-batch variants (see gradient-descent-variants).
  • Local minima. The algorithm follows the slope downhill from wherever you started, which might be into a local basin that isn’t the global optimum. In high dimensions we have no way of knowing whether we’ve found the global minimum.
  • Learning rate sensitivity. Too small → painfully slow; too large → oscillation or divergence. See learning-rate.
  • Ravines and saddle points. Vanilla GD oscillates across narrow valleys and stalls on flat plateaus. Momentum and Adam (see gradient-descent-variants) address this.

When to stop

Two common criteria:

  1. Convergence — loss stops improving over successive iterations.
  2. Max iterations — a hyperparameter cap (e.g. 100 epochs). Used in practice because complex models may never fully converge in a reasonable time.
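Both criteria combined in one loop (a sketch; the quadratic gradient and the tolerance value are illustrative):

```python
def minimise(grad, theta, alpha=0.1, tol=1e-6, max_iters=10_000):
    """Stop on convergence (here: the step has become tiny) or at the
    iteration cap, whichever comes first."""
    for i in range(max_iters):
        step = alpha * grad(theta)
        if abs(step) < tol:                 # criterion 1: convergence
            return theta, i, "converged"
        theta -= step
    return theta, max_iters, "max_iters"    # criterion 2: iteration cap

# Illustrative 1D loss L(t) = (t - 5)^2 with gradient 2*(t - 5).
theta, iters, reason = minimise(lambda t: 2 * (t - 5.0), theta=0.0)
print(round(theta, 4), reason)   # 5.0 converged
```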

Active Recall