THE CRUX: How do we train a network that has more than one layer — and once we can, how do we stop it from just memorising the training data?

Stacking perceptrons into layers unlocks non-linear decision boundaries but makes gradient computation vastly harder. The trick is to represent the network as a computation graph and compute all gradients in a single backward pass — that’s backpropagation, and it rests entirely on the chain rule. Once training works, a new problem appears: powerful models tend to memorise rather than generalise, so we need explicit tools (data splits, early stopping, regularisation) to stop them.

Where we left off

After week 2 we had a trained single perceptron: a loss function derived via maximum likelihood, gradient descent to minimise it, and a sigmoid activation that makes the classification case differentiable. Good for linearly separable data, good for regression with a linear trend.

But take one look at non-linearly-separable data — XOR, concentric circles, spirals — and the single perceptron fails. No straight line exists that separates the classes. The model’s representational capacity is simply too low.

Stacking perceptrons: the MLP

The fix is to combine multiple perceptrons into layers, and stack multiple layers. Each layer’s outputs feed the next layer’s inputs, and the composition can represent wildly non-linear decision surfaces. This is the multi-layer perceptron (MLP) — also called a feed-forward neural network, because information flows in one direction, input to output.

The structure:

  • Input layer — just a placeholder for the features, no computation.
  • Hidden layers — one or more layers of sigmoid (or other smooth) perceptrons.
  • Output layer — one neuron for regression / binary classification, $K$ neurons for $K$-class classification.

The notation is worth committing to memory, because it recurs in every subsequent week:

Superscripts are layer indices; subscripts are unit indices: $a_j^{(l)}$ is the activation of unit $j$ in layer $l$, $w_{ji}^{(l)}$ is the weight from unit $i$ of layer $l-1$ into it, and $a^{(0)} = x$ at the input layer.
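The layer-by-layer structure is short enough to sketch directly. A minimal pure-Python forward pass, assuming sigmoid activations everywhere and hypothetical toy weights for a 2-2-1 network (two inputs, one hidden layer of two units, one output):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, layers):
    """Forward pass through an MLP.

    `layers` is a list of (weights, biases) pairs, one per layer;
    weights[j][i] is the weight from unit i of the previous layer
    to unit j of this layer. The input x plays the role of a^(0).
    """
    a = x  # activations of layer 0 (the input layer: no computation)
    for weights, biases in layers:
        # pre-activations: weighted sum of the previous layer plus bias
        z = [sum(w_ji * a_i for w_ji, a_i in zip(w_j, a)) + b_j
             for w_j, b_j in zip(weights, biases)]
        a = [sigmoid(z_j) for z_j in z]  # activations feed the next layer
    return a

# Hypothetical weights for a 2-2-1 network.
hidden = ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0])
output = ([[1.0, 1.0]], [-0.5])
y_hat = forward([1.0, 0.0], [hidden, output])
```

Each iteration of the loop is one layer: the outputs of one layer are exactly the inputs of the next, which is what makes the network "feed-forward".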

The training problem: gradient descent still works, but computing the gradient is hard

The optimisation framework hasn’t changed. We still minimise a loss function with gradient descent:

$\theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta)$

The catch is computing $\nabla_\theta L$. For a single perceptron it’s a short chain-rule calculation. For an MLP with millions of parameters, direct computation is both:

  • Infeasible by hand — you can’t write down a closed-form derivative for every parameter.
  • Wasteful if done naively — many intermediate derivatives get reused across different parameters’ gradients, so recomputing them is exponentially slow.

The loss depends on every parameter, because the forward pass chains them all together. A weight in layer 2 influences activations in layer 3, which influence layer 4, which influence the final output. The output depends on all the parameters, not just the last layer.

The computation graph: making derivatives mechanical

To compute gradients systematically, represent the whole network as a computation graph — a directed acyclic graph of elementary operations. Every intermediate quantity gets its own node; every arrow is one elementary operation (addition, multiplication, activation, squaring, etc.).

The squared-error loss of a single neuron becomes a chain of five nodes — multiply ($wx$), add ($+\,b$), activate ($\sigma$), subtract ($\hat{y} - y$), square — each connected by a single operation. A whole MLP becomes a bigger graph with branching, but built from the same elementary pieces.

Once the graph is drawn, the chain rule is mechanical. Walk any path from parameter to loss, multiply the local derivatives along the way, and you have the gradient.

Chain rule: univariate and multivariate

Recall the univariate chain rule: if $y = f(u)$ and $u = g(x)$, then

$\frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx}$

But in a computation graph, a single variable often influences the output through multiple paths. For $L(u, v)$ with $u = u(x)$ and $v = v(x)$, the variable $x$ reaches $L$ through both $u$ and $v$. The multivariate chain rule says: sum contributions from every path:

$\frac{dL}{dx} = \frac{\partial L}{\partial u} \frac{du}{dx} + \frac{\partial L}{\partial v} \frac{dv}{dx}$

In a neural network, early-layer weights routinely influence the output through many paths (every downstream neuron they touch). The multivariate chain rule is what correctly accumulates these contributions.
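The path-summing rule is easy to sanity-check numerically. A minimal sketch, assuming a toy function $f(x) = (x+1) \cdot x^2$, where $x$ reaches $f$ through both the factor $u = x + 1$ and the factor $v = x^2$:

```python
def f(x):
    u = x + 1.0   # first path from x to the output
    v = x * x     # second path from x to the output
    return u * v  # x influences f through both u and v

def grad_f(x):
    # Multivariate chain rule: sum the contribution of each path.
    u, v = x + 1.0, x * x
    dL_du, dL_dv = v, u          # local derivatives of the product u*v
    du_dx, dv_dx = 1.0, 2.0 * x  # local derivatives of each path
    return dL_du * du_dx + dL_dv * dv_dx

# Central finite difference at x = 2; analytically f' = 3x^2 + 2x = 16.
x, h = 2.0, 1e-6
numeric = (f(x + h) - f(x - h)) / (2 * h)
```

Dropping either term of the sum gives a visibly wrong gradient — which is exactly the bug the multivariate rule prevents.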

Backpropagation: the two-pass algorithm

Backpropagation is the algorithm that walks the computation graph twice:

Forward pass — input to output. At each node, compute its value from its predecessors and cache it. At the end, you have the loss and all intermediate activations.

Backward pass — output back to input. At each node $v$, compute $\frac{\partial L}{\partial v}$ using the multivariate chain rule:

$\frac{\partial L}{\partial v} = \sum_{w \in \text{successors}(v)} \frac{\partial L}{\partial w} \cdot \frac{\partial w}{\partial v}$

Crucially, because we go backward, $\frac{\partial L}{\partial w}$ is already computed by the time we reach $v$. We just multiply by local derivatives (which use cached forward-pass values) and sum.

Worked example

For $L = u \cdot v$ with $u = x + y$, $v = x \cdot y$, evaluated at $x = 2$, $y = 3$:

Forward: $u = 5$, $v = 6$, $L = 30$.

Backward: $\frac{\partial L}{\partial L} = 1$; $\frac{\partial L}{\partial u} = v = 6$; $\frac{\partial L}{\partial v} = u = 5$; $\frac{\partial L}{\partial x} = 6 \cdot 1 + 5 \cdot 3 = 21$; $\frac{\partial L}{\partial y} = 6 \cdot 1 + 5 \cdot 2 = 16$ (the last two sums each combine two paths via the multivariate chain rule).

Every intermediate gradient is computed once and reused for every predecessor. That’s the efficiency win that makes training a billion-parameter network possible.
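The two-pass pattern can be traced by hand on a small graph. A sketch taking $L = u \cdot v$ with $u = x + y$ and $v = x \cdot y$ at $x = 2$, $y = 3$:

```python
# Forward pass: compute and cache every node value, input to output.
x, y = 2.0, 3.0
u = x + y   # u = 5
v = x * y   # v = 6
L = u * v   # L = 30

# Backward pass: visit nodes in reverse order. Each gradient is computed
# once, from successor gradients that are already available, then reused
# by every predecessor.
dL_dL = 1.0
dL_du = dL_dL * v                # local derivative of u*v w.r.t. u
dL_dv = dL_dL * u                # local derivative of u*v w.r.t. v
dL_dx = dL_du * 1.0 + dL_dv * y  # two paths into x, summed
dL_dy = dL_du * 1.0 + dL_dv * x  # two paths into y, summed
```

Note that `dL_du` and `dL_dv` are each computed once and then used by both `dL_dx` and `dL_dy` — the reuse that makes the backward pass linear in the number of nodes.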

TIP — Backpropagation is not a new algorithm, it's a reorganisation

Backprop doesn’t invent any new math — it’s the chain rule applied systematically on a graph. What makes it clever is the order of computation: go backward so each gradient is available when its predecessors need it, cache every intermediate result, and the whole thing runs in time linear in the number of graph nodes. The same principle underlies every modern automatic-differentiation framework (PyTorch, TensorFlow, JAX).

Training loop

Putting it all together, one training step is:

  1. Forward pass: compute $\hat{y}$ from inputs and current weights. Cache all intermediate activations.
  2. Compute loss: $L(\hat{y}, y)$.
  3. Backward pass: walk the graph backwards, computing $\frac{\partial L}{\partial \theta}$ over all parameters $\theta$.
  4. Update: $\theta \leftarrow \theta - \eta \frac{\partial L}{\partial \theta}$.
  5. Repeat until convergence or the iteration budget.

At inference time (using a trained network to predict), you only need the forward pass. The backward pass and the weight updates are training-only machinery; once training finishes, you throw them away and keep the learned weights.
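A minimal sketch of the five steps, assuming the simplest possible model (a single sigmoid neuron, binary cross-entropy, hypothetical toy 1-D data) — for sigmoid + BCE the backward pass collapses to $\frac{\partial L}{\partial z} = \hat{y} - y$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy 1-D data: class 1 for positive inputs, class 0 for negative.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]

w, b, lr = 0.0, 0.0, 0.5
for epoch in range(200):            # 5. repeat until the iteration budget
    for x, y in data:
        y_hat = sigmoid(w * x + b)  # 1. forward pass
        # 2. the loss is binary cross-entropy; 3. backward pass:
        # for sigmoid + BCE the gradient w.r.t. the pre-activation is
        dz = y_hat - y
        # 4. update each parameter down its gradient
        w -= lr * dz * x
        b -= lr * dz

# Inference: forward pass only — the training machinery is discarded.
predictions = [round(sigmoid(w * x + b)) for x, _ in data]
```

The same skeleton scales to an MLP: only step 3 changes, from a one-line derivative to a full backward walk over the graph.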

Multi-class classification: softmax

Binary classification pairs sigmoid with binary cross-entropy. Multi-class classification (digit recognition, letter recognition, etc.) generalises both. The output layer has $K$ neurons — one per class — and produces raw scores $z_1, \dots, z_K$. softmax converts these scores into a probability distribution:

$\text{softmax}(z)_k = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$
Outputs are positive and sum to 1, so they’re readable as $P(\text{class } k \mid x)$. Labels are represented as one-hot vectors of length $K$: a 1 in the true-class position, 0s elsewhere. And the loss is the multi-class version of cross-entropy, $L = -\sum_{k=1}^{K} y_k \log \hat{y}_k$ — exactly the same MLE derivation as binary cross-entropy, just over a categorical distribution instead of a Bernoulli.
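A minimal sketch of softmax and categorical cross-entropy in pure Python (the max-subtraction is a standard numerical-stability trick: it leaves the result unchanged but keeps `exp` from overflowing):

```python
import math

def softmax(scores):
    # Subtracting the max score changes nothing mathematically
    # (it cancels in the ratio) but prevents exp() overflow.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probs, one_hot):
    # Categorical cross-entropy: -log probability of the true class.
    return -sum(t * math.log(p) for p, t in zip(probs, one_hot))

probs = softmax([2.0, 1.0, 0.1])         # hypothetical raw scores
loss = cross_entropy(probs, [1, 0, 0])   # true class is class 0
```

Because the label is one-hot, the sum in `cross_entropy` keeps exactly one term: the negative log-probability assigned to the correct class.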

Once training works, generalisation becomes the real problem

With backpropagation and softmax, you can now train arbitrarily deep networks for arbitrary numbers of classes. The next problem: the model might memorise the training set rather than learn generalisable patterns. This is overfitting.

Underfitting — model too simple, can’t even fit the training data. Overfitting — model too powerful, fits training data near-perfectly including the noise, fails on unseen data. Good fit — captures the underlying pattern, tolerates noise.

ASIDE — The student analogy

A student who memorises past exam answers but doesn’t understand the material fails on new questions. That’s overfitting. A student who doesn’t study at all fails both familiar and new questions. That’s underfitting. You want the student who understands the concepts well enough to solve problems they’ve never seen.

The key observation: the model’s training loss measures memorisation; only the test loss (on unseen data) measures generalisation. If the two diverge — training loss low, test loss high — you’ve overfit.

Three-way data split

To detect overfitting and control it, partition your data into three disjoint sets:

  Split        Fraction   Purpose
  Training     ~60%       Fit the weights with gradient descent
  Validation   ~20%       Tune hyperparameters, pick the stopping point
  Test         ~20%       Final one-shot evaluation

Each split has its own job and must not leak into another. Training on the test set compromises your final evaluation (it’s no longer “unseen”). Training on the validation set compromises the signal you use to tune hyperparameters (the validation loss stops being a faithful estimate of generalisation).
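The split itself is a few lines. A sketch assuming the ~60/20/20 fractions above, with shuffling so the splits are representative and a fixed seed so they are reproducible:

```python
import random

def three_way_split(examples, seed=0):
    """Shuffle, then split ~60/20/20 into train/validation/test."""
    rng = random.Random(seed)  # fixed seed: same split every run
    shuffled = examples[:]     # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train, n_val = int(0.6 * n), int(0.2 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = three_way_split(list(range(100)))
```

Splitting by slicing a single shuffled copy guarantees the three sets are disjoint — the no-leakage property the section insists on.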

TIP — It's just like a university term

Training data = practice exercises. Validation = mid-term assessments. Test = final exam. The final-exam rule — you don’t study from the exam beforehand — is the test-set rule.

Fixing overfitting: early stopping and regularisation

Early stopping: monitor validation loss during training. It drops initially (the model is learning useful patterns), reaches a minimum, then rises (the model starts memorising noise). Stop training at the minimum — those weights generalise best.
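The stopping logic can be sketched in isolation. A minimal version with a hypothetical `patience` parameter (stop after this many epochs without improvement, a common refinement since the validation curve is noisy), run here on a made-up loss curve:

```python
def early_stopping(val_losses, patience=3):
    """Return the index of the epoch whose weights to keep.

    Stops once validation loss has failed to improve for `patience`
    consecutive epochs, and returns the epoch of the best loss seen.
    """
    best_epoch, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss = epoch, loss
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch

# A typical curve: drops, bottoms out at epoch 3, then rises (overfitting).
curve = [0.90, 0.70, 0.52, 0.40, 0.45, 0.55, 0.70, 0.90]
stop_at = early_stopping(curve)
```

In practice the weights are checkpointed at each new best epoch, so "return the epoch" means "restore that checkpoint".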

Regularisation (specifically, weight decay): add an explicit penalty on weight magnitudes to the loss:

$L_{\text{reg}} = L + \lambda \sum_i w_i^2$

The hyperparameter $\lambda$ controls the strength of the penalty. The optimiser now has to balance fitting the data against keeping weights small. Smaller weights → smoother decision boundary → better generalisation.
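In the update step, the penalty shows up as one extra gradient term: differentiating $\lambda w^2$ adds $2\lambda w$ to each weight's gradient. A one-function sketch with hypothetical learning rate and $\lambda$ values:

```python
def update(w, grad, lr=0.1, lam=0.01):
    # Gradient step on L + lambda * w^2: the penalty contributes
    # an extra 2 * lambda * w, always pushing w toward zero.
    return w - lr * (grad + 2 * lam * w)

# With zero data gradient, the penalty alone shrinks the weight.
w = update(1.0, 0.0)
```

This is why the technique is called weight *decay*: even with no signal from the data, every step multiplies each weight by a factor slightly below 1.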

The two techniques are complementary: early stopping limits how long the model is allowed to overfit; regularisation limits how extreme the weights can become. Both are used together in modern practice.

Summary: the week 3 pipeline

We now have the full training pipeline for deep networks:

  1. Architecture: stack perceptrons into a multi-layer network.
  2. Representation: model the network as a computation graph.
  3. Training: forward pass to compute outputs, backward pass (backprop) to compute gradients, gradient descent to update weights.
  4. Output heads: sigmoid + binary cross-entropy for binary classification, softmax + categorical cross-entropy for multi-class, linear + MSE for regression.
  5. Generalisation: split data into train/val/test, monitor validation loss, use early stopping and regularisation.

Concepts introduced this week

  • multi-layer-perceptron — architecture of stacked perceptron layers; handles non-linearly-separable data
  • computation-graph — DAG representation of a function as nodes (variables) and edges (operations)
  • backpropagation — efficient gradient computation via forward and backward passes over the graph
  • softmax — multi-class output activation producing a probability distribution
  • overfitting — the memorisation failure mode; paired with underfitting and the generalisation ideal
  • regularisation — weight decay / L2 penalty to reduce effective model capacity

Connections

  • Builds on week-02: gradient descent still does the optimisation; sigmoid still makes classification differentiable. This week’s novelty is (a) stacking perceptrons and (b) the algorithm that computes gradients efficiently for the stack.
  • Builds on week-01: the MLE framework still derives losses. Binary cross-entropy extends to categorical cross-entropy via the same recipe, now with softmax producing class probabilities.
  • Sets up week 4: convolutional neural networks (CNNs) — a specialised architecture for image inputs that isn’t fully connected. Same backprop, different connection pattern.

Open questions

  • The exact gradient of softmax + categorical cross-entropy was sketched but not derived. Worth working through; the result simplifies to $\hat{y} - y$, just like sigmoid + binary cross-entropy.
  • Modern deep networks use activations beyond sigmoid — ReLU, tanh, GELU — that avoid the vanishing-gradient problem at depth. Those appear in later weeks but sit on top of the same backprop machinery.
  • Architecture design (how many layers? how wide?) was left as “human trial and error”. Neural Architecture Search exists but is beyond this module’s scope.
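The first open question can at least be sanity-checked numerically before deriving it: finite differences on the softmax + cross-entropy loss should match $\hat{y} - y$ component by component. A sketch with hypothetical scores and a true class of 1:

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def ce_loss(z, one_hot):
    # Categorical cross-entropy of softmax(z) against a one-hot label.
    p = softmax(z)
    return -sum(t * math.log(pk) for pk, t in zip(p, one_hot))

z, y = [2.0, 1.0, 0.1], [0, 1, 0]
p = softmax(z)

# Central finite differences: perturb each score up and down by h.
h = 1e-6
numeric = []
for k in range(len(z)):
    z_hi = z[:]; z_hi[k] += h
    z_lo = z[:]; z_lo[k] -= h
    numeric.append((ce_loss(z_hi, y) - ce_loss(z_lo, y)) / (2 * h))

# The claimed analytic gradient: prediction minus one-hot label.
claimed = [pk - yk for pk, yk in zip(p, y)]
```

The numerical and claimed gradients agree to several decimal places, which is strong evidence for the simplification even before the algebra is worked through.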