The central tension of machine learning: a model that’s too weak can’t capture the pattern, but a model that’s too strong memorises the training examples rather than learning the underlying rule. Either way, performance on new data is bad.

The spectrum: underfitting, good fit, overfitting

| Regime | Model capacity | Training performance | Test performance | Symptom |
|---|---|---|---|---|
| Underfit | Too low | Poor | Poor | Can’t capture the trend — training loss stays high |
| Good fit | Matched to problem | Reasonable | Reasonable | Captures the pattern, tolerates noise |
| Overfit | Too high | Near-perfect | Poor | Memorises training data including noise |

The classic illustration:

  • Underfit classification: a single straight line drawn through a dataset with a clearly curved decision boundary. Lots of misclassified training points, and new points are mispredicted in exactly the same way.
  • Overfit classification: a wildly wiggly boundary that zig-zags to classify every training point correctly, including points that are probably noise. New points fall into the wrong pockets of the wiggly region.
  • Good fit: a smooth curve that follows the general shape of the data without chasing individual noisy points.

The same pattern in regression: underfit = straight line through curved data, overfit = high-degree polynomial that threads through every training point, good fit = smooth curve capturing the underlying signal.
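
The regression version can be demonstrated in a few lines of numpy. The sine signal, the noise level, and the degrees 1/3/15 below are illustrative choices, not prescriptions, and `poly_mse` is a helper defined here:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

# Noisy samples of a smooth underlying signal (one sine period).
def make_data(n):
    x = np.sort(rng.uniform(0.0, 1.0, n))
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

def poly_mse(degree):
    """Least-squares fit of a polynomial of the given degree on the
    training set; returns (train MSE, test MSE)."""
    model = Polynomial.fit(x_train, y_train, degree)
    train_mse = np.mean((model(x_train) - y_train) ** 2)
    test_mse = np.mean((model(x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 15):   # underfit, good fit, overfit
    train_mse, test_mse = poly_mse(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training MSE can only fall as the degree grows, since each polynomial family contains the smaller ones; whether test MSE falls too is the whole question, and with this few training points the degree-15 fit typically chases the noise.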

Generalisation is the real goal

Training loss measures how well the model fits examples it has already seen; test loss measures generalisation — how well the model performs on data it’s never seen. Since the model is deployed on new inputs, only generalisation matters.

TIP — The student analogy

A student who memorises every exam question from past years but doesn’t understand the concepts will fail on new questions. That’s overfitting — perfect recall of training data, poor performance on unseen material. A student who doesn’t study at all fails both. That’s underfitting. You want the student who understands the concepts well enough to solve problems they’ve never seen — that’s generalisation.

Causes

Underfitting:

  • Model too simple (not enough parameters, wrong architecture).
  • Trained for too few iterations.
  • Features too impoverished to contain the signal.

Overfitting:

  • Model too powerful (too many parameters relative to training data).
  • Too few training examples for the model’s capacity.
  • Trained for too long, fitting noise once the signal is captured.
  • No regularisation.

More complex models (deeper / wider networks, higher-degree polynomials) can’t underfit in the capacity sense: they can always shrink to mimic a simpler model. Their characteristic failure mode is overfitting — the risk shifts from “too rigid” to “too flexible”.

The three-way data split

To detect and control overfitting, split your dataset into three disjoint partitions:

| Split | Typical fraction | Purpose |
|---|---|---|
| Training | ~60% | Gradient descent updates the weights using this data |
| Validation | ~20% | Monitor generalisation during training — tune hyperparameters, pick stopping point, compare models |
| Test | ~20% | Final, one-shot evaluation after all training decisions are locked in |
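
A minimal sketch of such a split, assuming a shuffled index-based partition; the 60/20/20 fractions match the table above, and `three_way_split` is a hypothetical helper name:

```python
import numpy as np

def three_way_split(n_examples, fractions=(0.6, 0.2, 0.2), seed=0):
    """Shuffle example indices and cut them into three disjoint
    train/validation/test index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_examples)
    n_train = int(fractions[0] * n_examples)
    n_val = int(fractions[1] * n_examples)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train_idx, val_idx, test_idx = three_way_split(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 600 200 200
```

Shuffling before cutting matters: if the dataset is ordered (by class, by date), contiguous slices would give the three splits different distributions.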

Key rule: each split has a specific purpose and must not leak into another.

  • Training on the test set → the test evaluation no longer reflects generalisation. The test set would become just another piece of training data.
  • Training on the validation set → you’d overfit to the validation set, and its measurements would stop being a faithful estimate of generalisation.
  • Evaluating on the training set → tells you nothing about generalisation; high training accuracy is compatible with severe overfitting.

TIP — The university-exam analogy

Training data is like practice exercises you study from. Validation data is like mid-term assessments — internal checks during the term. Test data is the final exam — you only see it once, and you don’t study from it beforehand, or the evaluation is compromised. The final-exam rule (“no peeking”) is exactly the rule for test sets.

Fixing underfitting

Use a more powerful model: more layers, more neurons per layer, richer features. The underlying issue is “the model cannot represent the pattern”, so give it more representational capacity.

Fixing overfitting

Several levers, usually used in combination:

  1. More training data. The single most reliable fix. A model overfits when its parameter count is large relative to the training-set size; adding examples increases the ratio of signal to noise.
  2. Less powerful model. Fewer parameters (shallower / narrower network, lower-degree polynomial) gives the model less room to memorise.
  3. Early stopping. Stop training before the model has had time to overfit (see below).
  4. Regularisation. Add an explicit penalty on weight magnitude to the loss. See regularization.
  5. dropout — randomly disable neurons during training to prevent over-reliance on specific units.
  6. data-augmentation — synthesise extra training examples by applying label-preserving transformations.
  7. transfer-learning — start from weights pre-trained on a large dataset, dramatically reducing the effective data requirement.
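
As an illustration of lever 4, here is a sketch of ridge regression, the L2-penalised form of least squares. The data and the penalty strengths are made up for the demo; `ridge_fit` is a name chosen here:

```python
import numpy as np

rng = np.random.default_rng(0)

# A small noisy linear problem: 30 examples, 10 features, conditions
# where an explicit weight penalty helps.
X = rng.normal(size=(30, 10))
y = X @ rng.normal(size=10) + rng.normal(0.0, 0.5, 30)

def ridge_fit(X, y, lam):
    """Minimise ||Xw - y||^2 + lam * ||w||^2. The L2 penalty gives the
    closed form w = (X^T X + lam I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# The penalty shrinks the weights: larger lam, smaller weight norm.
for lam in (0.0, 1.0, 10.0):
    w = ridge_fit(X, y, lam)
    print(f"lam = {lam:5.1f}   ||w|| = {np.linalg.norm(w):.3f}")
```

Shrinking the weights is exactly the "less room to memorise" effect of lever 2, achieved without removing any parameters.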

Early stopping

Monitor training loss and validation loss at each epoch:

  • Training loss typically decreases monotonically — the model keeps memorising more of the training set.
  • Validation loss decreases initially (the model is learning generalisable patterns), then bottoms out, then increases as the model starts overfitting to training-set noise.

The minimum of the validation loss is the sweet spot: further training improves training loss but hurts generalisation. Early stopping picks the epoch where validation loss is lowest and uses those weights as the final model.

Visually:

  • Both losses decrease together in the “learning useful patterns” phase.
  • They diverge at the “overfitting starts” point — training loss keeps falling, validation loss starts rising.
  • Stop at the divergence point.
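
The stopping rule above can be sketched as a generic loop. The `patience` parameter and the stand-in U-shaped validation curve in the usage example are illustrative assumptions, not from the text:

```python
def train_with_early_stopping(step, val_loss, max_epochs=100, patience=5):
    """Run `step` once per epoch, keep the weights whose validation loss
    was lowest, and stop after `patience` epochs without improvement.

    step(epoch) -> weights after one more epoch of training
    val_loss(weights) -> loss measured on the validation split
    """
    best_loss, best_weights, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        weights = step(epoch)
        loss = val_loss(weights)
        if loss < best_loss:
            best_loss, best_weights, epochs_since_best = loss, weights, 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break   # validation loss stopped improving: divergence point
    return best_weights, best_loss

# Stand-in training run whose validation loss is U-shaped in the epoch:
# it bottoms out at epoch 30, then rises (the overfitting phase).
weights, loss = train_with_early_stopping(
    step=lambda epoch: epoch,                  # "weights" = epoch index
    val_loss=lambda w: (w - 30) ** 2 / 100.0,  # minimum at epoch 30
)
print(weights, loss)  # 30 0.0
```

Returning the best weights seen, rather than the final ones, is the important detail: by the time the loop notices the divergence, the model has already trained `patience` epochs past the sweet spot.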

ASIDE — Early stopping is a form of regularisation

Even without an explicit penalty term, limiting the number of training iterations constrains how thoroughly the model can fit the training data. Stopping early is equivalent to restricting the effective capacity of the model, which is exactly what other regularisers do.

See also

  • regularization — an explicit loss term that discourages overfitting
  • dropout — randomly disable neurons during training; an alternative regulariser
  • data-augmentation — synthesise more training examples to reduce overfitting from a small dataset
  • transfer-learning — bypass the data-hunger problem by starting from pre-trained weights
  • gradient descent — the process whose stopping point early stopping controls
  • loss-function — the training-loss metric; validation/test losses use the same formula on different data
  • multi-layer-perceptron — the architecture whose capacity is the main dial for overfitting

Active Recall