A model overfits when, in the process of reducing training error $E_{\text{in}}$, it increases test error $E_{\text{out}}$. The resulting fit captures noise and idiosyncrasies of the specific training set rather than the underlying structure of the target. Symbolically: low $E_{\text{in}}$ with high $E_{\text{out}}$ means bad generalisation. The dual failure is underfitting, where both errors are high because the model class is too restricted to capture $f$.

Diagnosis

Overfitting and underfitting live at opposite ends of the bias–variance spectrum:

  • $E_{\text{in}}$ low, $E_{\text{out}}$ high: overfitting (high variance, low bias).
  • Both errors high and similar: underfitting (high bias, low variance).
  • Both errors low and similar: the sweet spot, a correctly specified model with enough data.
  • Both errors low but different: possible data leak; investigate.

The defining feature of overfitting is the gap between training and test performance. A model with $E_{\text{in}} = 0$ that gets every training point exactly right but predicts wildly on new data is the classic example: a degree-10 polynomial through 10 noisy points has $E_{\text{in}} = 0$ and an $E_{\text{out}}$ in the thousands.
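A minimal numpy sketch of that gap (the target, noise level, and seed here are assumptions for illustration, and the fit uses degree $N-1 = 9$ so the interpolation through the 10 points is exactly determined):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simple underlying target with a little label noise (assumed for illustration).
f = lambda x: np.sin(2 * np.pi * x)
x_train = np.sort(rng.uniform(0, 1, 10))
y_train = f(x_train) + rng.normal(0, 0.1, x_train.shape)

x_test = rng.uniform(0, 1, 1000)
y_test = f(x_test) + rng.normal(0, 0.1, x_test.shape)

# A degree-9 polynomial through 10 points interpolates every training point.
coeffs = np.polyfit(x_train, y_train, deg=9)
mse = lambda x, y: np.mean((np.polyval(coeffs, x) - y) ** 2)

print(f"E_in  = {mse(x_train, y_train):.3e}")  # essentially zero
print(f"E_out = {mse(x_test, y_test):.3e}")    # typically far larger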

Causes

The lecture lists six common causes. They compound: real overfitting usually involves several of them at once.

  • Model too complex. A deep network with millions of parameters trained on 1,000 images memorises the images rather than learning generalisable patterns.
  • Too little training data. A sentiment classifier trained on 50 customer reviews fits unique phrases instead of sentiment patterns; with 50,000 it generalises.
  • Too many training epochs. Training error keeps falling while validation error bottoms out and then climbs. Early stopping, the standard fix, is also a form of regularisation (see the sketch after this list).
  • Lack of regularisation. Without constraints, weights are free to grow large to fit specific outliers, reducing $E_{\text{in}}$ at the cost of generalisation.
  • High-variance features or noisy labels. Random “ID number” features can become decision-tree splits; mislabelled examples can drag the boundary.
  • Poor data processing. Unstandardised inputs cause distance-based methods (SVM, k-NN) to weight high-magnitude features arbitrarily; data leakage between training and validation invalidates the whole exercise.
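A minimal sketch of the early-stopping mechanism mentioned above (hypothetical setup: plain gradient descent on degree-9 polynomial features with a small held-out validation set; the sizes, learning rate, and patience value are all assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)

def features(x, degree=9):
    # High-degree polynomial feature map: plenty of capacity to overfit.
    return np.vander(x, degree + 1, increasing=True)

x = rng.uniform(0, 1, 40)
y = f(x) + rng.normal(0, 0.2, x.shape)
X_train, y_train = features(x[:30]), y[:30]
X_val, y_val = features(x[30:]), y[30:]

w = np.zeros(X_train.shape[1])
best_w, best_val, best_epoch, patience = w.copy(), np.inf, 0, 200

for epoch in range(20_000):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad
    val_mse = np.mean((X_val @ w - y_val) ** 2)
    if val_mse < best_val:               # validation still improving: remember these weights
        best_w, best_val, best_epoch = w.copy(), val_mse, epoch
    elif epoch - best_epoch > patience:  # no improvement for `patience` epochs: stop training
        break

print(f"best validation MSE {best_val:.3f} at epoch {best_epoch}")
```

Training error keeps shrinking for as long as you let the loop run; the weights you keep are the ones from the epoch where the held-out error was lowest.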

The Two Learners

A vivid framing from the lecture. The target $f$ is degree-10 with very small noise; there are $N = 15$ training points. Two learners:

  • Learner $O$ (Overfit). Picks $\mathcal{H}_{10}$: a degree-10 polynomial, to “match” the target’s complexity.
  • Learner $R$ (Restrict). Picks $\mathcal{H}_2$: a degree-2 polynomial, despite knowing the target is degree-10.

$R$ “gives up” the ability to fit the target exactly. Yet:

  • Noisy low-order target ($f$ degree-2 plus noise): $R$ ends up with a markedly lower $E_{\text{out}}$ than $O$.
  • Noiseless high-order target ($f$ degree-10, no noise): $R$ again ends up with the lower $E_{\text{out}}$.

$R$ wins by a lot, even when its hypothesis class can’t represent the truth. With only 15 points, the high-capacity learner has so much variance that the bias saving evaporates.

The lesson is uncomfortable: using the right hypothesis class isn’t enough; you need enough data to constrain it. With insufficient $N$, deliberate underfitting beats accurate complexity-matching. This is the bias–variance trade-off shouting in your ear.
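A rough numerical reconstruction of the comparison (how the degree-10 target, the input distribution, and the noise level are generated below is my assumption, so the exact numbers will not match the lecture's):

```python
import numpy as np

rng = np.random.default_rng(2)
N, trials, sigma = 15, 500, 0.05   # 15 points, very small noise (level assumed)

def one_trial():
    # Random degree-10 target built from Legendre polynomials (an assumption).
    target = np.polynomial.legendre.Legendre(rng.normal(size=11))
    x = rng.uniform(-1, 1, N)
    y = target(x) + rng.normal(0, sigma, N)
    x_test = rng.uniform(-1, 1, 2000)
    y_test = target(x_test)

    e_out = {}
    for name, deg in [("R: degree 2", 2), ("O: degree 10", 10)]:
        coeffs = np.polyfit(x, y, deg)
        e_out[name] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    return e_out

results = [one_trial() for _ in range(trials)]
for name in results[0]:
    print(name, "average E_out:", np.mean([r[name] for r in results]))
```

Averaging over many trials is what makes the comparison fair: any single draw of 15 points can flatter either learner.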

Stochastic vs Deterministic Noise

Two kinds of “noise” drive overfitting:

Stochastic noise is randomness in the labels: $y = f(x) + \epsilon$ with $\epsilon$ zero-mean and variance $\sigma^2$. Truly random, irreducible, the same kind of noise that gave rise to the $\sigma^2$ floor in the bias–variance decomposition.

Deterministic noise is the part of the target that the hypothesis set $\mathcal{H}$ cannot represent. If $f$ is degree-50 and $\mathcal{H} = \mathcal{H}_2$ (degree-2 polynomials), the residual $f - h^*$ (where $h^*$ is the best degree-2 approximation) is the deterministic noise: a fixed function of $x$, perfectly predictable in principle, but invisible to the chosen model class.

Both look the same to the learner. The trained model can’t tell whether the residuals come from random label noise or from the target’s structural complexity exceeding what $\mathcal{H}$ can express. From the learner’s perspective, both are “stuff I can’t fit”, and both pull the fit towards spurious patterns when capacity is high relative to data.
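A small sketch of deterministic noise as a computable quantity (the random degree-50 Legendre target and the dense grid are assumptions): project the noiseless target onto $\mathcal{H}_2$ by least squares and look at what is left over.

```python
import numpy as np
from numpy.polynomial import Polynomial
from numpy.polynomial.legendre import Legendre

rng = np.random.default_rng(3)

# Noiseless high-order target: a random degree-50 polynomial (assumed form).
f = Legendre(rng.normal(size=51))

# Best degree-2 approximation h* in the least-squares sense, on a dense grid.
x = np.linspace(-1, 1, 5001)
h_star = Polynomial.fit(x, f(x), deg=2)

# Deterministic noise: the residual f - h*, a fixed function of x that H_2 cannot express.
det_noise = f(x) - h_star(x)
print("mean-squared deterministic noise:", np.mean(det_noise ** 2))
```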

The four impact factors of overfitting (from the lecture’s heat-map experiment):

  • Data size ↓: overfitting ↑
  • Stochastic noise ↑: overfitting ↑
  • Deterministic noise ↑ (target complexity): overfitting ↑
  • Excessive model power: overfitting ↑

Why MLE Cannot Prevent It

MLE minimises in-sample negative log-likelihood. It has no mechanism to prefer “simpler” hypotheses over more complex ones — the likelihood always favours whatever maximises in-sample fit. So an MLE-fit polynomial of high enough degree will interpolate every training point and overfit terribly.

The Bayesian remedy is to combine the likelihood with a prior on $\mathbf{w}$ that penalises large weights. The MAP estimate then balances likelihood (fit) against prior (simplicity), which for a Gaussian prior is exactly ridge regression. This is the probabilistic-mechanics view of why regularisation works.
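Concretely, for linear regression with Gaussian noise of variance $\sigma^2$ and a zero-mean Gaussian prior of variance $\tau^2$ on the weights (these symbols are the standard ones, not taken from the lecture):

$$
\hat{\mathbf{w}}_{\text{MAP}}
= \arg\max_{\mathbf{w}} \bigl[\log p(\mathcal{D}\mid\mathbf{w}) + \log p(\mathbf{w})\bigr]
= \arg\min_{\mathbf{w}} \sum_{i=1}^{N}\bigl(y_i - \mathbf{w}^{\top}\mathbf{x}_i\bigr)^2 + \lambda\,\lVert\mathbf{w}\rVert_2^2,
\qquad \lambda = \frac{\sigma^2}{\tau^2},
$$

which is exactly the ridge-regression objective: the tighter the prior (smaller $\tau^2$), the harder simplicity is enforced.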

What to Do

The lecture’s two cures are the standard practical toolkit:

  1. Regularisation (“putting the brakes”). Add a penalty term that biases the optimiser towards simpler hypotheses. L2 (ridge), L1 (lasso), early stopping, dropout — all are regularisation in disguise.
  2. Validation / cross-validation (“checking the bottom line”). Hold out a portion of the data, fit on the rest, and select hyperparameters (model class, $\lambda$, epochs) by held-out performance. This measures, rather than assumes, that you are not overfitting (see the sketch after this list).
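A minimal sketch combining the two cures (hypothetical data; closed-form ridge plus 5-fold cross-validation over a small grid of $\lambda$ values, all of which are assumptions):

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: noisy samples of a smooth target, degree-9 polynomial features.
f = lambda x: np.sin(2 * np.pi * x)
x = rng.uniform(0, 1, 30)
y = f(x) + rng.normal(0, 0.2, x.shape)
X = np.vander(x, 10, increasing=True)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Fix the folds once so every lambda is judged on the same splits.
folds = np.array_split(rng.permutation(len(y)), 5)

def cv_mse(lam):
    # 5-fold cross-validation estimate of out-of-sample MSE for a given lambda.
    errs = []
    for fold in folds:
        mask = np.ones(len(y), dtype=bool)
        mask[fold] = False
        w = ridge_fit(X[mask], y[mask], lam)
        errs.append(np.mean((X[fold] @ w - y[fold]) ** 2))
    return np.mean(errs)

lambdas = [1e-6, 1e-4, 1e-2, 1e-1, 1.0, 10.0]
best_lam = min(lambdas, key=cv_mse)
print("lambda chosen by cross-validation:", best_lam)
```

The penalty supplies the brakes; the cross-validated score is the bottom line used to decide how hard to press them.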

The Bayesian connection ties these together: regularisation is MAP with a prior, validation is empirical model selection. Both attack the same enemy from different angles.

Active Recall