The ability of a learned hypothesis to perform well on new, unseen data drawn from the same distribution as the training set — the property that separates genuine learning from memorization.

Definition

A hypothesis h generalizes if its error on the full data distribution P is low, not just its error on the finite training set D. Formally, the generalization error (or test error) E_out(h) is:

  • Classification: E_out(h) = P_(x,y)~P[ h(x) ≠ y ] (misclassification probability)
  • Regression: E_out(h) = E_(x,y)~P[ (h(x) − y)² ] (expected squared error)

We never observe E_out(h) directly because we don’t have access to P. We can only estimate it from held-out test data.
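As a minimal sketch of that estimation step (assuming numpy and an invented synthetic distribution, a noisy threshold on the unit interval), we can approximate E_out(h) by the error frequency on a large held-out sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distribution P: x ~ Uniform(0, 1), y = 1[x > 0.5] with 10% label noise.
def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(size=n) < 0.10           # label noise
    return x, np.where(flip, 1 - y, y)

def h(x):                                       # a fixed hypothesis to evaluate
    return (x > 0.5).astype(int)

# E_out(h) is unobservable, but a large held-out sample estimates it well.
x_test, y_test = sample(100_000)
e_out_hat = np.mean(h(x_test) != y_test)        # misclassification frequency
print(f"estimated E_out(h) ~ {e_out_hat:.3f}")  # should sit near the 10% noise floor
```

The held-out points play the role of fresh draws from P, so the average 0/1 loss is an unbiased estimate of the misclassification probability defined above.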

Why Memorization Fails

A model that perfectly fits every training point may simply be reciting the training set rather than capturing the underlying pattern. Consider a lookup table that maps each training input to its training label: it achieves zero training error but tells us nothing about inputs it hasn’t seen. The gap between training error (error on D) and generalization error (error on the full distribution) is the central tension in machine learning.
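The lookup-table failure is easy to demonstrate concretely. In this sketch (assuming numpy and an invented discrete distribution so that exact key lookup is possible), memorisation gives zero training error and near-chance test error:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical distribution: integer inputs 0..999, label y = 1[x > 500] (deterministic).
def sample(n):
    x = rng.integers(0, 1000, n)
    y = (x > 500).astype(int)
    return x, y

x_train, y_train = sample(50)

# Lookup-table "hypothesis": memorise every training pair, guess 0 on anything unseen.
table = dict(zip(x_train.tolist(), y_train.tolist()))
def h(x):
    return np.array([table.get(int(xi), 0) for xi in x])

train_err = np.mean(h(x_train) != y_train)      # exactly 0: perfect recitation of D
x_test, y_test = sample(10_000)
test_err = np.mean(h(x_test) != y_test)         # near chance on unseen inputs
print(f"train error = {train_err:.3f}, test error = {test_err:.3f}")
```

The table fits D perfectly yet has learned nothing about the threshold rule, so the train/test gap is almost as large as it can be.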

What Makes Generalization Possible

Two assumptions make generalization tractable:

  1. i.i.d. sampling — Training and test data are drawn independently from the same distribution P. If the test distribution differs (distribution shift), generalization guarantees break down.
  2. Restricted hypothesis set — The learning algorithm searches within a limited family H, not over all possible functions. A smaller H is easier to learn from limited data but risks excluding the true function. This trade-off is formalized by VC dimension and bias–variance analysis (covered in weeks 8–10).

Two Pillars: Closeness and Attainability

A useful way to analyse generalization is to break it in two: it succeeds only when both of the following hold.

ASIDE — The marble jar

Imagine a giant jar full of marbles, where the fraction of red marbles is μ — the true error rate of our hypothesis on the entire distribution. We can’t count them all. Instead we pull out a small handful (the training set) and measure ν, the fraction of red marbles in our hand — the in-sample error E_in. Generalization rests on whether ν tells us anything reliable about μ.

Pillar 1 — Closeness: Is the in-sample error a faithful estimate of the out-of-sample error? That is, E_in(h) ≈ E_out(h). This is a property of sampling and hypothesis-set complexity, not of the model’s accuracy. The i.i.d. assumption gives us probabilistic tools (Hoeffding’s inequality, VC theory) that bound how far apart E_in and E_out can be. Roughly: with enough data and a not-too-flexible H, the marble fraction ν in our hand reflects the fraction μ in the jar.
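The marble-jar version of closeness can be checked numerically. This sketch (assuming numpy; the values of μ, N, and ε are arbitrary choices for illustration) draws many independent handfuls and compares the empirical frequency of a large |ν − μ| deviation with Hoeffding's bound 2·exp(−2ε²N):

```python
import numpy as np

rng = np.random.default_rng(2)

mu = 0.3           # true fraction of red marbles (E_out of a fixed hypothesis)
N = 100            # handful size (number of training examples)
eps = 0.1          # deviation we consider "too far"
trials = 100_000

# Each trial: draw a handful of N marbles, record the sample fraction nu.
nu = rng.binomial(N, mu, size=trials) / N
freq_bad = np.mean(np.abs(nu - mu) > eps)       # empirical P(|nu - mu| > eps)
hoeffding = 2 * np.exp(-2 * eps**2 * N)         # Hoeffding's inequality bound
print(f"empirical deviation rate {freq_bad:.4f} <= bound {hoeffding:.4f}")
```

Note that Hoeffding holds for a single fixed hypothesis; once the learning algorithm chooses among many hypotheses, the bound must be inflated (the union-bound / VC argument mentioned above).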

Pillar 2 — Attainability: Did we actually find a hypothesis with low E_in? This is what the learning algorithm delivers — picking an h ∈ H that fits the training data well. A hypothesis set that’s too restrictive may make low E_in unreachable (underfitting); a good hypothesis that we couldn’t find via optimisation also fails this pillar.

Both pillars are necessary. Either alone is useless:

| Closeness | Attainability | Outcome |
|-----------|---------------|---------|
| ✓ (E_in ≈ E_out) | ✓ (low E_in) | Learning succeeded — low test error |
| ✓ | ✗ (high E_in) | Underfitting — model too simple |
| ✗ (E_in ≪ E_out) | ✓ | Overfitting — memorised noise |
| ✗ | ✗ | Total failure |

The two pillars are in tension. Making H richer (more flexible) helps attainability — there’s a better hypothesis available — but hurts closeness, because more flexible models can fit training noise and pull E_in far below E_out. This is exactly the bias–variance / capacity-control trade-off; the sweet spot balances both pillars.
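The tension can be seen in a small experiment. In this sketch (assuming numpy; the sine target, noise level, and degree choices are invented for illustration), the nested hypothesis sets H_d are polynomials of degree d, fitted by least squares: raising d always lowers E_in, while E_out eventually climbs.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical target: a sine curve plus Gaussian noise.
def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

x_train, y_train = sample(15)                   # small training set
x_test, y_test = sample(10_000)                 # large proxy for the distribution

e_ins, e_outs = [], []
for d in (1, 3, 9, 12):
    coef = np.polyfit(x_train, y_train, d)      # least-squares fit within H_d
    e_in = np.mean((np.polyval(coef, x_train) - y_train) ** 2)
    e_out = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    e_ins.append(e_in)
    e_outs.append(e_out)
    print(f"degree {d:2d}: E_in = {e_in:.3f}  E_out = {e_out:.3f}")
```

Because the hypothesis sets are nested (H_1 ⊂ H_3 ⊂ …), E_in can only decrease with d; the widening E_in/E_out gap at high degree is the closeness pillar failing.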

Connection to Occam’s Razor

Simpler hypotheses — those from smaller hypothesis sets — are less likely to be fitting noise and more likely to generalize. This is an informal statement of what VC theory and regularization make precise later in the module.

Active Recall