A high-probability upper bound on the test error $E_{\text{out}}$ in terms of the training error $E_{\text{in}}$ and a model-complexity term. With probability at least $1-\delta$ over the training set, $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$, where $M$ is the hypothesis-set size. The first time formal generalisation theory says anything useful, and the source of the bias-variance / complexity trade-off.

The Bound

Combining Hoeffding’s inequality with the union bound over a hypothesis set $\mathcal{H}$ of size $M$, then rearranging:

$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$$

with probability at least $1-\delta$ over the random draw of the training set. There’s also a matching lower bound:

$$E_{\text{out}}(g) \;\ge\; E_{\text{in}}(g) - \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$$

so $E_{\text{out}}$ is bounded on both sides.
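To make the formula concrete, here is a minimal Python sketch (the function name `eout_upper_bound` and the example numbers are illustrative, not from the source):

```python
import math

def eout_upper_bound(e_in, N, M, delta):
    """Upper bound on E_out: E_in + sqrt(ln(2M/delta) / (2N)).

    Holds with probability at least 1 - delta over the draw of the
    N training examples, for a hypothesis chosen from a set of size M.
    """
    return e_in + math.sqrt(math.log(2 * M / delta) / (2 * N))

# Illustrative numbers: E_in = 0.05 on N = 1000 examples, M = 100 hypotheses
print(eout_upper_bound(0.05, N=1000, M=100, delta=0.05))  # ~0.114
```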

The right-hand side has two pieces:

  1. Training error $E_{\text{in}}(g)$ — what you measure on training data. Want it small.
  2. Generalization gap $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ — how much test error can exceed training error. Want it small.

Reading the Bound

Each variable in the bound has a clear role:

| Variable | Effect on the bound | What you control |
|---|---|---|
| $N$ (training-set size) | More → smaller gap | Get more data |
| $M$ (hypothesis-set size) | Larger → larger gap | Choose simpler models |
| $\delta$ (confidence tolerance) | Smaller (more confidence) → larger gap | Trade-off you set |
| $E_{\text{in}}$ (training error) | Smaller → tighter bound | Train better |

The square root means halving the gap requires quadrupling $N$. Improving $\delta$ is logarithmic, so extra confidence is cheap. Adding hypotheses (increasing $M$) is also logarithmic.
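A quick numerical check of these scaling claims (a sketch; `gap` is my name for the square-root term, not from the source):

```python
import math

def gap(N, M, delta):
    """The generalisation-gap term sqrt(ln(2M/delta) / (2N))."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

M, delta = 100, 0.05
for N in (100, 400, 1600):                 # quadrupling N halves the gap
    print(N, round(gap(N, M, delta), 3))   # 0.204, 0.102, 0.051
print(round(gap(400, 10 * M, delta), 3))   # 0.115: 10x hypotheses, small change
```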

The Two Central Questions

The bound splits the learning problem into two:

  1. Can $E_{\text{out}}$ be close to $E_{\text{in}}$? The generalisation question. Favours small $M$ (simple models with few hypotheses).
  2. Can $E_{\text{in}}$ be small? The fitting question. Favours large $M$ (complex models with many hypotheses).

These pull in opposite directions:

| | Question 1 (gap) | Question 2 (fit) |
|---|---|---|
| Small $M$ | ✓ Small gap | ✗ Hypothesis set too poor to fit |
| Large $M$ | ✗ Large gap | ✓ Many options, can fit well |

The right $M$ balances them. This is the formal source of the bias-variance / model-complexity trade-off: too simple → high bias, too complex → high variance.

The Problem

The bound is useless when $M = \infty$, which happens for any continuously parameterised hypothesis set, including linear regression, all SVMs, and all neural networks. Naively, the bound says nothing about generalisation for these models.

Fix: replace $M$ with a “complexity” measure that’s finite even when $M = \infty$. The intuition: many hypotheses in $\mathcal{H}$ are nearly identical from the perspective of a fixed training set. When two hypotheses agree on every training point, treating them as separate hypotheses in the union bound is wasteful; the corresponding “bad events” overlap massively.

The recipe (preview, formalised in later weeks):

  1. Define a dichotomy as a hypothesis “as seen by” a specific set of $N$ inputs — i.e., the labelling pattern $(h(x_1), \ldots, h(x_N))$ it produces.
  2. The number of distinct dichotomies is at most $2^N$ for binary classification, which is finite for any finite $N$.
  3. For “well-behaved” $\mathcal{H}$ (e.g., lines in $\mathbb{R}^2$), this count is much smaller than $2^N$, and grows polynomially in $N$ rather than exponentially.
  4. Replace $M$ with this growth function $m_{\mathcal{H}}(N)$.

For lines in $\mathbb{R}^2$, the counts are:

| $N$ | Max dichotomies (effective $M$) | $2^N$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 (or 6 if collinear) | 8 |
| 4 | 14 (not 16) | 16 |
| 5 | 22 | 32 |

At $N = 4$, “lines in $\mathbb{R}^2$” can no longer realise every possible labelling. This restriction is what saves us; the VC dimension, which formalises this idea, is the topic of upcoming weeks.
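As a sanity check on the table, here is a small randomised sketch that estimates how many dichotomies lines in $\mathbb{R}^2$ realise on a fixed point set. Random sampling of lines can only undercount, so treat the result as a lower-bound check; the names here are mine, not from the source.

```python
import numpy as np

def count_line_dichotomies(points, trials=200_000, seed=0):
    """Estimate how many distinct +/-1 labellings of `points` are
    realisable as sign(w . x + b) for some line (w, b) in R^2."""
    rng = np.random.default_rng(seed)
    seen = set()
    for _ in range(trials):
        w = rng.normal(size=2)
        b = rng.normal()
        labels = np.sign(points @ w + b)
        if 0.0 not in labels:              # skip points exactly on the line
            seen.add(tuple(labels.astype(int)))
    return len(seen)

# Four points in general position: only 14 of the 2^4 = 16 labellings
# appear; the two XOR patterns are impossible for a single line.
square = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(count_line_dichotomies(square))  # 14
```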

What Bounds Imply

  • Worst-case guarantee. $E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ tells you the worst test error you might see. If you can manage that, you’re safe.
  • Non-vacuous bounds are hard. For deep networks (huge effective $M$), the gap term in the simple Hoeffding bound exceeds 1, so the bound is vacuous. Practitioners rely on cross-validation rather than theoretical bounds for small-sample estimates of $E_{\text{out}}$.
  • The bound is not tight. Real models often generalise far better than the bound predicts. The bound is a sufficient condition, not a description of empirical reality.

Worked Example

For $M = 100$, $\delta = 0.05$, $\varepsilon = 0.1$:

$$N \;\ge\; \frac{1}{2\varepsilon^2}\ln\frac{2M}{\delta} \;=\; \frac{1}{2(0.1)^2}\ln\frac{200}{0.05} \;=\; 50\ln 4000 \;\approx\; 415$$

So with 415 examples, we have 95% confidence ($\delta = 0.05$) that $E_{\text{out}}$ is within $\varepsilon = 0.1$ (ten percentage points) of $E_{\text{in}}$ for any hypothesis chosen from a 100-element set.
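The same arithmetic in code (a one-off sketch; variable names are mine):

```python
import math

M, delta, eps = 100, 0.05, 0.1
# Smallest N with sqrt(ln(2M/delta) / (2N)) <= eps
N = math.ceil(math.log(2 * M / delta) / (2 * eps ** 2))
print(N)  # 415
```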

What Could Go Wrong

  • Distribution shift. The bound assumes training and test data are i.i.d. from the same distribution. Real-world deployments usually involve some shift.
  • Vacuous bounds for complex models. $M$ for deep networks is astronomical; the bound says nothing useful. Tighter notions (Rademacher complexity, PAC-Bayes, VC dimension) are needed.
  • Interpreting $\varepsilon$ as a probability. $\varepsilon$ is an error margin, not a probability. The probability is $1 - \delta$.
  • Multiple comparisons. If you tune hyperparameters by cross-validation, you’re effectively choosing among more hypotheses; increase $M$ accordingly when interpreting the bound.
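For instance, if a search tries $K$ hyperparameter configurations, the effective $M$ grows by a factor of $K$. A sketch of the (logarithmic, hence modest) effect on the gap, with illustrative numbers:

```python
import math

def gap(N, M, delta):
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

N, M, delta, K = 1000, 100, 0.05, 50   # illustrative numbers
print(round(gap(N, M, delta), 3))      # 0.064: one 100-hypothesis set
print(round(gap(N, K * M, delta), 3))  # 0.078: after tuning over 50 configs
```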

Connections

  • hoeffding-inequality — the per-hypothesis bound that, after a union bound, becomes this generalisation bound.
  • ridge-regression — one way to control effective $M$ (regularisation tightens the hypothesis set).
  • non-linear-transformation — basis expansion increases effective $M$, raising the gap term.
  • support-vector-machine — margin-based bounds replace $M$ with margin-related quantities; SVMs work even though $M = \infty$.

Active Recall