A high-probability upper bound on the test error in terms of the training error and a model-complexity term. With probability at least $1 - \delta$ over the training set, $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$, where $M$ is the hypothesis-set size. The first time formal generalization theory says anything useful — and the source of the bias-variance / complexity trade-off.
The Bound
Combining Hoeffding’s inequality with the union bound over a hypothesis set $\mathcal{H} = \{h_1, \dots, h_M\}$, then rearranging:

$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$$

with probability at least $1 - \delta$ over the random draw of the training set. There’s also a matching lower bound:

$$E_{\text{out}}(g) \;\ge\; E_{\text{in}}(g) - \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}},$$

so $E_{\text{out}}$ is bounded on both sides.
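A minimal sketch of the bound in Python (the function names here are my own, not from the source):

```python
import math

def generalization_gap(N, M, delta):
    """Hoeffding + union-bound gap term: sqrt(ln(2M/delta) / (2N))."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

def error_interval(e_in, N, M, delta):
    """Two-sided interval E_in +/- gap that contains E_out with probability >= 1 - delta."""
    eps = generalization_gap(N, M, delta)
    return e_in - eps, e_in + eps

# Example: 100 hypotheses, 1000 training points, 95% confidence.
lo, hi = error_interval(e_in=0.05, N=1000, M=100, delta=0.05)
print(f"E_out in [{lo:.3f}, {hi:.3f}]")  # roughly [-0.014, 0.114]
```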
The right-hand side has two pieces:
- Training error $E_{\text{in}}(g)$ — what you measure on training data. Want it small.
- Generalization gap $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ — how much test error can exceed training error. Want it small.
Reading the Bound
Each variable in the bound has a clear role:
| Variable | Effect on the bound | What you control |
|---|---|---|
| $N$ — training-set size | More $N$ → smaller gap | Get more data |
| $M$ — hypothesis-set size | Larger $M$ → larger gap | Choose simpler models |
| $\delta$ — confidence tolerance | Smaller $\delta$ (more confidence) → larger gap | Trade-off you set |
| $E_{\text{in}}$ — training error | Smaller → tighter bound | Train better |
The square root means halving the gap requires quadrupling $N$. Improving $\delta$ is logarithmic — cheap. Adding hypotheses (increasing $M$) is also logarithmic.
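A quick numeric illustration of these scalings (the specific numbers are toy values of my own, not from the source):

```python
import math

def gap(N, M, delta):
    # sqrt(ln(2M/delta) / (2N)), the generalization-gap term from the bound
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

base = gap(N=400, M=100, delta=0.05)
print(round(gap(1600, 100, 0.05) / base, 2))    # 0.5:   4x the data halves the gap
print(round(gap(400, 10_000, 0.05) / base, 2))  # ~1.25: 100x the hypotheses barely widens it
print(round(gap(400, 100, 0.005) / base, 2))    # ~1.13: 10x smaller delta barely widens it
```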
The Two Central Questions
The bound splits the learning problem into two:
- Can $E_{\text{out}}$ be close to $E_{\text{in}}$? — the generalisation question. Favours small $M$ (simple models with few hypotheses).
- Can $E_{\text{in}}$ be small? — the fitting question. Favours large $M$ (complex models with many hypotheses).
These pull in opposite directions:
| $M$ | Question 1 (gap) | Question 2 (fit) |
|---|---|---|
| Small $M$ | ✓ Small | ✗ Hypothesis set too poor to fit |
| Large $M$ | ✗ Large | ✓ Many options, can fit well |
The right $M$ balances them. This is the formal source of the bias-variance / model-complexity trade-off — too simple → high bias, too complex → high variance.
The $M = \infty$ Problem
The bound is useless when $M = \infty$, which happens for any continuously parameterised hypothesis set — including linear regression, all SVMs, all neural networks. Naively, the bound says nothing about generalisation for these models.
Fix: replace $M$ with a “complexity” measure that’s finite even when $|\mathcal{H}| = \infty$. The intuition: many hypotheses in $\mathcal{H}$ are nearly identical from the perspective of a fixed training set. When two hypotheses agree on every training point, treating them as separate hypotheses in the union bound is wasteful — the corresponding “bad events” overlap massively.
The recipe (preview, formalised in later weeks):
- Define a dichotomy as a hypothesis “as seen by” a specific set of $N$ inputs — i.e., the labelling pattern it produces.
- The number of distinct dichotomies is at most $2^N$ for binary classification — finite for any finite $N$.
- For “well-behaved” $\mathcal{H}$ (e.g., lines in $\mathbb{R}^2$), this count is much smaller than $2^N$, and grows polynomially in $N$ rather than exponentially.
- Replace $M$ with this growth function $m_{\mathcal{H}}(N)$.
For lines in $\mathbb{R}^2$, the counts are:
| $N$ | Max dichotomies (effective $M$) | $2^N$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 (or 6 if collinear) | 8 |
| 4 | 14 (not 16) | 16 |
| 5 | 22 | 32 |
At $N = 4$, “lines in $\mathbb{R}^2$” can no longer realise every possible labelling. This restriction is what saves us — the VC dimension, which formalises this idea, is the topic of upcoming weeks.
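A rough empirical check of this table, assuming only the definition of a dichotomy above: sample random lines and count the distinct labellings they produce on $N$ random points. Random search can only miss dichotomies, so read the output as a lower bound on the effective $M$ (the point set and weight ranges here are arbitrary choices of mine):

```python
import random

def count_line_dichotomies(points, n_lines=200_000, seed=0):
    """Count distinct labellings sign(w0 + w1*x + w2*y) produced on `points`
    by randomly sampled lines. Undercounts if sampling misses a dichotomy."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_lines):
        w0, w1, w2 = (rng.uniform(-1, 1) for _ in range(3))
        seen.add(tuple(w0 + w1 * x + w2 * y > 0 for x, y in points))
    return len(seen)

rng = random.Random(1)
for N in range(1, 6):
    pts = [(rng.random(), rng.random()) for _ in range(N)]
    print(N, count_line_dichotomies(pts), 2 ** N)  # expect <= 2, 4, 8, 14, 22 vs 2, 4, 8, 16, 32
```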
What Bounds Imply
- Worst-case guarantee. $E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ tells you the worst test error you might see. If you can manage that, you’re safe.
- Non-vacuous bounds are hard. For deep networks (huge effective $M$), the simple Hoeffding bound is essentially $E_{\text{out}} \le 1$ — useless. Practitioners rely on cross-validation rather than theoretical bounds for small-sample estimates of $E_{\text{out}}$.
- The bound is not tight. Real models often generalise far better than the bound predicts. The bound is a sufficient condition, not a description of empirical reality.
Worked Example
For $M = 100$, $\delta = 0.05$, $\epsilon = 0.1$:

$$N \;\ge\; \frac{1}{2\epsilon^2}\ln\frac{2M}{\delta} \;=\; \frac{1}{2(0.1)^2}\ln\frac{200}{0.05} \;=\; 50\ln 4000 \;\approx\; 415.$$

So with 415 examples, we have 95% confidence ($\delta = 0.05$) that $E_{\text{out}}$ is within 10% of $E_{\text{in}}$ for any hypothesis chosen from a 100-element set.
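The same arithmetic as a quick check (a two-line Python computation of the rearranged bound):

```python
import math

M, delta, eps = 100, 0.05, 0.1
print(math.ceil(math.log(2 * M / delta) / (2 * eps ** 2)))  # 415
```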
What Could Go Wrong
- Distribution shift. The bound assumes training and test data are i.i.d. from the same distribution. Real-world deployments usually involve some shift.
- Vacuous bounds for complex models. The effective $M$ for deep networks is astronomical; the bound says nothing useful. Tighter notions (Rademacher complexity, PAC-Bayes, VC dimension) are needed.
- Interpreting $\epsilon$ as a probability. $\epsilon$ is an error margin, not a probability. The probability is $1 - \delta$.
- Multiple comparisons. If you tune hyperparameters by cross-validation, you’re effectively choosing among more hypotheses — increase $M$ accordingly when interpreting the bound.
Connections
- hoeffding-inequality — the per-hypothesis bound that, after a union bound, becomes this generalisation bound.
- ridge-regression — one way to control effective $M$ (regularisation tightens the hypothesis set).
- non-linear-transformation — basis expansion increases effective $M$, raising the gap term.
- support-vector-machine — margin-based bounds replace $M$ with margin-related quantities; SVMs work even though $M = \infty$.
Active Recall
Why does the bound contain $\ln M$ rather than $M$ itself?
Because the union bound multiplies the single-hypothesis probability by $M$: $P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$. The base probability is $2e^{-2\epsilon^2 N}$. Setting $\delta = 2Me^{-2\epsilon^2 N}$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$. The $\ln$ appears because we’re inverting an exponential.
A model has $N = 1000$, $M = 100$, $\delta = 0.05$, $E_{\text{in}} = 0.05$. What's the upper bound on $E_{\text{out}}$?
$E_{\text{out}} \le 0.05 + \sqrt{\frac{1}{2000}\ln\frac{200}{0.05}} \approx 0.05 + 0.064 = 0.114$. So $E_{\text{out}} \le 0.114$ with 95% confidence. Test error could be up to ~11% even though training error is 5%.
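The same computation in Python, using the values assumed in the question above:

```python
import math

e_in, N, M, delta = 0.05, 1000, 100, 0.05
print(round(e_in + math.sqrt(math.log(2 * M / delta) / (2 * N)), 3))  # 0.114
```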
Explain why the bound is useless for $M = \infty$, and what's the standard fix.
When $M$ is infinite, $\ln M$ is infinite, so the bound says $E_{\text{out}}$ could be arbitrarily larger than $E_{\text{in}}$ — meaningless. The fix is to replace $M$ with the number of distinct dichotomies — labellings that $\mathcal{H}$ can produce on a specific training set of $N$ points. This count is at most $2^N$ (finite for any finite $N$), and for “structured” hypothesis sets (like lines in $\mathbb{R}^2$) grows polynomially in $N$ rather than exponentially. The formal version of this is the VC dimension.