A high-probability upper bound on the test error in terms of the training error and a model-complexity term. With probability at least $1 - \delta$ over the training set, $E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$, where $M$ is the hypothesis-set size. The first time formal generalization theory says anything useful — and the source of the bias-variance / complexity trade-off.
The Bound
Combining Hoeffding’s inequality with the union bound over a hypothesis set $\mathcal{H} = \{h_1, \dots, h_M\}$, then rearranging:

$$E_{\text{out}}(g) \;\le\; E_{\text{in}}(g) + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$$

with probability at least $1 - \delta$ over the random draw of the training set. There’s also a matching lower bound:

$$E_{\text{out}}(g) \;\ge\; E_{\text{in}}(g) - \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}},$$

so $E_{\text{out}}$ is bounded on both sides.
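A minimal sketch of the bound in Python (the function names here are my own, not from the source):

```python
import math

def generalization_gap(N, M, delta):
    """Hoeffding + union-bound gap term: sqrt(ln(2M/delta) / (2N))."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

def error_interval(e_in, N, M, delta):
    """Two-sided interval E_in +/- gap that contains E_out with probability >= 1 - delta."""
    eps = generalization_gap(N, M, delta)
    return e_in - eps, e_in + eps

# Example: 100 hypotheses, 1000 training points, 95% confidence.
lo, hi = error_interval(e_in=0.05, N=1000, M=100, delta=0.05)
print(f"E_out in [{lo:.3f}, {hi:.3f}]")  # roughly [-0.014, 0.114]
```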
The right-hand side has two pieces:
- Training error $E_{\text{in}}(g)$ — what you measure on training data. Want it small.
- Generalization gap $\sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ — how much test error can exceed training error. Want it small.
Reading the Bound
Each variable in the bound has a clear role:
| Variable | Effect on the bound | What you control |
|---|---|---|
| $N$ — training-set size | More $N$ → smaller gap | Get more data |
| $M$ — hypothesis-set size | Larger $M$ → larger gap | Choose simpler models |
| $\delta$ — confidence tolerance | Smaller $\delta$ (more confidence) → larger gap | Trade-off you set |
| $E_{\text{in}}$ — training error | Smaller → tighter bound | Train better |
The square root means halving the gap requires quadrupling $N$. Improving $\delta$ is logarithmic — cheap. Adding hypotheses (increasing $M$) is also logarithmic.
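A quick numeric illustration of these scalings (the specific numbers are toy values of my own, not from the source):

```python
import math

def gap(N, M, delta):
    # sqrt(ln(2M/delta) / (2N)), the generalization-gap term from the bound
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

base = gap(N=400, M=100, delta=0.05)
print(round(gap(1600, 100, 0.05) / base, 2))    # 0.5:   4x the data halves the gap
print(round(gap(400, 10_000, 0.05) / base, 2))  # ~1.25: 100x the hypotheses barely widens it
print(round(gap(400, 100, 0.005) / base, 2))    # ~1.13: 10x smaller delta barely widens it
```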
The Two Central Questions
The bound splits the learning problem into two:
- Can $E_{\text{out}}$ be close to $E_{\text{in}}$? — the generalisation question. Favours small $M$ (simple models with few hypotheses).
- Can $E_{\text{in}}$ be small? — the fitting question. Favours large $M$ (complex models with many hypotheses).
These pull in opposite directions:
| $M$ | Question 1 (gap) | Question 2 (fit) |
|---|---|---|
| Small $M$ | ✓ Small | ✗ Hypothesis set too poor to fit |
| Large $M$ | ✗ Large | ✓ Many options, can fit well |
The right $M$ balances them. This is the formal source of the bias-variance / model-complexity trade-off — too simple → high bias, too complex → high variance.
The $M = \infty$ Problem
The bound is useless when $M = \infty$, which happens for any continuously parameterised hypothesis set — including linear regression, all SVMs, all neural networks. Naively, the bound says nothing about generalisation for these models.
Fix: replace $M$ with a “complexity” measure that’s finite even when $|\mathcal{H}| = \infty$. The intuition: many hypotheses in $\mathcal{H}$ are nearly identical from the perspective of a fixed training set. When two hypotheses agree on every training point, treating them as separate hypotheses in the union bound is wasteful — the corresponding “bad events” overlap massively.
The recipe (preview, formalised in later weeks):
- Define a dichotomy as a hypothesis “as seen by” a specific set of $N$ inputs — i.e., the labelling pattern it produces.
- The number of distinct dichotomies is at most $2^N$ for binary classification — finite for any finite $N$.
- For “well-behaved” $\mathcal{H}$ (e.g., lines in $\mathbb{R}^2$), this count is much smaller than $2^N$, and grows polynomially in $N$ rather than exponentially.
- Replace $M$ with this growth function $m_{\mathcal{H}}(N)$.
For lines in $\mathbb{R}^2$, the counts are:
| $N$ | Max dichotomies (effective $M$) | $2^N$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 (or 6 if collinear) | 8 |
| 4 | 14 (not 16) | 16 |
| 5 | 22 | 32 |
At $N = 4$, “lines in $\mathbb{R}^2$” can no longer realise every possible labelling. This restriction is what saves us — the VC dimension, which formalises this idea, is the topic of upcoming weeks.
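A rough empirical check of this table, assuming only the definition of a dichotomy above: sample random lines and count the distinct labellings they produce on $N$ random points. Random search can only miss dichotomies, so read the output as a lower bound on the effective $M$ (the point set and weight ranges here are arbitrary choices of mine):

```python
import random

def count_line_dichotomies(points, n_lines=200_000, seed=0):
    """Count distinct labellings sign(w0 + w1*x + w2*y) produced on `points`
    by randomly sampled lines. Undercounts if sampling misses a dichotomy."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(n_lines):
        w0, w1, w2 = (rng.uniform(-1, 1) for _ in range(3))
        seen.add(tuple(w0 + w1 * x + w2 * y > 0 for x, y in points))
    return len(seen)

rng = random.Random(1)
for N in range(1, 6):
    pts = [(rng.random(), rng.random()) for _ in range(N)]
    print(N, count_line_dichotomies(pts), 2 ** N)  # expect <= 2, 4, 8, 14, 22 vs 2, 4, 8, 16, 32
```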
What Bounds Imply
- Worst-case guarantee. $E_{\text{in}} + \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$ tells you the worst test error you might see. If you can manage that, you’re safe.
- Non-vacuous bounds are hard. For deep networks (huge effective $M$), the simple Hoeffding bound is essentially $E_{\text{out}} \le 1$ — useless. Practitioners rely on cross-validation rather than theoretical bounds for small-sample estimates of $E_{\text{out}}$.
- The bound is not tight. Real models often generalise far better than the bound predicts. The bound is a sufficient condition, not a description of empirical reality.
Worked Example
For $M = 100$, $\delta = 0.05$, $\epsilon = 0.1$:

$$N \;\ge\; \frac{1}{2\epsilon^2}\ln\frac{2M}{\delta} \;=\; \frac{1}{2(0.1)^2}\ln\frac{200}{0.05} \;=\; 50\ln 4000 \;\approx\; 415.$$

So with 415 examples, we have 95% confidence ($\delta = 0.05$) that $E_{\text{out}}$ is within 10% of $E_{\text{in}}$ for any hypothesis chosen from a 100-element set.
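The same arithmetic as a quick check (a two-line Python computation of the rearranged bound):

```python
import math

M, delta, eps = 100, 0.05, 0.1
print(math.ceil(math.log(2 * M / delta) / (2 * eps ** 2)))  # 415
```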
What Could Go Wrong
- Distribution shift. The bound assumes training and test data are i.i.d. from the same distribution. Real-world deployments usually involve some shift.
- Vacuous bounds for complex models. The effective $M$ for deep networks is astronomical; the bound says nothing useful. Tighter notions (Rademacher complexity, PAC-Bayes, VC dimension) are needed.
- Interpreting $\epsilon$ as a probability. $\epsilon$ is an error margin, not a probability. The probability is $1 - \delta$.
- Multiple comparisons. If you tune hyperparameters by cross-validation, you’re effectively choosing among more hypotheses — increase $M$ accordingly when interpreting the bound.
Connections
- hoeffding-inequality — the per-hypothesis bound that, after a union bound, becomes this generalisation bound.
- ridge-regression — one way to control effective $M$ (regularisation tightens the hypothesis set).
- non-linear-transformation — basis expansion increases effective $M$, raising the gap term.
- support-vector-machine — margin-based bounds replace $M$ with margin-related quantities; SVMs work even though $M = \infty$.
Active Recall
Why does the bound contain $\ln M$ rather than $M$ itself?
Because the union bound multiplies the single-hypothesis probability by $M$: $P\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \le 2Me^{-2\epsilon^2 N}$. The base probability is $2e^{-2\epsilon^2 N}$. Setting $\delta = 2Me^{-2\epsilon^2 N}$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\frac{1}{2N}\ln\frac{2M}{\delta}}$. The $\ln$ appears because we’re inverting an exponential.
A model has $N = 1000$, $M = 100$, $\delta = 0.05$, $E_{\text{in}} = 0.05$. What's the upper bound on $E_{\text{out}}$?
$E_{\text{out}} \le 0.05 + \sqrt{\frac{1}{2000}\ln\frac{200}{0.05}} \approx 0.05 + 0.064 = 0.114$. So $E_{\text{out}} \le 0.114$ with 95% confidence. Test error could be up to ~11% even though training error is 5%.
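The same computation in Python, using the values assumed in the question above:

```python
import math

e_in, N, M, delta = 0.05, 1000, 100, 0.05
print(round(e_in + math.sqrt(math.log(2 * M / delta) / (2 * N)), 3))  # 0.114
```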
Explain why the bound is useless for $M = \infty$, and what's the standard fix.
When $M$ is infinite, $\ln M$ is infinite, so the bound says $E_{\text{out}}$ could be arbitrarily larger than $E_{\text{in}}$ — meaningless. The fix is to replace $M$ with the number of distinct dichotomies — labellings that $\mathcal{H}$ can produce on a specific training set of $N$ points. This count is at most $2^N$ (finite for any finite $N$), and for “structured” hypothesis sets (like lines in $\mathbb{R}^2$) grows polynomially in $N$ rather than exponentially. The formal version of this is the VC dimension.