THE CRUX: Last week's generalisation bound said that learning generalises if the hypothesis set is finite. But every model we actually use — perceptrons, SVMs, neural networks — has infinite $|\mathcal{H}|$, so naively the bound says nothing. (1) What is the right finite quantity to replace $M$ (the number of hypotheses) with, and how does it lead to a non-vacuous bound? (2) Worst-case bounds give pessimistic guarantees that hold uniformly over all distributions. Is there a complementary, average-case picture of generalisation that's more useful in practice?

The two halves of week 9 each answer one. (1) The fix is to count dichotomies — distinct labellings $\mathcal{H}$ produces on $N$ specific inputs — rather than distinct hypotheses. The maximum dichotomy count, the growth function $m_{\mathcal{H}}(N)$, is bounded by $2^N$. The VC dimension $d_{\mathrm{VC}}$ is the largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$ — the most points $\mathcal{H}$ can shatter. Finite $d_{\mathrm{VC}}$ implies polynomial $m_{\mathcal{H}}(N)$, the VC bound contracts as $N$ grows, and learning is feasible. For the perceptron in $\mathbb{R}^d$, $d_{\mathrm{VC}} = d + 1$ — the number of free parameters. (2) Switching from 0–1 loss to squared loss enables an exact decomposition of $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}]$ into bias and variance. Bias is how far the average hypothesis $\bar{g}$ is from the truth $f$; variance is how much individual hypotheses $g^{(\mathcal{D})}$ fluctuate across training sets. With noise, an additional irreducible term $\sigma^2$ appears. Complex models lower bias but raise variance — the formal trade-off in choosing model complexity, and the average-case companion to VC's worst-case bound.


Part 1: Replacing $M$ with Something Finite

The Problem

The generalisation bound in week 8 ended with a problem: it depends on $M = |\mathcal{H}|$, the number of hypotheses. For finite hypothesis sets the bound is

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2M}{\delta}} \quad \text{with probability at least } 1 - \delta$$

— useful and informative. But every interesting model has $M = \infty$: linear classifiers in $\mathbb{R}^d$ have a continuum of weight vectors, and so do SVMs and neural networks. Plugging in $M = \infty$ gives a vacuous bound that says nothing. Are infinite hypothesis sets simply doomed?

The answer is no, and the fix is conceptually beautiful: most of the $M$ in the union bound is double-counting. Two hypotheses that label the training set identically have almost identical "bad events" — the events that their training and test errors disagree by more than $\epsilon$. Treating them as two events rather than one wastes half the budget.

Dichotomies and the Growth Function

The right quantity is what hypotheses look like on a fixed training set, not what they look like as full functions on $\mathcal{X}$. A dichotomy is a labelling pattern $(h(x_1), \dots, h(x_N))$ that some $h \in \mathcal{H}$ produces on $N$ specific inputs. Two hypotheses with different parameters but identical labellings on the training set are the same dichotomy.

The number of distinct dichotomies on $N$ specific inputs is at most $2^N$ — finite even when $\mathcal{H}$ is infinite. To remove dependence on the specific input choice, take the maximum over all input sets:

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N} \big|\{(h(x_1), \dots, h(x_N)) : h \in \mathcal{H}\}\big|$$

This is the growth function $m_{\mathcal{H}}(N)$ — a property of $\mathcal{H}$ alone, distribution-free, algorithm-independent, and bounded by $2^N$.

Examples

Let's compute $m_{\mathcal{H}}(N)$ for some hypothesis sets:

  • Positive rays on $\mathbb{R}$, $h(x) = \mathrm{sign}(x - a)$: $m_{\mathcal{H}}(N) = N + 1$
  • Positive intervals, $h(x) = +1$ iff $a \le x \le b$: $m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1$
  • 2D perceptrons (lines in $\mathbb{R}^2$), with $m_{\mathcal{H}}(3) = 8$ but $m_{\mathcal{H}}(4) = 14 < 16$
  • Convex sets in $\mathbb{R}^2$: $m_{\mathcal{H}}(N) = 2^N$ (always)

The first three grow polynomially. The last is exponential — and it stays at $2^N$ forever, no matter how large $N$ gets.
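To make dichotomy counting concrete, here is a minimal brute-force sketch (my own illustration, not from the source) that enumerates the labellings positive rays and positive intervals can produce on $N$ points and checks the counts against the formulas above:

```python
import itertools

def dichotomies_positive_rays(xs):
    """Distinct labellings of xs achievable by h(x) = +1 iff x >= a."""
    cuts = [min(xs) - 1.0] + [x + 1e-9 for x in xs]     # one threshold per gap
    return {tuple(+1 if x >= a else -1 for x in xs) for a in cuts}

def dichotomies_positive_intervals(xs):
    """Distinct labellings of xs achievable by h(x) = +1 iff a <= x <= b."""
    cuts = [min(xs) - 1.0] + [x + 1e-9 for x in xs]
    return {tuple(+1 if a <= x <= b else -1 for x in xs)
            for a, b in itertools.product(cuts, repeat=2)}

for N in (3, 5, 8):
    xs = [i + 0.5 for i in range(N)]                     # any N distinct points
    rays = len(dichotomies_positive_rays(xs))
    ivals = len(dichotomies_positive_intervals(xs))
    print(f"N={N}: rays {rays} (expect {N + 1}), "
          f"intervals {ivals} (expect {N * (N + 1) // 2 + 1}), vs 2^N = {2 ** N}")
```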

Shattering, Break Points, and VC Dimension

If $m_{\mathcal{H}}(N) = 2^N$, we say $\mathcal{H}$ shatters some $N$-point input set: it can produce every conceivable labelling of those points. A break point is the smallest $k$ at which shattering fails, $m_{\mathcal{H}}(k) < 2^k$. Once $\mathcal{H}$ has a break point at $k$, shattering also fails for every larger $N$.

The VC dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the largest $N$ at which $\mathcal{H}$ can still shatter some $N$-point set — equivalently, one less than the break point.

ASIDE — Why "break point" is the right name

Imagine sliding $N$ from 1 upward. Initially $m_{\mathcal{H}}(N)$ keeps up with the explosion of binary labellings ($2^N$). At some specific $N = k$ — the break point — the hypothesis set "breaks", unable to keep producing every labelling. Once it breaks, it stays broken: the structural restriction propagates to all larger $N$. The break point marks the moment the hypothesis set's geometry kicks in.

For 2D perceptrons: $m_{\mathcal{H}}(3) = 8 = 2^3$ (any 3 non-collinear points are shattered) and $m_{\mathcal{H}}(4) = 14 < 16$ (the XOR labelling on a square is impossible). Break point 4, $d_{\mathrm{VC}} = 3$.
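These two facts are easy to verify by brute force. The sketch below (my own, assuming scipy is available) checks each of the $2^N$ labellings for linear separability by posing $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ as an LP feasibility problem:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?  (LP feasibility check)"""
    pts = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Rewrite as A_ub @ [w1, w2, b] <= -1, with rows A_ub = -y_i * [x_i1, x_i2, 1].
    A_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(pts)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    return all(linearly_separable(points, labs)
               for labs in itertools.product([-1, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0, 1)]            # 3 non-collinear points
square = [(0, 0), (1, 0), (0, 1), (1, 1)]      # the XOR labelling lives here
print("3 points shattered:", shattered(triangle))   # True  -> m_H(3) = 8
print("4 points shattered:", shattered(square))     # False -> break point at k = 4
```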

The Polynomial Bound

The deep fact: a single break point forces $m_{\mathcal{H}}(N)$ to grow polynomially. Concretely (Sauer–Shelah):

$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i} \le N^{d_{\mathrm{VC}}} + 1$$

This is the moment the entire learning-theory story closes. We started with an exponential $2^N$ that ruined the bound; we end with a polynomial of degree $d_{\mathrm{VC}}$. The VC bound becomes

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\tfrac{8}{N}\ln\tfrac{4\, m_{\mathcal{H}}(2N)}{\delta}} \quad \text{with probability at least } 1 - \delta,$$

and with polynomial $m_{\mathcal{H}}$ the penalty term shrinks to $0$ as $N \to \infty$. Finite VC dimension ⇒ learning generalises.
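A few numbers make the contrast vivid. This is just an illustrative evaluation of the Sauer–Shelah bound above (with $d_{\mathrm{VC}} = 3$, as for 2D perceptrons):

```python
from math import comb

def growth_bound(N, d_vc):
    """Sauer-Shelah: m_H(N) <= sum_{i=0}^{d_vc} C(N, i)."""
    return sum(comb(N, i) for i in range(d_vc + 1))

for N in (10, 100, 1000):
    # 2^N written as a power of ten for comparison (2^N ~ 10^(0.301 N)).
    print(f"N={N:5d}: polynomial bound {growth_bound(N, 3):>12,}  vs  2^N ~ 1e{0.301 * N:.0f}")
```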

The Perceptron Result

For the perceptron in $\mathbb{R}^d$ with bias (inputs $\mathbf{x} = (1, x_1, \dots, x_d)$):

$$d_{\mathrm{VC}} = d + 1$$

The "+1" is the bias term. So the VC dimension equals the number of free parameters — $d$ slope coefficients plus 1 intercept. For "smooth" parameterised hypothesis sets, this rough equivalence (capacity ≈ parameter count) holds approximately.
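The "can shatter $d+1$ points" half of that claim has a one-line constructive proof: take the origin plus the $d$ standard basis vectors; the augmented data matrix is invertible, so every labelling is realised exactly. A sketch of the lower bound $d_{\mathrm{VC}} \ge d + 1$ (my own illustration; the matching upper bound needs a separate argument):

```python
import itertools
import numpy as np

d = 4
points = np.vstack([np.zeros(d), np.eye(d)])          # origin + d basis vectors
X = np.hstack([np.ones((d + 1, 1)), points])          # prepend the bias coordinate

# X is invertible, so for every labelling y there is (b, w) with X @ [b, w] = y exactly.
shattered = all(
    np.array_equal(np.sign(X @ np.linalg.solve(X, np.asarray(y))), np.asarray(y))
    for y in itertools.product([-1.0, 1.0], repeat=d + 1)
)
print(f"perceptron in R^{d} shatters {d + 1} points:", shattered)   # True -> d_vc >= d + 1
```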

The Sample Complexity Verdict

Inverting the VC bound gives a sample-complexity estimate: $N \ge \tfrac{8}{\epsilon^2}\ln\!\big(\tfrac{4\,((2N)^{d_{\mathrm{VC}}} + 1)}{\delta}\big)$. This is implicit in $N$ (it appears on both sides), so iterate.

For $\epsilon = 0.1$, $\delta = 0.1$, $d_{\mathrm{VC}} = 3$: the iteration converges to $N \approx 30{,}000$. For $d_{\mathrm{VC}} = 4$: $N \approx 40{,}000$. The pattern: $N \approx 10{,}000 \cdot d_{\mathrm{VC}}$ in theory.
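A minimal sketch of that fixed-point iteration, assuming the bound form quoted above:

```python
import math

def sample_complexity(d_vc, eps=0.1, delta=0.1, n0=1_000.0, iters=50):
    """Fixed-point iteration of N = (8/eps^2) * ln(4 * ((2N)^d_vc + 1) / delta)."""
    N = n0
    for _ in range(iters):
        N = (8 / eps**2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return N

for d_vc in (3, 4, 5):
    print(f"d_vc = {d_vc}: N ~ {sample_complexity(d_vc):,.0f}")   # roughly 10,000 per unit of VC dimension
```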

In practice, the bound is loose. Rule of thumb: $N \ge 10\, d_{\mathrm{VC}}$ is usually enough. Real distributions are far more benign than worst-case theory presumes — the bound's distribution-free guarantee comes at a heavy quantitative price.

Margins, Fat Hyperplanes, and Why SVMs Generalise

A perceptron's $d_{\mathrm{VC}}$ scales with the input dimension, which becomes large after polynomial basis expansion (a degree-$Q$ transform in $d$ variables produces $\binom{d+Q}{Q}$ features). Naively, kernelised SVMs have astronomical $d_{\mathrm{VC}}$ — yet they generalise well. Why?

Because the SVM doesn't use all hyperplanes — only those with margin at least $\rho$ on the data. Restricting to fat hyperplanes shrinks the hypothesis set, which shrinks the VC dimension:

$$d_{\mathrm{VC}}(\text{margin-}\rho\text{ hyperplanes}) \le \left\lceil \tfrac{R^2}{\rho^2} \right\rceil + 1,$$

where $R$ is the radius of a ball containing the data. This is independent of the feature-space dimension — the kernel feature space can have infinite dimension without harming generalisation. The SVM's geometric criterion (maximise the margin) is exactly the criterion that controls capacity. This is the formal version of "SVMs work because of the margin."
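Because the bound is pure geometry, a tiny calculation makes the point (illustrative numbers, not from the source): halving the margin quadruples the capacity bound, and the feature-space dimension never enters.

```python
import math

def fat_hyperplane_dvc_bound(R, margin):
    """Capacity bound ceil(R^2 / margin^2) + 1 for margin-rho separators of data in a ball of radius R."""
    return math.ceil(R**2 / margin**2) + 1

for margin in (1.0, 0.5, 0.1):
    print(f"R = 1, margin = {margin}: d_vc <= {fat_hyperplane_dvc_bound(1.0, margin)}")
```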

Part 2: An Average-Case View

Why a Different Decomposition?

The VC bound is worst-case and uniform over distributions. It tells you: even for the worst distribution and the unluckiest dataset, $E_{\mathrm{out}}$ is at most $E_{\mathrm{in}}$ plus the VC penalty term. That's a strong guarantee — and a pessimistic one.

A complementary perspective asks: what is the typical $E_{\mathrm{out}}$ when training sets are drawn from a fixed distribution? This is an average over datasets $\mathcal{D}$ rather than a worst case, and (with squared error rather than 0–1 loss) admits a clean closed-form decomposition.

The Decomposition

Let $g^{(\mathcal{D})}$ be the hypothesis fit to dataset $\mathcal{D}$, and let

$$\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(\mathbf{x})\big]$$

be the average hypothesis — what you'd get by training on infinitely many independently-drawn datasets and averaging. Inserting $\bar{g}$ into the squared error and expanding:

$$\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x}))^2\big] = \underbrace{\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x}))^2\big]}_{\mathrm{var}(\mathbf{x})} + \underbrace{(\bar{g}(\mathbf{x}) - f(\mathbf{x}))^2}_{\mathrm{bias}(\mathbf{x})}$$

The cross term vanishes because $\mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})] = 0$ by definition of $\bar{g}$. Taking expectation over $\mathbf{x}$ as well gives the bias–variance decomposition:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\mathrm{out}}(g^{(\mathcal{D})})\big] = \mathrm{bias} + \mathrm{var}$$

With label noise $y = f(\mathbf{x}) + \epsilon$, $\mathbb{E}[\epsilon] = 0$, $\mathbb{E}[\epsilon^2] = \sigma^2$, an extra irreducible floor appears:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\mathrm{out}}(g^{(\mathcal{D})})\big] = \sigma^2 + \mathrm{bias} + \mathrm{var}$$

What Bias and Variance Mean

  • Bias. How far from the truth $f$ is the typical hypothesis the learning process produces? Low if $\mathcal{H}$ contains hypotheses near $f$ and the algorithm finds them on average. High if $\mathcal{H}$ is structurally too simple — even averaging over infinitely many datasets, you can't approximate $f$.
  • Variance. How much do individual trained hypotheses $g^{(\mathcal{D})}$ fluctuate across training sets? Low if the algorithm is stable. High if it's overfitting random structure in each training set.

The classic dart-board picture: the bullseye is the truth, each dart is a learned hypothesis trained on a different dataset. Bias is the offset of the average dart from the bullseye; variance is the spread.

The Trade-Off, Concretely

Target $f(x) = \sin(\pi x)$ on $x \in [-1, 1]$, two sample points per training set, two hypothesis sets:

  • $\mathcal{H}_0$ — constant (horizontal) lines $h(x) = b$. Best fit: the average of the two $y$-values.
  • $\mathcal{H}_1$ — sloped lines $h(x) = ax + b$. Fit: the unique line through the two points.

Compute bias and variance over many random pairs of samples:

Hypothesis set                 bias    var
$\mathcal{H}_0$ (constant)     0.50    0.25
$\mathcal{H}_1$ (line)         0.21    1.69

The constant model wins ($0.50 + 0.25 = 0.75$ vs $0.21 + 1.69 = 1.90$) despite having higher bias — it's so much more stable that the variance saving wipes out the bias penalty.
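Those figures are straightforward to reproduce by Monte Carlo. The sketch below is my own illustration (the published numbers may have been computed with a slightly different protocol): it fits each model to many random two-point datasets, forms the average hypothesis, and measures bias and variance against $f(x) = \sin(\pi x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)

def bias_variance(fit, n_points=2, n_datasets=10_000, n_test=201):
    """Monte Carlo estimate of bias and variance for a fitting rule, target sin(pi x) on [-1, 1]."""
    x_test = np.linspace(-1, 1, n_test)
    preds = np.empty((n_datasets, n_test))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, n_points)
        preds[i] = fit(x, f(x), x_test)
    g_bar = preds.mean(axis=0)                        # the average hypothesis g-bar
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean((preds - g_bar) ** 2)
    return bias, var

def fit_constant(x, y, x_test):                       # H0: h(x) = b
    return np.full_like(x_test, y.mean())

def fit_line(x, y, x_test):                           # H1: h(x) = a x + b
    a, b = np.polyfit(x, y, 1)
    return a * x_test + b

for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    bias, var = bias_variance(fit)
    print(f"{name:8s}: bias ~ {bias:.2f}, var ~ {var:.2f}, total ~ {bias + var:.2f}")
```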

TIP — Choose model complexity by sample size

The above example reverses if you sample more points per training set. In the large-sample limit:

  • $\mathcal{H}_0$: bias stays near $0.50$, var shrinks toward $0$, total $\to \approx 0.50$.
  • $\mathcal{H}_1$: bias stays near $0.21$, var shrinks toward $0$, total $\to \approx 0.21$.

The line now wins. Optimal model complexity scales with $N$. Tiny datasets → simple models; abundant data → complex models. This generalises beyond the toy example: more data shrinks variance, eventually unmasking bias as the dominant error and rewarding flexibility.
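Re-running the sketch above with larger training sets (an illustrative choice of 100 points per dataset; fit_line then does a least-squares fit rather than interpolation) shows the reversal directly:

```python
# Variance collapses for both models; the line's lower bias now decides the contest.
for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    bias, var = bias_variance(fit, n_points=100, n_datasets=2_000)
    print(f"{name:8s}: bias ~ {bias:.2f}, var ~ {var:.2f}, total ~ {bias + var:.2f}")
```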

VC vs Bias–Variance: Two Lenses

The two analyses describe the same phenomenon from different angles:

                       VC analysis                                    Bias–variance
Loss                   0–1                                            Squared
Over                   Worst case                                     Average over $\mathcal{D}$
Output                 Bound on $E_{\mathrm{out}} - E_{\mathrm{in}}$  Decomposition of $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}]$
Distribution-free?     Yes                                            No

VC tells you whether learning generalises uniformly across distributions; bias–variance tells you what kind of error is left, on average, with a specific algorithm. They live on top of the same learning curves — VC shades the gap between $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ as the worst-case generalisation gap; bias–variance splits the curves' contents into bias (the noise floor's structural component) and variance (the gap above it).

Learning Curves

Plotting expected $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ against $N$ — a learning curve — visualises everything above. Universal qualitative behaviour:

  • $E_{\mathrm{in}}$ rises as $N$ grows (the model can no longer interpolate);
  • $E_{\mathrm{out}}$ falls as $N$ grows (more data, better generalisation);
  • They converge to the same asymptote (the bias-plus-noise floor of the model class).

For linear regression with Gaussian noise of variance $\sigma^2$ and $d + 1$ parameters, the curves have an exact closed form:

$$\mathbb{E}\big[E_{\mathrm{in}}\big] = \sigma^2\Big(1 - \tfrac{d+1}{N}\Big), \qquad \mathbb{E}\big[E_{\mathrm{out}}\big] = \sigma^2\Big(1 + \tfrac{d+1}{N}\Big)$$

The gap decays as $2\sigma^2\tfrac{d+1}{N}$ — every parameter "costs" about $\tfrac{2\sigma^2}{N}$ of generalisation gap. As $N \to \infty$, both curves sit at the noise floor $\sigma^2$.
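Plugging a few sample sizes into that closed form (illustrative choices $\sigma^2 = 0.25$, $d = 5$; this is just the formula above, not a simulation) shows the $1/N$ gap decay and the shared asymptote:

```python
sigma2, d = 0.25, 5
for N in (10, 20, 50, 100, 1000):
    e_in = sigma2 * (1 - (d + 1) / N)
    e_out = sigma2 * (1 + (d + 1) / N)
    print(f"N={N:5d}: E_in ~ {e_in:.3f}, E_out ~ {e_out:.3f}, gap ~ {e_out - e_in:.3f}")
# Both columns approach the noise floor sigma^2 = 0.25 from opposite sides.
```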

Empirical learning curves (with held-out validation error as a proxy for $E_{\mathrm{out}}$) are one of the most useful debugging tools in ML — they tell you whether more data, more capacity, or stronger regularisation is the right next step.
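A sketch of how such a curve can be generated for the linear-regression case (my own illustration; the dimension, noise level, and sample sizes are arbitrary), using a fresh held-out set as the proxy for $E_{\mathrm{out}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_learning_curve(d=5, sigma=0.5, n_trials=200, sizes=(10, 20, 50, 100, 500)):
    """Average training / held-out MSE of least-squares regression vs training-set size."""
    w_true = rng.normal(size=d + 1)
    for N in sizes:
        e_in, e_val = [], []
        for _ in range(n_trials):
            X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
            y = X @ w_true + sigma * rng.normal(size=N)
            w = np.linalg.lstsq(X, y, rcond=None)[0]
            e_in.append(np.mean((X @ w - y) ** 2))
            X_val = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, d))])
            y_val = X_val @ w_true + sigma * rng.normal(size=1000)
            e_val.append(np.mean((X_val @ w - y_val) ** 2))
        print(f"N={N:4d}: E_in ~ {np.mean(e_in):.3f}, E_val ~ {np.mean(e_val):.3f}")

empirical_learning_curve()   # the two columns converge to the noise floor sigma^2 = 0.25
```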


Concepts Introduced This Week

  • dichotomy — labelling pattern that a hypothesis produces on a finite set of inputs; the bridge from an infinite $\mathcal{H}$ to a finite count.
  • growth-function — $m_{\mathcal{H}}(N)$, the maximum number of dichotomies on $N$ inputs; bounded by $2^N$, often polynomial for structured $\mathcal{H}$.
  • break-point — smallest $k$ for which $m_{\mathcal{H}}(k) < 2^k$; equals $d_{\mathrm{VC}} + 1$.
  • vc-dimension — largest $N$ for which $\mathcal{H}$ shatters some $N$-point input set; for the perceptron in $\mathbb{R}^d$, $d_{\mathrm{VC}} = d + 1$. Finite $d_{\mathrm{VC}}$ certifies feasibility of learning.
  • bias-variance-decomposition — $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}] = \sigma^2 + \mathrm{bias} + \mathrm{var}$ for squared loss; complements VC's worst-case analysis.
  • learning-curve — expected $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ vs $N$; the shape distinguishes underfitting, overfitting, and the noise floor.

Connections

  • Builds on week-08: the generalisation bound ended with the question "what to do when $|\mathcal{H}| = \infty$?" This week answers it. Hoeffding + union bound generalises, with $M$ replaced by $m_{\mathcal{H}}(2N)$ — the dichotomy-counting generalisation.
  • Builds on non-linear-transformation: a $Q$-th order polynomial transform raises the VC dimension to roughly the dimension of the expanded feature space — a quantitative cost for the non-linearity. Now we have the framework to say why over-aggressive basis expansion overfits.
  • Builds on support-vector-machine: the margin restricts the hypothesis class to fat hyperplanes, lowering effective $d_{\mathrm{VC}}$ to at most $\lceil R^2/\rho^2 \rceil + 1$ — independent of input dimension. This is the formal sense in which SVMs "regularise via margin".
  • Sets up week-10: overfitting and underfitting examined empirically, plus regularisation and cross-validation as the practical tools to navigate the bias–variance trade-off you can't measure directly.

Open Questions

  • Why do deep networks generalise despite enormous $d_{\mathrm{VC}}$? Modern architectures have millions of parameters trained on hundreds of thousands of examples — VC theory predicts catastrophic overfitting that doesn't happen. Answers from "implicit regularisation" of SGD, flat-minima theory, and PAC-Bayes bounds are active research.
  • Does the worst-case nature of VC analysis ever lead practitioners astray? The factor of 1,000 between theoretical sample complexity ($N \approx 10{,}000\, d_{\mathrm{VC}}$) and the practical rule ($N \approx 10\, d_{\mathrm{VC}}$) is uncomfortable — if you trusted the bound, you'd give up on learning long before it becomes possible.
  • How do you actually estimate bias and variance in practice? You can’t from a single training set — bootstrap resampling or cross-validation give approximate estimates. The decomposition is more often used conceptually than measured.
  • What's the right way to think about effective capacity for regularised models? Ridge regression with large $\lambda$ has the same $\mathcal{H}$ as unregularised least squares but very different effective behaviour. "Effective dimension" or "effective $d_{\mathrm{VC}}$" formalisations attempt this; we'll touch on regularisation directly next week.