The Vapnik–Chervonenkis dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the largest $N$ such that some set of $N$ inputs is shattered by $\mathcal{H}$, i.e., $m_{\mathcal{H}}(N) = 2^N$. Finite $d_{\mathrm{VC}}$ guarantees that learning generalises: the growth function is polynomial of degree $d_{\mathrm{VC}}$, the VC bound contracts as $N$ grows, and the sample complexity scales roughly linearly in $d_{\mathrm{VC}}$ (about $10\, d_{\mathrm{VC}}$ in practice).
Definition
$$d_{\mathrm{VC}}(\mathcal{H}) = \max\{\, N : m_{\mathcal{H}}(N) = 2^N \,\}$$

Equivalently:
- For $N \le d_{\mathrm{VC}}$: there exist $N$ inputs that $\mathcal{H}$ can shatter (realise all $2^N$ labellings).
- For $N > d_{\mathrm{VC}}$: no set of $N$ inputs is shattered. $k = d_{\mathrm{VC}} + 1$ is a break-point for $\mathcal{H}$.
The VC dimension is a property of $\mathcal{H}$ alone: it does not depend on the input distribution $P(\mathbf{x})$, the learning algorithm $\mathcal{A}$, or the target function $f$. It captures a hypothesis set’s capacity: the most labellings of any input set it can produce.
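To make the definition concrete, here is a minimal sketch (assuming Python with NumPy; the helper names are illustrative, not from any library) that brute-forces the dichotomies of the simplest class in the Examples table below, positive rays $h(x) = \mathrm{sign}(x - a)$, and checks shattering directly.

```python
import numpy as np

def ray_dichotomies(xs):
    """All labellings of the points xs realisable by h(x) = sign(x - a).

    Only thresholds between consecutive points (plus one on each side) matter,
    because moving 'a' within the same gap never changes any label.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    return {tuple(np.sign(xs - a).astype(int)) for a in candidates}

def is_shattered(xs):
    """True if the positive-ray class realises all 2^N labellings of xs."""
    return len(ray_dichotomies(xs)) == 2 ** len(xs)

print(is_shattered([0.0]))       # True:  a 1-point set is shattered, so d_VC >= 1
print(is_shattered([0.0, 1.0]))  # False: (+, -) with x1 < x2 is unrealisable, so d_VC = 1
```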
Examples
| Hypothesis set | VC dimension | Reason |
|---|---|---|
| Thresholds on $\mathbb{R}$, $h(x) = \mathrm{sign}(x - a)$ | 1 | Shatters 1 point; cannot shatter 2 (can’t realise $(+,-)$ if $x_1 < x_2$). |
| Positive intervals on $\mathbb{R}$ | 2 | Shatters 2 points; cannot shatter 3 (the alternating labelling $(+,-,+)$ fails). |
| Linear classifiers in $\mathbb{R}^d$ (“perceptron in $d$ dimensions”) | $d + 1$ | Shatters $d + 1$ points in general position; $d + 2$ points are too constrained. |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 | Shatters 4 points in a “diamond” arrangement; 5 fails because the interior point cannot be the unique negative one. |
| Convex sets in $\mathbb{R}^2$ | $\infty$ | Points on a circle: every labelling is realised by the convex hull of the positives. |
| Neural network with $W$ weights | $\approx W$ | Roughly linear in the parameter count for many architectures. |
The Perceptron Theorem
For the perceptron in $\mathbb{R}^d$ with bias $w_0$ (i.e., inputs augmented as $\mathbf{x} = (1, x_1, \dots, x_d)$):

$$d_{\mathrm{VC}} = d + 1$$
The "$+1$" is the bias term $w_0$. So a 2D perceptron has $d_{\mathrm{VC}} = 3$: it shatters any 3 non-collinear points, but cannot shatter 4 points in general position (the XOR labellings on a square are unrealisable).
The bound $d + 1$ matches the number of free parameters of a $d$-dimensional perceptron: $d$ slope coefficients plus 1 bias. This is no coincidence: the VC dimension of many “smooth” parameterised hypothesis sets is approximately the number of effective parameters.
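As a concrete check of the $d = 2$ case, the sketch below (assuming Python with NumPy and SciPy; the function names are made up for illustration) decides linear separability of each labelling with a small feasibility LP, confirming that three triangle vertices are shattered while the four corners of a square are not.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy y_i * (w . x_i + b) >= 1 for all i?"""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])  # rows: -y_i * [x_i, 1]
    b_ub = -np.ones(len(X))
    n_vars = X.shape[1] + 1  # w plus bias b
    res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_vars)
    return res.success

def shattered_by_perceptron(X):
    """True if every +/-1 labelling of the points X is linearly separable."""
    return all(linearly_separable(X, y) for y in product([-1.0, 1.0], repeat=len(X)))

triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]            # 3 non-collinear points
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 points in general position
print(shattered_by_perceptron(triangle))  # True:  all 8 labellings realisable, so d_VC >= 3
print(shattered_by_perceptron(square))    # False: the two XOR labellings fail
```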
The VC Bound
The replacement of the hypothesis count $M$ by the growth function $m_{\mathcal{H}}(2N)$ in the generalisation bound, with a careful argument that handles dependence between training and test sets, gives the VC bound:

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\epsilon^2 N / 8}$$
Substituting the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1$:

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 4\,\big((2N)^{d_{\mathrm{VC}}} + 1\big)\, e^{-\epsilon^2 N / 8}$$
Setting the RHS equal to $\delta$ and solving for $\epsilon$ gives the high-probability form:

$$E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

with probability at least $1 - \delta$.
The penalty term, called $\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$, is the model-complexity penalty. It grows with $d_{\mathrm{VC}}$ and shrinks with $N$.
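A quick way to get a feel for $\Omega$ is to evaluate it numerically. The sketch below (Python/NumPy; `vc_penalty` is a made-up helper, not a library function) plugs the polynomial growth-function bound into the penalty formula above.

```python
import numpy as np

def vc_penalty(N, d_vc, delta):
    """Omega(N, H, delta) with the polynomial bound (2N)^d_vc + 1 in place of m_H(2N)."""
    m_H_2N = (2.0 * N) ** d_vc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * m_H_2N / delta))

# The penalty grows with d_vc and shrinks with N:
for d_vc in (3, 10, 30):
    print(d_vc, [round(vc_penalty(N, d_vc, 0.05), 3) for N in (1_000, 10_000, 100_000)])
```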
Reading the Bound
| Direction | Effect |
|---|---|
| $d_{\mathrm{VC}}$ up | $E_{\mathrm{in}}$ down (more expressive, fits training data better), $\Omega$ up (worse generalisation gap) |
| $d_{\mathrm{VC}}$ down | $\Omega$ down (tighter generalisation), $E_{\mathrm{in}}$ up (less able to fit) |
| Best $d_{\mathrm{VC}}$ | Some intermediate value where the sum $E_{\mathrm{in}} + \Omega$ is minimised |
This is the formal source of the bias-variance / model-complexity trade-off. Simple models underfit ($E_{\mathrm{in}}$ large); complex models overfit ($\Omega$ large). The optimum is in between.
Sample Complexity
Inverting the VC bound: for fixed accuracy $\epsilon$, confidence $1 - \delta$, and VC dimension $d_{\mathrm{VC}}$,

$$N \;\ge\; \frac{8}{\epsilon^2}\,\ln\frac{4\big((2N)^{d_{\mathrm{VC}}} + 1\big)}{\delta}$$

is the number of samples needed. This is implicit in $N$ (it appears on both sides), so iterate to find a self-consistent solution.
Worked example. $d_{\mathrm{VC}} = 3$, $\epsilon = 0.1$, $\delta = 0.1$:
- Plug in $N = 1{,}000$: RHS $\approx 21{,}193$.
- Iterate with the updated $N$: converges to $N \approx 30{,}000$.
- For $d_{\mathrm{VC}} = 4$: $N \approx 40{,}000$.
So roughly $10{,}000$ examples per unit of $d_{\mathrm{VC}}$ in theory. Practical rule of thumb: $N \ge 10\, d_{\mathrm{VC}}$ is usually enough; the bound is loose. Real models generalise better than theory promises.
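The iteration is a simple fixed-point loop. A minimal sketch (Python/NumPy; `vc_sample_complexity` is an illustrative name) that reproduces the worked example:

```python
import numpy as np

def vc_sample_complexity(d_vc, eps, delta, n_init=1_000.0, iters=50):
    """Fixed-point iteration for N >= (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta)."""
    N = n_init
    for _ in range(iters):
        N = 8.0 / eps**2 * np.log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
    return N

print(f"{vc_sample_complexity(3, 0.1, 0.1):,.0f}")  # ~29,300: the "roughly 30,000" above
print(f"{vc_sample_complexity(4, 0.1, 0.1):,.0f}")  # ~39,000: roughly 40,000 for d_VC = 4
```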
Margin and the SVM
The vanilla VC bound for a $d$-dimensional perceptron gives $d_{\mathrm{VC}} = d + 1$, which after polynomial basis expansion to degree $Q$ becomes $d_{\mathrm{VC}} = \binom{Q+d}{d}$, a large number.
SVMs are still linear classifiers, so naively their $d_{\mathrm{VC}}$ is $\tilde d + 1$ for a $\tilde d$-dimensional feature space. But SVMs add a margin constraint: they restrict $\mathcal{H}$ to hyperplanes with margin at least $\rho$. The margin restricts the hypothesis set, and a smaller hypothesis set has a smaller VC dimension:

$$d_{\mathrm{VC}} \;\le\; \left\lceil \frac{R^2}{\rho^2} \right\rceil + 1$$

where $R$ is the radius of a ball containing all data and $\rho$ is the margin. This is independent of $\tilde d$: the margin gives data-dependent generalisation, and explains why SVMs work even with very high-dimensional kernel feature spaces.
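To see the numbers on a toy dataset, the sketch below (assuming Python with NumPy and scikit-learn; the data and the choice of ball centre are illustrative) fits a near-hard-margin linear SVM, reads off the margin $\rho = 1/\lVert \mathbf{w} \rVert$, and evaluates $\lceil R^2/\rho^2 \rceil + 1$. Note that nothing in this quantity references the feature dimension.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative toy data: two well-separated 2D blobs, labelled -1 / +1.
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(50, 2)),
               rng.normal(loc=[+2.0, 0.0], scale=0.3, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Large C approximates a hard-margin linear SVM; for a linear SVM the margin is 1/||w||.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
rho = 1.0 / np.linalg.norm(clf.coef_[0])              # margin width
R = np.linalg.norm(X - X.mean(axis=0), axis=1).max()  # radius of a ball (centred at the mean) containing the data

capacity = np.ceil(R**2 / rho**2) + 1
print(f"rho = {rho:.2f}, R = {R:.2f}, ceil(R^2/rho^2) + 1 = {capacity:.0f}")
```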
What VC Dimension Doesn’t Capture
The VC bound is the worst-case, distribution-free generalisation bound. It doesn’t account for:
- Algorithm bias. Different learning algorithms may converge to different hypotheses within the same $\mathcal{H}$; a regulariser like ridge implicitly restricts capacity below $d_{\mathrm{VC}}(\mathcal{H})$.
- Distribution structure. Real data tends not to be adversarial; effective complexity is often much smaller than $d_{\mathrm{VC}}$.
- Implicit regularisation. Stochastic gradient descent on neural networks finds flat minima that generalise far better than parameter-counting predicts — a phenomenon outside classical VC theory.
For these reasons, modern deep learning routinely violates the naive prediction “models with $d_{\mathrm{VC}} \gg N$ should overfit catastrophically.” Networks with millions of parameters trained on hundreds of thousands of examples generalise well, for reasons beyond what VC dimension alone explains.
Related
- dichotomy — the labelling pattern that VC dimension counts.
- growth-function — $m_{\mathcal{H}}(N)$, with $m_{\mathcal{H}}(N) = 2^N$ for $N \le d_{\mathrm{VC}}$ but $m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1$ beyond.
- break-point — the smallest break-point equals $d_{\mathrm{VC}} + 1$.
- generalization-bound — the simpler $M$-based bound that VC dimension generalises.
- bias-variance-decomposition — the average-case alternative to VC’s worst-case analysis.
- support-vector-machine — SVMs achieve generalisation by margin-based restriction of effective $d_{\mathrm{VC}}$.
Active Recall
Show that the VC dimension of a perceptron in $\mathbb{R}^2$ (with bias) is 3 by demonstrating both a set of 3 points it shatters and a set of 4 points it cannot.
Shattering 3 points. Place 3 non-collinear points as the vertices of a triangle. For any of the $2^3 = 8$ labellings, draw a line that puts the $+$ vertices on one side and the $-$ vertices on the other, which is possible because the points aren’t collinear. So $d_{\mathrm{VC}} \ge 3$. Failing on 4 points. Place 4 points at the corners of a square. The labelling with $+$ on one diagonal and $-$ on the other (XOR pattern) cannot be realised by any line: the $+$ corners and $-$ corners are interleaved diagonally, and no single line separates them. So $d_{\mathrm{VC}} < 4$, and $d_{\mathrm{VC}} = 3$.
The VC dimension of a perceptron in $\mathbb{R}^d$ is $d + 1$. Where does the "$+1$" come from, and what does this say about the relationship between VC dimension and parameter count?
The "$+1$" is the bias term $w_0$. With inputs augmented as $\mathbf{x} = (1, x_1, \dots, x_d)$, the perceptron has $d + 1$ free parameters $(w_0, w_1, \dots, w_d)$, each contributing a degree of freedom to the decision boundary. For perceptrons specifically, $d_{\mathrm{VC}} = d + 1$ equals the number of free parameters exactly. For other smooth parameterised models, neural networks especially, VC dimension is approximately linear in the parameter count, but not always equal. Parameter counting is a useful heuristic, not a precise rule.
Why does using a polynomial basis expansion of degree $Q$ on a $d$-dimensional perceptron increase the VC dimension to $\binom{Q+d}{d}$, and what's the practical consequence?
A degree-$Q$ polynomial basis has $\binom{Q+d}{d} - 1$ non-bias features (all monomials up to degree $Q$). The classifier is a perceptron in this $\tilde d$-dimensional feature space, so $d_{\mathrm{VC}} = \tilde d + 1 = \binom{Q+d}{d}$. For example, $d = 2$, $Q = 10$ gives $d_{\mathrm{VC}} = \binom{12}{2} = 66$. The model-complexity penalty $\Omega$ in the VC bound grows with this large $d_{\mathrm{VC}}$, demanding hugely more samples for the same generalisation gap. Practically: high-degree polynomial expansions overfit easily and require either heavy regularisation or massive data.
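A two-line check of how fast this parameter count grows (Python, using `math.comb`; the chosen $d$ and $Q$ values are just for illustration):

```python
from math import comb

# d_VC of a perceptron on a degree-Q polynomial expansion of d inputs: C(Q + d, d).
for d in (2, 5, 10):
    print(d, [comb(Q + d, d) for Q in (2, 3, 5, 10)])
# d = 2:  [6, 10, 21, 66]          (66 is the example above)
# d = 10: [66, 286, 3003, 184756]  (explodes quickly)
```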
Why is the SVM's bound $d_{\mathrm{VC}} \le \lceil R^2/\rho^2 \rceil + 1$, depending on data geometry through $R$ and $\rho$, and how does this differ from the perceptron's $d_{\mathrm{VC}} = d + 1$?
The plain perceptron bound $d_{\mathrm{VC}} = d + 1$ depends only on the input dimension; it ignores the data. The SVM constrains the hypothesis class to fat hyperplanes of margin at least $\rho$, restricting the classifier from labelling closely-spaced points arbitrarily. With data confined to a ball of radius $R$, an "$R^2/\rho^2$" bound emerges from a packing argument: only so many “fat” hyperplanes can fit before they must overlap. The bound has a flavour of data-dependent capacity: a large margin (relative to data scale) implies low effective capacity. This is the formal expression of “SVMs generalise because of the margin” and is independent of input dimension: the kernel trick can blow the feature dimension up to infinity without spoiling generalisation.
A model has VC dimension $d_{\mathrm{VC}}$. Roughly how many training samples does the practical rule of thumb suggest you need for good generalisation, and how does this compare to the theoretical VC sample complexity?
Practical rule: $N \approx 10\, d_{\mathrm{VC}}$ samples. The theoretical VC bound asks for closer to $10{,}000\, d_{\mathrm{VC}}$, orders of magnitude more. The huge gap reflects that the VC bound is worst-case and distribution-free; real distributions are far more benign, so empirical generalisation kicks in much earlier than theory promises.