The Vapnik–Chervonenkis dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the largest $N$ such that some set of $N$ inputs is shattered by $\mathcal{H}$, i.e., $m_{\mathcal{H}}(N) = 2^N$. Finite $d_{\mathrm{VC}}$ guarantees that learning generalises: the growth function is polynomial of degree $d_{\mathrm{VC}}$, the VC bound contracts as $N \to \infty$, and the sample complexity scales as $N \approx 10\, d_{\mathrm{VC}}$ in practice.

Definition

$$d_{\mathrm{VC}}(\mathcal{H}) \;=\; \max\{\, N : m_{\mathcal{H}}(N) = 2^N \,\}$$

Equivalently:

  • For $N \le d_{\mathrm{VC}}$: there exist $N$ inputs that $\mathcal{H}$ can shatter (realise all $2^N$ labellings).
  • For $N > d_{\mathrm{VC}}$: no set of $N$ inputs is shattered; $k = d_{\mathrm{VC}} + 1$ is a break point for $\mathcal{H}$.

The VC dimension is a property of $\mathcal{H}$ alone: it does not depend on the input distribution $P(\mathbf{x})$, the learning algorithm $\mathcal{A}$, or the target function $f$. It captures a hypothesis set's capacity: the maximum number of distinct labellings it can realise on any set of inputs.
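To make shattering concrete, here is a minimal brute-force sketch for 1D threshold classifiers (the function names and the candidate-threshold grid are illustrative, not from the text): it enumerates the labellings the family realises on a point set and checks whether all $2^N$ appear.

```python
# Brute-force shattering check for threshold hypotheses h_a(x) = sign(x - a).
# The grid of candidate thresholds is illustrative; thresholds between and
# around the points suffice to realise every labelling a threshold can produce.

def labellings_realised(points, thresholds):
    """Distinct labellings the threshold family produces on `points`."""
    return {tuple(+1 if x > a else -1 for x in points) for a in thresholds}

def is_shattered(points, thresholds):
    # Shattered means all 2^N labellings appear.
    return len(labellings_realised(points, thresholds)) == 2 ** len(points)

grid = [-1.0, 0.1, 0.4, 0.6, 0.9, 2.0]
print(is_shattered([0.5], grid))       # True:  1 point is shattered
print(is_shattered([0.3, 0.7], grid))  # False: (+, -) is unrealisable, so d_VC = 1
```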

Examples

| Hypothesis set | VC dimension | Reason |
|---|---|---|
| Thresholds on $\mathbb{R}$ | $1$ | Shatters 1 point; cannot shatter 2 (can't realise $(+,-)$ if $x_1 < x_2$). |
| Positive intervals on $\mathbb{R}$ | $2$ | Shatters 2 points; cannot shatter 3 (the alternating labelling $(+,-,+)$ fails). |
| Linear classifiers in $\mathbb{R}^d$ ("perceptron in $d$ dimensions") | $d + 1$ | Shatters $d+1$ points in general position; $d+2$ points are too constrained. |
| Axis-aligned rectangles in $\mathbb{R}^2$ | $4$ | Shatters 4 points in a "diamond" arrangement; 5 fails because the interior point cannot be the unique negative one. |
| Convex sets in $\mathbb{R}^2$ | $\infty$ | Points on a circle: every labelling is realised by the convex hull of the positives. |
| Neural network with $W$ weights | $\approx W$ | Roughly linear in the parameter count for many architectures. |
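The interval row can be verified the same way; a sketch under the same brute-force assumptions (the endpoint grid is illustrative):

```python
# Verify the positive-interval row: h_{a,b}(x) = +1 iff a <= x <= b.
from itertools import combinations

def interval_labellings(points, grid):
    return {tuple(+1 if a <= x <= b else -1 for x in points)
            for a, b in combinations(sorted(grid), 2)}

grid = [i / 10 for i in range(-10, 21)]            # candidate endpoints
print(len(interval_labellings([0.2, 0.8], grid)))  # 4 = 2^2: shattered
print((+1, -1, +1) in interval_labellings([0.2, 0.5, 0.8], grid))
# False: the alternating labelling fails, so d_VC = 2
```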

The Perceptron Theorem

For the perceptron in $\mathbb{R}^d$ with bias $w_0$ (i.e., inputs $\mathbf{x} = (1, x_1, \ldots, x_d)$):

$$d_{\mathrm{VC}} = d + 1$$

The "" is the bias term . So a 2D perceptron has : it shatters any 3 non-collinear points, but cannot shatter 4 points in general position (the XOR labellings on a square are unrealisable).

The bound $d + 1$ matches the number of free parameters of a $d$-dimensional perceptron: $d$ slope coefficients plus 1 bias. This is no coincidence: the VC dimension of many "smooth" parameterised hypothesis sets is approximately the number of effective parameters.

The VC Bound

The replacement of the hypothesis count $M$ by the growth function $m_{\mathcal{H}}(2N)$ in the generalisation bound, with a careful argument that handles dependence between training and test sets, gives the VC bound:

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\epsilon^2 N / 8}$$

Substituting the polynomial bound $m_{\mathcal{H}}(2N) \le (2N)^{d_{\mathrm{VC}}} + 1$:

$$\mathbb{P}\big[\,|E_{\mathrm{in}} - E_{\mathrm{out}}| > \epsilon\,\big] \;\le\; 4\big((2N)^{d_{\mathrm{VC}}} + 1\big)\, e^{-\epsilon^2 N / 8}$$

Setting the RHS equal to $\delta$ and solving for $\epsilon$ gives the high-probability form:

$$E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

with probability at least $1 - \delta$.

The penalty term $\sqrt{\frac{8}{N}\ln\frac{4 m_{\mathcal{H}}(2N)}{\delta}}$, called $\Omega(N, \mathcal{H}, \delta)$, is the model-complexity penalty. It grows with $d_{\mathrm{VC}}$ and shrinks with $N$.
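A quick numerical reading of the penalty; a sketch assuming the polynomial growth-function bound above, with illustrative $(N, d_{\mathrm{VC}}, \delta)$ values:

```python
# Evaluate the complexity penalty Omega numerically, using the polynomial
# bound m_H(2N) <= (2N)^d_vc + 1. All numbers below are illustrative.
import math

def omega(N, d_vc, delta):
    m = (2 * N) ** d_vc + 1  # growth-function bound evaluated at 2N
    return math.sqrt(8 / N * math.log(4 * m / delta))

for N in (100, 1_000, 10_000, 100_000):
    print(f"N={N:>6}: Omega = {omega(N, d_vc=3, delta=0.1):.3f}")
# Omega shrinks as N grows, and would grow if d_vc were increased instead.
```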

Reading the Bound

| Direction | Effect |
|---|---|
| $d_{\mathrm{VC}}$ up | $E_{\mathrm{in}}$ down (more expressive, fits training data better), $\Omega$ up (worse generalisation gap) |
| $d_{\mathrm{VC}}$ down | $\Omega$ down (tighter generalisation), $E_{\mathrm{in}}$ up (less able to fit) |
| Best $d_{\mathrm{VC}}$ | Some intermediate value where the sum $E_{\mathrm{in}} + \Omega$ is minimised |

This is the formal source of the bias-variance / model-complexity trade-off. Simple models underfit ($E_{\mathrm{in}}$ large); complex models overfit ($\Omega$ large). The optimum is in between.
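To see the trade-off numerically, a sketch with hypothetical in-sample errors: the $E_{\mathrm{in}}$ values below are invented for illustration, and only $\Omega$ comes from the bound.

```python
# Illustration of the trade-off: E_in falls as d_vc rises, Omega rises,
# and the bound E_in + Omega is minimised at an intermediate d_vc.
import math

def omega(N, d_vc, delta=0.1):
    return math.sqrt(8 / N * math.log(4 * ((2 * N) ** d_vc + 1) / delta))

N = 10_000
e_in = {1: 0.30, 2: 0.20, 3: 0.12, 5: 0.08, 10: 0.05, 20: 0.04}  # hypothetical
for d, e in e_in.items():
    print(f"d_vc={d:>2}: E_in={e:.2f}  Omega={omega(N, d):.3f}  "
          f"bound={e + omega(N, d):.3f}")
```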

Sample Complexity

Inverting the VC bound: for fixed accuracy $\epsilon$, confidence $1 - \delta$, and VC dimension $d_{\mathrm{VC}}$,

$$N \;\ge\; \frac{8}{\epsilon^2}\,\ln\!\left(\frac{4\big((2N)^{d_{\mathrm{VC}}} + 1\big)}{\delta}\right)$$

is the number of samples needed. This is implicit in $N$ ($N$ appears on both sides), so iterate to find a self-consistent solution.

Worked example. $\epsilon = 0.1$, $\delta = 0.1$, $d_{\mathrm{VC}} = 3$:

  • Plug in $N = 1{,}000$: RHS $\approx 21{,}000$.
  • Iterate with $N \approx 21{,}000$: converges to $N \approx 30{,}000$.
  • For $d_{\mathrm{VC}} = 4$: $N \approx 40{,}000$.

So roughly $N \approx 10{,}000 \times d_{\mathrm{VC}}$ in theory. Practical rule of thumb: $N \ge 10\, d_{\mathrm{VC}}$ is usually enough; the bound is loose. Real models generalise better than theory promises.
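The iteration is easy to script; a sketch of the fixed-point computation above (function name mine):

```python
# Solve the implicit sample-complexity bound by fixed-point iteration:
# N <- (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta).
import math

def sample_complexity(eps, delta, d_vc, n0=1_000.0, iters=50):
    N = n0
    for _ in range(iters):
        N = 8 / eps**2 * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return N

print(round(sample_complexity(0.1, 0.1, 3)))  # ~30,000 (about 29,300)
print(round(sample_complexity(0.1, 0.1, 4)))  # ~40,000 (about 39,000)
```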

Margin and the SVM

The vanilla VC bound for a $d$-dimensional perceptron gives $d_{\mathrm{VC}} = d + 1$, which after polynomial basis expansion to degree $Q$ becomes $\tilde{d} + 1$, where the feature-space dimension $\tilde{d}$ grows combinatorially with $Q$. That is large.

SVMs are still linear classifiers, so naively their $d_{\mathrm{VC}}$ is $\tilde{d} + 1$. But SVMs add a margin constraint: they restrict $\mathcal{H}$ to hyperplanes with margin at least $\rho$. The margin restricts the hypothesis set, and a smaller hypothesis set has a smaller VC dimension:

$$d_{\mathrm{VC}} \;\le\; \left\lceil \frac{R^2}{\rho^2} \right\rceil + 1$$

where $R$ is the radius of a ball containing all data. This is independent of $\tilde{d}$: the margin gives data-dependent generalisation, and explains why SVMs work even with very high-dimensional kernel feature spaces.
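A sketch of how the margin bound behaves numerically; the radius and margin values are illustrative, not from any dataset:

```python
# The margin-based capacity bound ceil(R^2 / rho^2) + 1, evaluated for a
# few illustrative (R, rho) pairs.
import math

def margin_vc_bound(R, rho):
    return math.ceil(R**2 / rho**2) + 1

for rho in (0.1, 0.5, 1.0):
    print(f"R=1, rho={rho}: d_VC <= {margin_vc_bound(1.0, rho)}")
# The bound depends only on the ratio R/rho, not on the feature dimension.
```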

What VC Dimension Doesn’t Capture

The VC bound is the worst-case, distribution-free generalisation bound. It doesn’t account for:

  • Algorithm bias. Different learning algorithms may converge to different hypotheses within the same $\mathcal{H}$; a regulariser like ridge implicitly restricts capacity below $d_{\mathrm{VC}}(\mathcal{H})$.
  • Distribution structure. Real data tends not to be adversarial; effective complexity is often much smaller than $d_{\mathrm{VC}}$.
  • Implicit regularisation. Stochastic gradient descent on neural networks finds flat minima that generalise far better than parameter-counting predicts — a phenomenon outside classical VC theory.

For these reasons, modern deep learning routinely violates the naive prediction "models with $d_{\mathrm{VC}} \gg N$ should overfit catastrophically." Networks with millions of parameters trained on hundreds of thousands of examples generalise well, for reasons beyond what VC dimension alone explains.

Active Recall