The Vapnik–Chervonenkis dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the largest $N$ such that some set of $N$ inputs is shattered by $\mathcal{H}$, i.e., $m_{\mathcal{H}}(N) = 2^N$. Finite $d_{\mathrm{VC}}$ guarantees that learning generalises: the growth function is polynomial of degree $d_{\mathrm{VC}}$, the VC bound contracts as $N$ grows, and the sample complexity scales roughly linearly in $d_{\mathrm{VC}}$ (about $10\, d_{\mathrm{VC}}$ in practice).
Definition
$$d_{\mathrm{VC}}(\mathcal{H}) = \max\{\, N : m_{\mathcal{H}}(N) = 2^N \,\}$$

Equivalently:
- For $N \le d_{\mathrm{VC}}$: there exist $N$ inputs that $\mathcal{H}$ can shatter (realise all $2^N$ labellings).
- For $N > d_{\mathrm{VC}}$: no set of $N$ inputs is shattered. $k = d_{\mathrm{VC}} + 1$ is a break-point for $\mathcal{H}$.
The VC dimension is a property of $\mathcal{H}$ alone: it does not depend on the input distribution $P(\mathbf{x})$, the learning algorithm $\mathcal{A}$, or the target function $f$. It captures a hypothesis set’s capacity: the most labellings of any input set it can produce.
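To make the definition concrete, here is a minimal sketch (assuming Python with NumPy; the helper names are illustrative, not from any library) that brute-forces the dichotomies of the simplest class in the Examples table below, positive rays $h(x) = \mathrm{sign}(x - a)$, and checks shattering directly.

```python
import numpy as np

def ray_dichotomies(xs):
    """All labellings of the points xs realisable by h(x) = sign(x - a).

    Only thresholds between consecutive points (plus one on each side) matter,
    because moving 'a' within the same gap never changes any label.
    """
    xs = np.sort(np.asarray(xs, dtype=float))
    candidates = np.concatenate(([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2.0, [xs[-1] + 1.0]))
    return {tuple(np.sign(xs - a).astype(int)) for a in candidates}

def is_shattered(xs):
    """True if the positive-ray class realises all 2^N labellings of xs."""
    return len(ray_dichotomies(xs)) == 2 ** len(xs)

print(is_shattered([0.0]))       # True:  a 1-point set is shattered, so d_VC >= 1
print(is_shattered([0.0, 1.0]))  # False: (+, -) with x1 < x2 is unrealisable, so d_VC = 1
```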
Examples
| Hypothesis set | VC dimension | Reason |
|---|---|---|
| Thresholds on $\mathbb{R}$, $h(x) = \mathrm{sign}(x - a)$ | 1 | Shatters 1 point; cannot shatter 2 (can’t realise $(+,-)$ if $x_1 < x_2$). |
| Positive intervals on $\mathbb{R}$ | 2 | Shatters 2 points; cannot shatter 3 (the alternating labelling $(+,-,+)$ fails). |
| Linear classifiers in $\mathbb{R}^d$ (“perceptron in $d$ dimensions”) | $d + 1$ | Shatters $d + 1$ points in general position; $d + 2$ points are too constrained. |
| Axis-aligned rectangles in $\mathbb{R}^2$ | 4 | Shatters 4 points in a “diamond” arrangement; 5 fails because the interior point cannot be the unique negative one. |
| Convex sets in $\mathbb{R}^2$ | $\infty$ | Points on a circle: every labelling is realised by the convex hull of the positives. |
| Neural network with $W$ weights | $\approx W$ | Roughly linear in the parameter count for many architectures. |
The Perceptron Theorem
For the perceptron in $\mathbb{R}^d$ with bias $w_0$ (i.e., inputs augmented as $\mathbf{x} = (1, x_1, \dots, x_d)$):

$$d_{\mathrm{VC}} = d + 1$$
The "$+1$" is the bias term $w_0$. So a 2D perceptron has $d_{\mathrm{VC}} = 3$: it shatters any 3 non-collinear points, but cannot shatter 4 points in general position (the XOR labellings on a square are unrealisable).
The bound $d + 1$ matches the number of free parameters of a $d$-dimensional perceptron: $d$ slope coefficients plus 1 bias. This is no coincidence: the VC dimension of many “smooth” parameterised hypothesis sets is approximately the number of effective parameters.
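As a concrete check of the $d = 2$ case, the sketch below (assuming Python with NumPy and SciPy; the function names are made up for illustration) decides linear separability of each labelling with a small feasibility LP, confirming that three triangle vertices are shattered while the four corners of a square are not.

```python
from itertools import product
import numpy as np
from scipy.optimize import linprog

def linearly_separable(X, y):
    """Feasibility LP: does some (w, b) satisfy y_i * (w . x_i + b) >= 1 for all i?"""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A_ub = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])  # rows: -y_i * [x_i, 1]
    b_ub = -np.ones(len(X))
    n_vars = X.shape[1] + 1  # w plus bias b
    res = linprog(c=np.zeros(n_vars), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n_vars)
    return res.success

def shattered_by_perceptron(X):
    """True if every +/-1 labelling of the points X is linearly separable."""
    return all(linearly_separable(X, y) for y in product([-1.0, 1.0], repeat=len(X)))

triangle = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]            # 3 non-collinear points
square = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4 points in general position
print(shattered_by_perceptron(triangle))  # True:  all 8 labellings realisable, so d_VC >= 3
print(shattered_by_perceptron(square))    # False: the two XOR labellings fail
```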
The VC Bound
The replacement of the hypothesis count $M$ by the growth function $m_{\mathcal{H}}(2N)$ in the generalisation bound, with a careful argument that handles dependence between training and test sets, gives the VC bound:

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\epsilon^2 N / 8}$$
Substituting the polynomial bound $m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1$:

$$\mathbb{P}\big[\,|E_{\mathrm{in}}(g) - E_{\mathrm{out}}(g)| > \epsilon\,\big] \;\le\; 4\,\big((2N)^{d_{\mathrm{VC}}} + 1\big)\, e^{-\epsilon^2 N / 8}$$
Setting the RHS equal to $\delta$ and solving for $\epsilon$ gives the high-probability form:

$$E_{\mathrm{out}}(g) \;\le\; E_{\mathrm{in}}(g) + \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$$

with probability at least $1 - \delta$.
The penalty term, called $\Omega(N, \mathcal{H}, \delta) = \sqrt{\frac{8}{N}\,\ln\frac{4\, m_{\mathcal{H}}(2N)}{\delta}}$, is the model-complexity penalty. It grows with $d_{\mathrm{VC}}$ and shrinks with $N$.
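A quick way to get a feel for $\Omega$ is to evaluate it numerically. The sketch below (Python/NumPy; `vc_penalty` is a made-up helper, not a library function) plugs the polynomial growth-function bound into the penalty formula above.

```python
import numpy as np

def vc_penalty(N, d_vc, delta):
    """Omega(N, H, delta) with the polynomial bound (2N)^d_vc + 1 in place of m_H(2N)."""
    m_H_2N = (2.0 * N) ** d_vc + 1.0
    return np.sqrt(8.0 / N * np.log(4.0 * m_H_2N / delta))

# The penalty grows with d_vc and shrinks with N:
for d_vc in (3, 10, 30):
    print(d_vc, [round(vc_penalty(N, d_vc, 0.05), 3) for N in (1_000, 10_000, 100_000)])
```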
Reading the Bound
| Direction | Effect |
|---|---|
| $d_{\mathrm{VC}}$ up | $E_{\mathrm{in}}$ down (more expressive, fits training data better), $\Omega$ up (worse generalisation gap) |
| $d_{\mathrm{VC}}$ down | $\Omega$ down (tighter generalisation), $E_{\mathrm{in}}$ up (less able to fit) |
| Best $d_{\mathrm{VC}}$ | Some intermediate value where the sum $E_{\mathrm{in}} + \Omega$ is minimised |
This is the formal source of the bias-variance / model-complexity trade-off. Simple models underfit ($E_{\mathrm{in}}$ large); complex models overfit ($\Omega$ large). The optimum is in between.
Sample Complexity
Inverting the VC bound: for fixed accuracy $\epsilon$, confidence $1 - \delta$, and VC dimension $d_{\mathrm{VC}}$,

$$N \;\ge\; \frac{8}{\epsilon^2}\,\ln\frac{4\big((2N)^{d_{\mathrm{VC}}} + 1\big)}{\delta}$$

is the number of samples needed. This is implicit in $N$ (it appears on both sides), so iterate to find a self-consistent solution.
Worked example. $d_{\mathrm{VC}} = 3$, $\epsilon = 0.1$, $\delta = 0.1$:
- Plug in $N = 1{,}000$: RHS $\approx 21{,}193$.
- Iterate with the updated $N$: converges to $N \approx 30{,}000$.
- For $d_{\mathrm{VC}} = 4$: $N \approx 40{,}000$.
So roughly $10{,}000$ examples per unit of $d_{\mathrm{VC}}$ in theory. Practical rule of thumb: $N \ge 10\, d_{\mathrm{VC}}$ is usually enough; the bound is loose. Real models generalise better than theory promises.
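The iteration is a simple fixed-point loop. A minimal sketch (Python/NumPy; `vc_sample_complexity` is an illustrative name) that reproduces the worked example:

```python
import numpy as np

def vc_sample_complexity(d_vc, eps, delta, n_init=1_000.0, iters=50):
    """Fixed-point iteration for N >= (8 / eps^2) * ln(4 * ((2N)^d_vc + 1) / delta)."""
    N = n_init
    for _ in range(iters):
        N = 8.0 / eps**2 * np.log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
    return N

print(f"{vc_sample_complexity(3, 0.1, 0.1):,.0f}")  # ~29,300: the "roughly 30,000" above
print(f"{vc_sample_complexity(4, 0.1, 0.1):,.0f}")  # ~39,000: roughly 40,000 for d_VC = 4
```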
Margin and the SVM
The vanilla VC bound for a $d$-dimensional perceptron gives $d_{\mathrm{VC}} = d + 1$, which after polynomial basis expansion to degree $Q$ becomes $d_{\mathrm{VC}} = \binom{Q+d}{d}$, a large number.
SVMs are still linear classifiers, so naively their $d_{\mathrm{VC}}$ is $\tilde d + 1$ for a $\tilde d$-dimensional feature space. But SVMs add a margin constraint: they restrict $\mathcal{H}$ to hyperplanes with margin at least $\rho$. The margin restricts the hypothesis set, and a smaller hypothesis set has a smaller VC dimension:

$$d_{\mathrm{VC}} \;\le\; \left\lceil \frac{R^2}{\rho^2} \right\rceil + 1$$

where $R$ is the radius of a ball containing all data and $\rho$ is the margin. This is independent of $\tilde d$: the margin gives data-dependent generalisation, and explains why SVMs work even with very high-dimensional kernel feature spaces.
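To see the numbers on a toy dataset, the sketch below (assuming Python with NumPy and scikit-learn; the data and the choice of ball centre are illustrative) fits a near-hard-margin linear SVM, reads off the margin $\rho = 1/\lVert \mathbf{w} \rVert$, and evaluates $\lceil R^2/\rho^2 \rceil + 1$. Note that nothing in this quantity references the feature dimension.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Illustrative toy data: two well-separated 2D blobs, labelled -1 / +1.
X = np.vstack([rng.normal(loc=[-2.0, 0.0], scale=0.3, size=(50, 2)),
               rng.normal(loc=[+2.0, 0.0], scale=0.3, size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

# Large C approximates a hard-margin linear SVM; for a linear SVM the margin is 1/||w||.
clf = SVC(kernel="linear", C=1e6).fit(X, y)
rho = 1.0 / np.linalg.norm(clf.coef_[0])              # margin width
R = np.linalg.norm(X - X.mean(axis=0), axis=1).max()  # radius of a ball (centred at the mean) containing the data

capacity = np.ceil(R**2 / rho**2) + 1
print(f"rho = {rho:.2f}, R = {R:.2f}, ceil(R^2/rho^2) + 1 = {capacity:.0f}")
```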
What VC Dimension Doesn’t Capture
The VC bound is the worst-case, distribution-free generalisation bound. It doesn’t account for:
- Algorithm bias. Different learning algorithms may converge to different hypotheses within the same $\mathcal{H}$; a regulariser like ridge implicitly restricts capacity below $d_{\mathrm{VC}}(\mathcal{H})$.
- Distribution structure. Real data tends not to be adversarial; effective complexity is often much smaller than $d_{\mathrm{VC}}$.
- Implicit regularisation. Stochastic gradient descent on neural networks finds flat minima that generalise far better than parameter-counting predicts — a phenomenon outside classical VC theory.
For these reasons, modern deep learning routinely violates the naive prediction “models with $d_{\mathrm{VC}} \gg N$ should overfit catastrophically.” Networks with millions of parameters trained on hundreds of thousands of examples generalise well, for reasons beyond what VC dimension alone explains.
Related
- dichotomy — the labelling pattern that VC dimension counts.
- growth-function — $m_{\mathcal{H}}(N)$, with $m_{\mathcal{H}}(N) = 2^N$ for $N \le d_{\mathrm{VC}}$ but $m_{\mathcal{H}}(N) \le N^{d_{\mathrm{VC}}} + 1$ beyond.
- break-point — the smallest break-point equals $d_{\mathrm{VC}} + 1$.
- generalization-bound — the simpler $M$-based bound that VC dimension generalises.
- bias-variance-decomposition — the average-case alternative to VC’s worst-case analysis.
- support-vector-machine — SVMs achieve generalisation by margin-based restriction of effective $d_{\mathrm{VC}}$.
Active Recall
Show that the VC dimension of a perceptron in $\mathbb{R}^2$ (with bias) is 3 by demonstrating both a set of 3 points it shatters and a set of 4 points it cannot.
Shattering 3 points. Place 3 non-collinear points as the vertices of a triangle. For any of the $2^3 = 8$ labellings, draw a line that puts the $+$ vertices on one side and the $-$ vertices on the other, which is possible because the points aren’t collinear. So $d_{\mathrm{VC}} \ge 3$. Failing on 4 points. Place 4 points at the corners of a square. The labelling with $+$ on one diagonal and $-$ on the other (XOR pattern) cannot be realised by any line: the $+$ corners and $-$ corners are interleaved diagonally, and no single line separates them. So $d_{\mathrm{VC}} < 4$, and $d_{\mathrm{VC}} = 3$.
The VC dimension of a perceptron in $\mathbb{R}^d$ is $d + 1$. Where does the "$+1$" come from, and what does this say about the relationship between VC dimension and parameter count?
The "$+1$" is the bias term $w_0$. With inputs augmented as $\mathbf{x} = (1, x_1, \dots, x_d)$, the perceptron has $d + 1$ free parameters $(w_0, w_1, \dots, w_d)$, each contributing a degree of freedom to the decision boundary. For perceptrons specifically, $d_{\mathrm{VC}} = d + 1$ equals the number of free parameters exactly. For other smooth parameterised models, neural networks especially, VC dimension is approximately linear in the parameter count, but not always equal. Parameter counting is a useful heuristic, not a precise rule.
Why does using a polynomial basis expansion of degree $Q$ on a $d$-dimensional perceptron increase the VC dimension to $\binom{Q+d}{d}$, and what's the practical consequence?
A degree-$Q$ polynomial basis has $\binom{Q+d}{d} - 1$ non-bias features (all monomials up to degree $Q$). The classifier is a perceptron in this $\tilde d$-dimensional feature space, so $d_{\mathrm{VC}} = \tilde d + 1 = \binom{Q+d}{d}$. For example, $d = 2$, $Q = 10$ gives $d_{\mathrm{VC}} = \binom{12}{2} = 66$. The model-complexity penalty $\Omega$ in the VC bound grows with this large $d_{\mathrm{VC}}$, demanding hugely more samples for the same generalisation gap. Practically: high-degree polynomial expansions overfit easily and require either heavy regularisation or massive data.
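A two-line check of how fast this parameter count grows (Python, using `math.comb`; the chosen $d$ and $Q$ values are just for illustration):

```python
from math import comb

# d_VC of a perceptron on a degree-Q polynomial expansion of d inputs: C(Q + d, d).
for d in (2, 5, 10):
    print(d, [comb(Q + d, d) for Q in (2, 3, 5, 10)])
# d = 2:  [6, 10, 21, 66]          (66 is the example above)
# d = 10: [66, 286, 3003, 184756]  (explodes quickly)
```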
Why is the SVM's bound $d_{\mathrm{VC}} \le \lceil R^2/\rho^2 \rceil + 1$, depending on data geometry through $R$ and $\rho$, and how does this differ from the perceptron's $d_{\mathrm{VC}} = d + 1$?
The plain perceptron bound $d_{\mathrm{VC}} = d + 1$ depends only on the input dimension; it ignores the data. The SVM constrains the hypothesis class to fat hyperplanes of margin at least $\rho$, restricting the classifier from labelling closely-spaced points arbitrarily. With data confined to a ball of radius $R$, an "$R^2/\rho^2$" bound emerges from a packing argument: only so many “fat” hyperplanes can fit before they must overlap. The bound has a flavour of data-dependent capacity: a large margin (relative to data scale) implies low effective capacity. This is the formal expression of “SVMs generalise because of the margin” and is independent of input dimension: the kernel trick can blow the feature dimension up to infinity without spoiling generalisation.
A model has VC dimension $d_{\mathrm{VC}}$. Roughly how many training samples does the practical rule of thumb suggest you need for good generalisation, and how does this compare to the theoretical VC sample complexity?
Practical rule: $N \approx 10\, d_{\mathrm{VC}}$ samples. The theoretical VC bound asks for closer to $10{,}000\, d_{\mathrm{VC}}$, orders of magnitude more. The huge gap reflects that the VC bound is worst-case and distribution-free; real distributions are far more benign, so empirical generalisation kicks in much earlier than theory promises.