The growth function $m_{\mathcal{H}}(N)$ is the largest number of distinct dichotomies that $\mathcal{H}$ can produce on any choice of $N$ inputs. Bounded by $2^N$ in the worst case, often polynomial in $N$ for structured hypothesis sets — and the quantity that replaces $|\mathcal{H}|$ in the generalisation bound when $|\mathcal{H}| = \infty$.
Definition
For a hypothesis set $\mathcal{H}$ of binary classifiers on $\mathcal{X}$, define

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N \in \mathcal{X}} \left|\mathcal{H}(x_1, \dots, x_N)\right|,$$

where $|\mathcal{H}(x_1, \dots, x_N)|$ is the number of distinct labellings $\mathcal{H}$ can produce on those specific inputs.
Two key properties:
- $m_{\mathcal{H}}(N) \le 2^N$ — there are only $2^N$ binary labellings of $N$ points.
- $m_{\mathcal{H}}(N)$ depends only on $\mathcal{H}$ and $N$. It does not depend on the input distribution $P$, the learning algorithm, or the target function.
The “max over inputs” eliminates dependence on the specific training set. We’re asking: with the most adversarial possible placement of $N$ points, how many distinct labellings can $\mathcal{H}$ realise?
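As a sanity check on the definition, here is a minimal brute-force sketch (the names `num_dichotomies` and `growth_function` are mine): it counts distinct labellings for a finite hypothesis set and takes the max over all size-$N$ subsets of a finite candidate domain, which is enough to reproduce the small worked examples below.

```python
from itertools import combinations

def num_dichotomies(hypotheses, xs):
    """Number of distinct labellings the hypothesis set produces on the inputs xs."""
    return len({tuple(h(x) for x in xs) for h in hypotheses})

def growth_function(hypotheses, domain, N):
    """m_H(N): maximise the dichotomy count over all choices of N inputs from domain."""
    return max(num_dichotomies(hypotheses, xs) for xs in combinations(domain, N))

# Demo: positive rays h_a(x) = sign(x - a), thresholds on a half-integer grid.
rays = [(lambda x, a=a: +1 if x > a else -1) for a in [i - 0.5 for i in range(11)]]
for N in range(1, 5):
    print(N, growth_function(rays, list(range(10)), N))   # prints N + 1: 2, 3, 4, 5
```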
Worked Examples
Positive Rays
$\mathcal{H} = \{\,h_a : h_a(x) = \operatorname{sign}(x - a)\,\}$ — threshold functions on the real line.
Place $N$ points on $\mathbb{R}$, sorted: $x_1 < x_2 < \dots < x_N$. Sliding the threshold $a$ from $-\infty$ to $+\infty$, the labelling pattern changes only when $a$ crosses one of the points. There are $N + 1$ “regions” for $a$, each producing a distinct dichotomy:

$$m_{\mathcal{H}}(N) = N + 1.$$

The “$N - 1$” comes from the regions between consecutive flip points; the “$2$” from the two ends.
Positive Intervals
$\mathcal{H} = \{\,h_{a,b} : h_{a,b}(x) = +1 \iff a \le x \le b\,\}$.
Place $N$ points on $\mathbb{R}$. A dichotomy is determined by which two of the $N + 1$ “gaps between/around points” the endpoints $a$ and $b$ fall into. Choosing 2 distinct gaps from the $N + 1$ gives

$$m_{\mathcal{H}}(N) = \binom{N + 1}{2} + 1 = \frac{1}{2}N^2 + \frac{1}{2}N + 1.$$

The extra “$+1$” is the all-negative labelling (both endpoints in the same gap, so the interval contains no points).
2D Perceptrons
$\mathcal{H} =$ all linear classifiers in $\mathbb{R}^2$. By direct enumeration:

| $N$ | $m_{\mathcal{H}}(N)$ | $2^N$ |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 | 8 |
| 4 | 14 | 16 |
At $N = 4$ the count drops below $2^N$ for the first time. With more careful analysis, $m_{\mathcal{H}}(N) = N^2 - N + 2$ for 2D perceptrons — polynomial, not exponential.
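The $N = 4$ row can be recovered by randomly sampling lines. A Monte Carlo sketch (my construction; sampling only lower-bounds the true count, though it finds all 14 dichotomies of 4 points in general position almost immediately):

```python
import random

def perceptron_dichotomies(points, trials=200_000, seed=0):
    """Distinct labellings sign(w1*x + w2*y + b) realised by randomly sampled lines."""
    rng = random.Random(seed)
    seen = set()
    for _ in range(trials):
        w1, w2, b = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
        seen.add(tuple(+1 if w1 * x + w2 * y + b > 0 else -1 for x, y in points))
    return seen

square = [(0, 0), (0, 1), (1, 0), (1, 1)]
d = perceptron_dichotomies(square)
print(len(d))                                  # 14, not 2^4 = 16
xor = (+1, -1, -1, +1)                         # the two XOR labellings are missing
print(xor in d, tuple(-v for v in xor) in d)   # False False
```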
Convex Sets in
$\mathcal{H} = \{\,h_C : h_C(x) = +1 \iff x \in C,\ C \subseteq \mathbb{R}^2 \text{ convex}\,\}$.
Place the $N$ inputs on a circle. For any labelling, take the convex hull of the $+1$ points (slightly extended): every dichotomy is realisable. So

$$m_{\mathcal{H}}(N) = 2^N \quad \text{for all } N.$$

Convex sets are maximally expressive on circles — they shatter any $N$ points placed this way. The hypothesis set has no break point and infinite VC dimension.
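The circle construction can be verified exhaustively for small $N$: points on a circle are in convex position, no three are collinear, so it suffices to check that the hull of the $+1$ points never strictly contains a $-1$ point. A sketch (my code):

```python
from itertools import product
from math import cos, sin, pi

def circle_points_shattered(N):
    """True iff convex sets realise all 2^N labellings of N points on a circle."""
    pts = [(cos(2 * pi * k / N), sin(2 * pi * k / N)) for k in range(N)]

    def strictly_inside(p, hull):       # hull vertices in counter-clockwise order
        if len(hull) < 3:
            return False                # a point or a chord has empty interior
        return all((b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) > 0
                   for a, b in zip(hull, hull[1:] + hull[:1]))

    for labels in product([+1, -1], repeat=N):
        # Positives listed in angular order are already the CCW hull vertices.
        pos = [p for p, y in zip(pts, labels) if y == +1]
        neg = [p for p, y in zip(pts, labels) if y == -1]
        if any(strictly_inside(p, pos) for p in neg):
            return False                # hull of +1 points would swallow a -1 point
    return True

print(all(circle_points_shattered(N) for N in range(1, 9)))   # True
```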
The Break Point
A break point for $\mathcal{H}$ is a $k$ such that no set of $k$ inputs is shattered — i.e.,

$$m_{\mathcal{H}}(k) < 2^k.$$

If $k$ is a break point, every $k' > k$ is also a break point: if you can’t shatter $k$ inputs, you can’t shatter $k' > k$ either (shattering $k'$ points would require shattering every $k$-point subset, and none is shattered).
Break points for the examples above:
| $\mathcal{H}$ | Growth function | Break point |
|---|---|---|
| Positive rays | $N + 1$ ($m_{\mathcal{H}}(2) = 3 < 2^2$) | $k = 2$ |
| Positive intervals | $\frac{1}{2}N^2 + \frac{1}{2}N + 1$ ($m_{\mathcal{H}}(3) = 7 < 2^3$) | $k = 3$ |
| 2D perceptrons | $N^2 - N + 2$ ($m_{\mathcal{H}}(4) = 14 < 2^4$) | $k = 4$ |
| Convex sets in $\mathbb{R}^2$ | $2^N$ | None |
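Given the closed-form growth functions, these break points can be found mechanically: scan upward for the first $k$ with $m_{\mathcal{H}}(k) < 2^k$. A small sketch (function names mine):

```python
from math import comb

GROWTH = {
    "positive rays":      lambda N: N + 1,
    "positive intervals": lambda N: comb(N + 1, 2) + 1,
    "2D perceptrons":     lambda N: N * N - N + 2,
}

def break_point(name, max_N=20):
    """Smallest k with m_H(k) < 2^k; every larger k is then a break point too."""
    return next((k for k in range(1, max_N + 1) if GROWTH[name](k) < 2 ** k), None)

for name in GROWTH:
    print(name, break_point(name))   # 2, 3, 4, matching the table
```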
The Polynomial Bound
The deep result of VC theory: if $\mathcal{H}$ has any break point $k$, then $m_{\mathcal{H}}(N)$ is bounded by a polynomial in $N$. Concretely, with $d_{\mathrm{VC}} = k - 1$ (the VC dimension):

$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i} \le N^{d_{\mathrm{VC}}} + 1.$$
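Numerically the gap is dramatic. A quick comparison of the three quantities, taking $d_{\mathrm{VC}} = 3$ (2D perceptrons) as an example (my sketch):

```python
from math import comb

def sauer_bound(N, d_vc):
    """Polynomial bound on m_H(N): sum of C(N, i) for i = 0 .. d_vc."""
    return sum(comb(N, i) for i in range(d_vc + 1))

d_vc = 3
for N in [5, 10, 20, 50]:
    print(N, sauer_bound(N, d_vc), N ** d_vc + 1, 2 ** N)
# The middle two columns grow like N^3; the last column explodes as 2^N.
```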
The drop from exponential ($2^N$) to polynomial ($N^{d_{\mathrm{VC}}}$) is what makes the generalisation bound non-vacuous for infinite hypothesis sets. Plug a polynomial $m_{\mathcal{H}}(2N)$ into

$$E_{\text{out}}(g) \le E_{\text{in}}(g) + \sqrt{\frac{8}{N} \ln \frac{4\,m_{\mathcal{H}}(2N)}{\delta}}$$

(the VC bound), and the complexity term $\to 0$ as $N \to \infty$ — learning is feasible.
Without a break point ($m_{\mathcal{H}}(N) = 2^N$, like convex sets), $\ln m_{\mathcal{H}}(2N) = 2N \ln 2$ grows linearly in $N$, the complexity term never shrinks, and the bound is vacuous: the hypothesis set is too expressive to generalise from finite data.
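The contrast is easy to see by evaluating the complexity term in both regimes. In the sketch below (my naming, $\delta = 0.05$), the polynomial case uses $\ln m_{\mathcal{H}}(2N) \le \ln((2N)^{d_{\mathrm{VC}}} + 1)$ while the no-break-point case uses $\ln 2^{2N} = 2N \ln 2$:

```python
from math import log, sqrt

def vc_penalty(N, ln_growth_2N, delta=0.05):
    """Complexity term of the VC bound: sqrt((8/N) * ln(4 m_H(2N) / delta))."""
    return sqrt(8 / N * (log(4) + ln_growth_2N - log(delta)))

d_vc = 3
for N in [100, 1_000, 10_000, 100_000]:
    ln_poly = log((2 * N) ** d_vc + 1)   # break point: polynomial growth function
    ln_expo = 2 * N * log(2)             # no break point: m_H(2N) = 2^(2N)
    print(N, round(vc_penalty(N, ln_poly), 3), round(vc_penalty(N, ln_expo), 3))
# The polynomial penalty shrinks toward 0; the exponential one stays near 3.3.
```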
Why Max Over Inputs?
The growth function takes the maximum over all input arrangements, not the typical case. Three reasons:
- Worst-case bound. The generalisation bound must hold uniformly over all distributions — no probabilistic assumption rules out adversarial input placement.
- Algorithm- and distribution-free. Removing the dependence on $P$ makes $m_{\mathcal{H}}$ a property of $\mathcal{H}$ alone, computable without reference to data.
- Structural complexity, not statistical. $m_{\mathcal{H}}(N)$ measures what $\mathcal{H}$ is capable of expressing, divorced from what data we happen to see.
Related
- dichotomy — the basic object being counted.
- vc-dimension — the largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$.
- break-point — the smallest $k$ at which $m_{\mathcal{H}}(k) < 2^k$.
- generalization-bound — what the polynomial growth function rescues for infinite $\mathcal{H}$.
Active Recall
Compute $m_{\mathcal{H}}(N)$ for positive rays and explain where the “$+1$” in “$N + 1$” comes from.
Sort the inputs along $\mathbb{R}$. The threshold creates a distinct dichotomy depending on which “gap” between consecutive inputs (or below the smallest, or above the largest) it sits in. There are $N + 1$ such gaps, hence $m_{\mathcal{H}}(N) = N + 1$. The “$+1$” comes from counting regions rather than points: the $N$ points leave $N - 1$ interior gaps, and the two extreme gaps — threshold below all points (all $+1$) or above all points (all $-1$) — bring the total number of regions to $N + 1$.
Why is the growth function for convex sets in $\mathbb{R}^2$ exactly $2^N$ for all $N$, while for 2D perceptrons it drops below $2^N$ at $N = 4$?
Convex sets are far more expressive than half-planes. Place $N$ points on a circle: for any binary labelling, take the convex hull of the $+1$ points — that’s a convex region containing exactly the $+1$ points and no others. Every labelling is realisable, so $m_{\mathcal{H}}(N) = 2^N$. Lines in $\mathbb{R}^2$ partition the plane into two half-planes; XOR-style labellings on the 4 corners of a square cannot be realised because the $+1$ points and the $-1$ points sit on crossing diagonals — their convex hulls intersect, so no line can separate them. The structural restriction “decision boundary is a line” begins to bite at $N = 4$.
Suppose $m_{\mathcal{H}}(4) < 2^4$. What is the smallest $k$ that might be a break point for $\mathcal{H}$, and why is “might” the right word?
The smallest possible break point is $k = 1$. We’ve shown $m_{\mathcal{H}}(4) < 2^4$, so $4$ is at least a break point — but maybe shattering already fails for fewer points. We’d need to check $m_{\mathcal{H}}(3)$, $m_{\mathcal{H}}(2)$, $m_{\mathcal{H}}(1)$ against $2^N$ to find the smallest such $k$. Concretely, the break point is the smallest $k$ for which $m_{\mathcal{H}}(k) < 2^k$; once you find it, all larger values are break points too.
Why does the max-over-inputs definition of $m_{\mathcal{H}}$ make the resulting generalisation bound worst-case, and what does that imply about how we should interpret it in practice?
The max ensures the count holds regardless of how training inputs are drawn — even if an adversary placed them. Real data is typically not adversarial, so empirical generalisation is often much better than the bound predicts. The bound is a sufficient condition for generalisation, not an estimate of typical-case behaviour. In practice, this is why we still rely on cross-validation for model selection: the worst-case bound is too conservative to drive practical decisions for complex models.