Given a hypothesis $h \in \mathcal{H}$ and inputs $x_1, \dots, x_N \in \mathcal{X}$, the dichotomy generated by $h$ on those inputs is the tuple of labels $(h(x_1), \dots, h(x_N)) \in \{-1, +1\}^N$. The set of all dichotomies replaces the infinite cardinality $|\mathcal{H}| = \infty$ with a finite count of at most $2^N$.
Definition
A hypothesis $h : \mathcal{X} \to \{-1, +1\}$ maps every point in $\mathcal{X}$ to $\pm 1$. The dichotomy of $h$ on a fixed set of inputs $x_1, \dots, x_N$ is what $h$ looks like restricted to those inputs:

$$(h(x_1), h(x_2), \dots, h(x_N)) \in \{-1, +1\}^N.$$
Two different hypotheses can produce the same dichotomy: any two lines in $\mathbb{R}^2$ that label the same training points the same way produce the same dichotomy even if they’re different decision boundaries. The dichotomy is a coarser equivalence class than the hypothesis itself.
The set of all dichotomies that $\mathcal{H}$ can implement on these inputs is

$$\mathcal{H}(x_1, \dots, x_N) = \big\{\, (h(x_1), \dots, h(x_N)) : h \in \mathcal{H} \,\big\}.$$
This set has size at most $2^N$ — the total number of binary labellings of $N$ points — and is finite even when $|\mathcal{H}| = \infty$.
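A tiny illustration of the definition, a sketch with made-up points and weights, assuming perceptron hypotheses of the form $h_{w,b}(x) = \mathrm{sign}(w \cdot x + b)$:

```python
import numpy as np

# Three fixed training inputs (arbitrary choice for illustration).
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

def dichotomy(w, b):
    """The dichotomy of the perceptron h(x) = sign(w.x + b) on the fixed inputs X."""
    return tuple(np.where(X @ w + b >= 0, 1, -1))

# Two different decision boundaries...
d1 = dichotomy(np.array([1.0, 1.0]), -0.5)
d2 = dichotomy(np.array([2.0, 1.5]), -0.9)

# ...that induce the same dichotomy: distinct hypotheses, one labelling.
print(d1, d2, d1 == d2)   # (-1, 1, 1) (-1, 1, 1) True
```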
Why It Matters
The generalisation bound from a union bound is

$$\mathbb{P}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 2M\,e^{-2\epsilon^2 N},$$

where $M = |\mathcal{H}|$. For continuously parameterised hypothesis classes (perceptrons, SVMs, neural networks), $M = \infty$ and the bound is vacuous.
The fix: most of the $M$ in the union bound is wasteful. If two hypotheses produce the same labelling on the training set, their bad events overlap massively — in the union bound we’re double-counting them.
Counting distinct dichotomies rather than distinct hypotheses gives a much smaller, finite quantity, and (with a more careful argument than the simple union bound) replaces $M$ in the generalisation bound. The maximum number of dichotomies over all possible choices of $N$ inputs is the growth function $m_{\mathcal{H}}(N)$.
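For orientation, one standard form of the resulting bound (the VC generalisation bound; the extra factors come from the more careful argument mentioned above, not derived here) is

$$\mathbb{P}\big[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\big] \;\le\; 4\, m_{\mathcal{H}}(2N)\, e^{-\frac{1}{8}\epsilon^2 N},$$

which is non-vacuous whenever $m_{\mathcal{H}}(N)$ grows polynomially in $N$.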
A Concrete Count
Take $\mathcal{H}$ = all lines in $\mathbb{R}^2$ (i.e., 2D perceptrons). Place $N = 3$ inputs in general position. How many of the $2^3 = 8$ possible labellings can lines actually realise?
All 8: for any binary labelling of three non-collinear points, there’s a line that puts the $+1$ points on one side and the $-1$ points on the other. So $m_{\mathcal{H}}(3) = 8 = 2^3$.
Now place $N = 4$ points. Of the $2^4 = 16$ labellings, lines can realise only 14. The two unrealisable ones are the “XOR” labellings: $(+1, -1, +1, -1)$ and $(-1, +1, -1, +1)$ around a square — no line separates them. The infinite hypothesis set “all lines in $\mathbb{R}^2$” produces only 14 dichotomies on 4 points.
| $N$ | All possible labellings ($2^N$) | Lines in $\mathbb{R}^2$ realise |
|---|---|---|
| 1 | 2 | 2 |
| 2 | 4 | 4 |
| 3 | 8 | 8 |
| 4 | 16 | 14 |
The structural restriction begins at $N = 4$. This is the break-point behaviour that ultimately makes $m_{\mathcal{H}}(N)$ polynomial rather than exponential, and learning with an infinite $\mathcal{H}$ feasible.
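These counts are easy to check numerically. The sketch below samples many random lines and records the distinct labellings they produce on the two point sets; the point coordinates, seed, and sample count are my own choices. Random sampling can only undercount, but with this many samples it reliably finds every realisable labelling here.

```python
import numpy as np

rng = np.random.default_rng(0)

def count_dichotomies(points, n_samples=200_000):
    """Count distinct labellings that random lines sign(w.x + b) produce on `points`."""
    seen = set()
    for _ in range(n_samples):
        w = rng.normal(size=2)
        b = rng.normal()
        seen.add(tuple(np.where(points @ w + b >= 0, 1, -1)))
    return len(seen)

three_pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])               # non-collinear
square    = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])   # corners in cyclic order

print(count_dichotomies(three_pts))  # expect 8  (= 2^3: three points are shattered)
print(count_dichotomies(square))     # expect 14 (16 minus the two XOR labellings)
```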
Shattering
When $\mathcal{H}$ produces all $2^N$ possible dichotomies on a particular set of inputs, we say $\mathcal{H}$ shatters those inputs. Lines in $\mathbb{R}^2$ shatter any 3 non-collinear points, but cannot shatter any 4 points in general position. Convex sets in $\mathbb{R}^2$, on the other hand, shatter any number of points placed on a circle — so the dichotomy count stays at $2^N$ for all $N$.
Shattering is the strongest possible expressivity claim: there is no labelling we can’t fit. The largest $N$ for which $\mathcal{H}$ can shatter some set of $N$ inputs is the VC dimension of $\mathcal{H}$.
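The convex-sets claim can also be sanity-checked. A minimal sketch, assuming $N = 6$ equally spaced points on the unit circle and using the convex hull of the positively labelled points as the candidate convex set (if any negative point fell inside that hull, no convex set could realise the labelling):

```python
import numpy as np
from itertools import product
from scipy.spatial import Delaunay

# N equally spaced points on the unit circle (my choice of N and placement).
N = 6
theta = 2 * np.pi * np.arange(N) / N
pts = np.c_[np.cos(theta), np.sin(theta)]

realised = 0
for labels in product([-1, 1], repeat=N):
    labels = np.array(labels)
    pos, neg = pts[labels == 1], pts[labels == -1]
    if len(pos) < 3 or len(neg) == 0:
        # Degenerate cases: an empty set, a single point, or a chord between two
        # circle points never captures another circle point, and the all-positive
        # labelling is realised by the whole disc.
        realised += 1
        continue
    hull = Delaunay(pos)                      # triangulation of the positive points' hull
    if np.all(hull.find_simplex(neg) == -1):  # -1 means "outside the hull"
        realised += 1

print(realised, 2 ** N)  # expect 64 64: convex sets shatter the 6 circle points
```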
Related
- growth-function — $m_{\mathcal{H}}(N)$, the maximum of $|\mathcal{H}(x_1, \dots, x_N)|$ over all input choices; replaces $M$ in the generalisation bound.
- vc-dimension — the largest $N$ for which some set of $N$ inputs is shattered.
- break-point — the smallest $N$ at which shattering fails for every input set.
- generalization-bound — what the dichotomy-counting trick rescues.
Active Recall
Why isn't the size of $\mathcal{H}(x_1, \dots, x_N)$ always equal to $|\mathcal{H}|$, even when $\mathcal{H}$ is infinite?
Because many distinct hypotheses produce the same labelling on a fixed finite input set. Two slightly rotated lines in $\mathbb{R}^2$ that happen to put the same training points on the same sides are different hypotheses but the same dichotomy. The dichotomy count collapses the equivalence class “labels the training set identically” — it is capped at $2^N$ regardless of how many distinct hypotheses exist.
Place 4 points at the corners of a square in $\mathbb{R}^2$. Which two of the 16 binary labellings can no linear classifier (perceptron in 2D) realise, and why?
The XOR-style labellings: $+1$ on one diagonally opposite pair of corners and $-1$ on the other pair, and the complement labelling. A line in $\mathbb{R}^2$ partitions the plane into two half-planes; putting diagonally opposite corners on the same side and the other two corners on the other side would require a non-linear boundary. So $m_{\mathcal{H}}(4) = 14 < 2^4 = 16$, which is the source of the perceptron’s break point at 4.
Why does counting dichotomies rather than hypotheses help us bound the probability of bad generalisation?
The naive union bound multiplies the per-hypothesis bad-event probability by $M = |\mathcal{H}|$, which is infinite for any continuously parameterised model. But “bad event for $h_1$” and “bad event for $h_2$” almost coincide when $h_1$ and $h_2$ produce the same labelling on the training set. A finer accounting collapses each equivalence class of identically-behaving hypotheses into a single dichotomy. There are at most $2^N$ dichotomies, and for “structured” $\mathcal{H}$ far fewer — often polynomial in $N$. The corrected union bound uses this dichotomy count rather than $M$.