THE CRUX: Last week's generalisation bound said that learning generalises if the hypothesis set is finite. But every model we actually use — perceptrons, SVMs, neural networks — has infinite $|\mathcal{H}|$, so naively the bound says nothing. (1) What is the right finite quantity to replace $M$ (the number of hypotheses) with, and how does it lead to a non-vacuous bound? (2) Worst-case bounds give pessimistic guarantees that hold uniformly over all distributions. Is there a complementary, average-case picture of generalisation that's more useful in practice?

The two halves of week 9 each answer one. (1) The fix is to count dichotomies — distinct labellings $\mathcal{H}$ produces on $N$ specific inputs — rather than distinct hypotheses. The maximum dichotomy count, the growth function $m_{\mathcal{H}}(N)$, is bounded by $2^N$. The VC dimension $d_{\mathrm{VC}}$ is the largest $N$ for which $m_{\mathcal{H}}(N) = 2^N$ — the most points $\mathcal{H}$ can shatter. Finite $d_{\mathrm{VC}}$ implies polynomial $m_{\mathcal{H}}(N)$, the VC bound contracts as $N$ grows, and learning is feasible. For the perceptron in $\mathbb{R}^d$, $d_{\mathrm{VC}} = d + 1$ — the number of free parameters. (2) Switching from 0–1 loss to squared loss enables an exact decomposition of $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}]$ into bias and variance. Bias is how far the average hypothesis $\bar{g}$ is from the truth $f$; variance is how much individual hypotheses $g^{(\mathcal{D})}$ fluctuate across training sets. With noise, an additional irreducible term $\sigma^2$ appears. Complex models lower bias but raise variance — the formal trade-off in choosing model complexity, and the average-case companion to VC's worst-case bound.


Part 1: Replacing $M$ with Something Finite

The Problem

The generalisation bound in week 8 ended with a problem: it depends on $M = |\mathcal{H}|$, the number of hypotheses. For finite hypothesis sets the bound is

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\tfrac{1}{2N}\ln\tfrac{2M}{\delta}} \quad \text{with probability at least } 1 - \delta$$

— useful and informative. But every interesting model has $M = \infty$: linear classifiers in $\mathbb{R}^d$ have a continuum of weight vectors, and so do SVMs and neural networks. Plugging in $M = \infty$ gives a vacuous bound that says nothing. Are infinite hypothesis sets simply doomed?

The answer is no, and the fix is conceptually beautiful: most of the $M$ in the union bound is double-counting. Two hypotheses that label the training set identically have almost identical "bad events" — the events that their training and test errors disagree by more than $\epsilon$. Treating them as two events rather than one wastes half the budget.

Dichotomies and the Growth Function

The right quantity is what hypotheses look like on a fixed training set, not what they look like as full functions on $\mathcal{X}$. A dichotomy is a labelling pattern $(h(x_1), \dots, h(x_N))$ that some $h \in \mathcal{H}$ produces on $N$ specific inputs. Two hypotheses with different parameters but identical labellings on the training set are the same dichotomy.

The number of distinct dichotomies on $N$ specific inputs is at most $2^N$ — finite even when $\mathcal{H}$ is infinite. To remove dependence on the specific input choice, take the maximum over all input sets:

$$m_{\mathcal{H}}(N) = \max_{x_1, \dots, x_N} \big|\{(h(x_1), \dots, h(x_N)) : h \in \mathcal{H}\}\big|$$

This is the growth function $m_{\mathcal{H}}(N)$ — a property of $\mathcal{H}$ alone, distribution-free, algorithm-independent, and bounded by $2^N$.

Examples

Let's compute $m_{\mathcal{H}}(N)$ for some hypothesis sets:

  • Positive rays on $\mathbb{R}$, $h(x) = \mathrm{sign}(x - a)$: $m_{\mathcal{H}}(N) = N + 1$
  • Positive intervals, $h(x) = +1$ iff $a \le x \le b$: $m_{\mathcal{H}}(N) = \binom{N+1}{2} + 1$
  • 2D perceptrons (lines in $\mathbb{R}^2$), with $m_{\mathcal{H}}(3) = 8$ but $m_{\mathcal{H}}(4) = 14 < 16$
  • Convex sets in $\mathbb{R}^2$: $m_{\mathcal{H}}(N) = 2^N$ (always)

The first three grow polynomially. The last is exponential — and it stays at $2^N$ forever, no matter how large $N$ gets.
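To make dichotomy counting concrete, here is a minimal brute-force sketch (my own illustration, not from the source) that enumerates the labellings positive rays and positive intervals can produce on $N$ points and checks the counts against the formulas above:

```python
import itertools

def dichotomies_positive_rays(xs):
    """Distinct labellings of xs achievable by h(x) = +1 iff x >= a."""
    cuts = [min(xs) - 1.0] + [x + 1e-9 for x in xs]     # one threshold per gap
    return {tuple(+1 if x >= a else -1 for x in xs) for a in cuts}

def dichotomies_positive_intervals(xs):
    """Distinct labellings of xs achievable by h(x) = +1 iff a <= x <= b."""
    cuts = [min(xs) - 1.0] + [x + 1e-9 for x in xs]
    return {tuple(+1 if a <= x <= b else -1 for x in xs)
            for a, b in itertools.product(cuts, repeat=2)}

for N in (3, 5, 8):
    xs = [i + 0.5 for i in range(N)]                     # any N distinct points
    rays = len(dichotomies_positive_rays(xs))
    ivals = len(dichotomies_positive_intervals(xs))
    print(f"N={N}: rays {rays} (expect {N + 1}), "
          f"intervals {ivals} (expect {N * (N + 1) // 2 + 1}), vs 2^N = {2 ** N}")
```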

Shattering, Break Points, and VC Dimension

If $m_{\mathcal{H}}(N) = 2^N$, we say $\mathcal{H}$ shatters some $N$-point input set: it can produce every conceivable labelling of those points. A break point is the smallest $k$ at which shattering fails, $m_{\mathcal{H}}(k) < 2^k$. Once $\mathcal{H}$ has a break point at $k$, shattering also fails for every larger $N$.

The VC dimension $d_{\mathrm{VC}}(\mathcal{H})$ is the largest $N$ at which $\mathcal{H}$ can still shatter some $N$-point set — equivalently, one less than the break point.

ASIDE — Why "break point" is the right name

Imagine sliding $N$ from 1 upward. Initially $m_{\mathcal{H}}(N)$ keeps up with the explosion of binary labellings ($2^N$). At some specific $N = k$ — the break point — the hypothesis set "breaks", unable to keep producing every labelling. Once it breaks, it stays broken: the structural restriction propagates to all larger $N$. The break point marks the moment the hypothesis set's geometry kicks in.

For 2D perceptrons: $m_{\mathcal{H}}(3) = 8 = 2^3$ (any 3 non-collinear points are shattered) and $m_{\mathcal{H}}(4) = 14 < 16$ (the XOR labelling on a square is impossible). Break point 4, $d_{\mathrm{VC}} = 3$.
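These two facts are easy to verify by brute force. The sketch below (my own, assuming scipy is available) checks each of the $2^N$ labellings for linear separability by posing $y_i(\mathbf{w} \cdot \mathbf{x}_i + b) \ge 1$ as an LP feasibility problem:

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def linearly_separable(points, labels):
    """Is there (w, b) with y_i * (w . x_i + b) >= 1 for all i?  (LP feasibility check)"""
    pts = np.asarray(points, dtype=float)
    y = np.asarray(labels, dtype=float)
    # Rewrite as A_ub @ [w1, w2, b] <= -1, with rows A_ub = -y_i * [x_i1, x_i2, 1].
    A_ub = -y[:, None] * np.hstack([pts, np.ones((len(pts), 1))])
    res = linprog(c=np.zeros(3), A_ub=A_ub, b_ub=-np.ones(len(pts)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.success

def shattered(points):
    return all(linearly_separable(points, labs)
               for labs in itertools.product([-1, 1], repeat=len(points)))

triangle = [(0, 0), (1, 0), (0, 1)]            # 3 non-collinear points
square = [(0, 0), (1, 0), (0, 1), (1, 1)]      # the XOR labelling lives here
print("3 points shattered:", shattered(triangle))   # True  -> m_H(3) = 8
print("4 points shattered:", shattered(square))     # False -> break point at k = 4
```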

The Polynomial Bound

The deep fact: a single break point forces $m_{\mathcal{H}}(N)$ to grow polynomially. Concretely (Sauer–Shelah):

$$m_{\mathcal{H}}(N) \le \sum_{i=0}^{d_{\mathrm{VC}}} \binom{N}{i} \le N^{d_{\mathrm{VC}}} + 1$$

This is the moment the entire learning-theory story closes. We started with an exponential $2^N$ that ruined the bound; we end with a polynomial of degree $d_{\mathrm{VC}}$. The VC bound becomes

$$E_{\mathrm{out}}(g) \le E_{\mathrm{in}}(g) + \sqrt{\tfrac{8}{N}\ln\tfrac{4\, m_{\mathcal{H}}(2N)}{\delta}} \quad \text{with probability at least } 1 - \delta,$$

and with polynomial $m_{\mathcal{H}}$ the penalty term shrinks to $0$ as $N \to \infty$. Finite VC dimension ⇒ learning generalises.
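A few numbers make the contrast vivid. This is just an illustrative evaluation of the Sauer–Shelah bound above (with $d_{\mathrm{VC}} = 3$, as for 2D perceptrons):

```python
from math import comb

def growth_bound(N, d_vc):
    """Sauer-Shelah: m_H(N) <= sum_{i=0}^{d_vc} C(N, i)."""
    return sum(comb(N, i) for i in range(d_vc + 1))

for N in (10, 100, 1000):
    # 2^N written as a power of ten for comparison (2^N ~ 10^(0.301 N)).
    print(f"N={N:5d}: polynomial bound {growth_bound(N, 3):>12,}  vs  2^N ~ 1e{0.301 * N:.0f}")
```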

The Perceptron Result

For the perceptron in $\mathbb{R}^d$ with bias (inputs $\mathbf{x} = (1, x_1, \dots, x_d)$):

$$d_{\mathrm{VC}} = d + 1$$

The "+1" is the bias term. So the VC dimension equals the number of free parameters — $d$ slope coefficients plus 1 intercept. For "smooth" parameterised hypothesis sets, this rough equivalence (capacity ≈ parameter count) holds approximately.
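The "can shatter $d+1$ points" half of that claim has a one-line constructive proof: take the origin plus the $d$ standard basis vectors; the augmented data matrix is invertible, so every labelling is realised exactly. A sketch of the lower bound $d_{\mathrm{VC}} \ge d + 1$ (my own illustration; the matching upper bound needs a separate argument):

```python
import itertools
import numpy as np

d = 4
points = np.vstack([np.zeros(d), np.eye(d)])          # origin + d basis vectors
X = np.hstack([np.ones((d + 1, 1)), points])          # prepend the bias coordinate

# X is invertible, so for every labelling y there is (b, w) with X @ [b, w] = y exactly.
shattered = all(
    np.array_equal(np.sign(X @ np.linalg.solve(X, np.asarray(y))), np.asarray(y))
    for y in itertools.product([-1.0, 1.0], repeat=d + 1)
)
print(f"perceptron in R^{d} shatters {d + 1} points:", shattered)   # True -> d_vc >= d + 1
```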

The Sample Complexity Verdict

Inverting the VC bound gives a sample-complexity estimate: $N \ge \tfrac{8}{\epsilon^2}\ln\!\big(\tfrac{4\,((2N)^{d_{\mathrm{VC}}} + 1)}{\delta}\big)$. This is implicit in $N$ (it appears on both sides), so iterate.

For $\epsilon = 0.1$, $\delta = 0.1$, $d_{\mathrm{VC}} = 3$: the iteration converges to $N \approx 30{,}000$. For $d_{\mathrm{VC}} = 4$: $N \approx 40{,}000$. The pattern: $N \approx 10{,}000 \cdot d_{\mathrm{VC}}$ in theory.
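A minimal sketch of that fixed-point iteration, assuming the bound form quoted above:

```python
import math

def sample_complexity(d_vc, eps=0.1, delta=0.1, n0=1_000.0, iters=50):
    """Fixed-point iteration of N = (8/eps^2) * ln(4 * ((2N)^d_vc + 1) / delta)."""
    N = n0
    for _ in range(iters):
        N = (8 / eps**2) * math.log(4 * ((2 * N) ** d_vc + 1) / delta)
    return N

for d_vc in (3, 4, 5):
    print(f"d_vc = {d_vc}: N ~ {sample_complexity(d_vc):,.0f}")   # roughly 10,000 per unit of VC dimension
```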

In practice, the bound is loose. Rule of thumb: $N \ge 10\, d_{\mathrm{VC}}$ is usually enough. Real distributions are far more benign than worst-case theory presumes — the bound's distribution-free guarantee comes at a heavy quantitative price.

Margins, Fat Hyperplanes, and Why SVMs Generalise

A perceptron's $d_{\mathrm{VC}}$ scales with the input dimension, which becomes large after polynomial basis expansion (a degree-$Q$ transform in $d$ variables produces $\binom{d+Q}{Q}$ features). Naively, kernelised SVMs have astronomical $d_{\mathrm{VC}}$ — yet they generalise well. Why?

Because the SVM doesn't use all hyperplanes — only those with margin at least $\rho$ on the data. Restricting to fat hyperplanes shrinks the hypothesis set, which shrinks the VC dimension:

$$d_{\mathrm{VC}}(\text{margin-}\rho\text{ hyperplanes}) \le \left\lceil \tfrac{R^2}{\rho^2} \right\rceil + 1,$$

where $R$ is the radius of a ball containing the data. This is independent of the feature-space dimension — the kernel feature space can have infinite dimension without harming generalisation. The SVM's geometric criterion (maximise the margin) is exactly the criterion that controls capacity. This is the formal version of "SVMs work because of the margin."
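Because the bound is pure geometry, a tiny calculation makes the point (illustrative numbers, not from the source): halving the margin quadruples the capacity bound, and the feature-space dimension never enters.

```python
import math

def fat_hyperplane_dvc_bound(R, margin):
    """Capacity bound ceil(R^2 / margin^2) + 1 for margin-rho separators of data in a ball of radius R."""
    return math.ceil(R**2 / margin**2) + 1

for margin in (1.0, 0.5, 0.1):
    print(f"R = 1, margin = {margin}: d_vc <= {fat_hyperplane_dvc_bound(1.0, margin)}")
```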

Part 2: An Average-Case View

Why a Different Decomposition?

The VC bound is worst-case and uniform over distributions. It tells you: even for the worst distribution and the unluckiest dataset, $E_{\mathrm{out}}$ is at most $E_{\mathrm{in}}$ plus the VC penalty term. That's a strong guarantee — and a pessimistic one.

A complementary perspective asks: what is the typical $E_{\mathrm{out}}$ when training sets are drawn from a fixed distribution? This is an average over datasets $\mathcal{D}$ rather than a worst case, and (with squared error rather than 0–1 loss) admits a clean closed-form decomposition.

The Decomposition

Let $g^{(\mathcal{D})}$ be the hypothesis fit to dataset $\mathcal{D}$, and let

$$\bar{g}(\mathbf{x}) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(\mathbf{x})\big]$$

be the average hypothesis — what you'd get by training on infinitely many independently-drawn datasets and averaging. Inserting $\bar{g}$ into the squared error and expanding:

$$\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(\mathbf{x}) - f(\mathbf{x}))^2\big] = \underbrace{\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x}))^2\big]}_{\mathrm{var}(\mathbf{x})} + \underbrace{(\bar{g}(\mathbf{x}) - f(\mathbf{x}))^2}_{\mathrm{bias}(\mathbf{x})}$$

The cross term vanishes because $\mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(\mathbf{x}) - \bar{g}(\mathbf{x})] = 0$ by definition of $\bar{g}$. Taking expectation over $\mathbf{x}$ as well gives the bias–variance decomposition:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\mathrm{out}}(g^{(\mathcal{D})})\big] = \mathrm{bias} + \mathrm{var}$$

With label noise $y = f(\mathbf{x}) + \epsilon$, $\mathbb{E}[\epsilon] = 0$, $\mathbb{E}[\epsilon^2] = \sigma^2$, an extra irreducible floor appears:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\mathrm{out}}(g^{(\mathcal{D})})\big] = \sigma^2 + \mathrm{bias} + \mathrm{var}$$

What Bias and Variance Mean

  • Bias. How far from the truth $f$ is the typical hypothesis the learning process produces? Low if $\mathcal{H}$ contains hypotheses near $f$ and the algorithm finds them on average. High if $\mathcal{H}$ is structurally too simple — even averaging over infinitely many datasets, you can't approximate $f$.
  • Variance. How much do individual trained hypotheses $g^{(\mathcal{D})}$ fluctuate across training sets? Low if the algorithm is stable. High if it's overfitting random structure in each training set.

The classic dart-board picture: the bullseye is the truth, each dart is a learned hypothesis trained on a different dataset. Bias is the offset of the average dart from the bullseye; variance is the spread.

The Trade-Off, Concretely

Target $f(x) = \sin(\pi x)$ on $x \in [-1, 1]$, two sample points per training set, two hypothesis sets:

  • $\mathcal{H}_0$ — constant (horizontal) lines $h(x) = b$. Best fit: the average of the two $y$-values.
  • $\mathcal{H}_1$ — sloped lines $h(x) = ax + b$. Fit: the unique line through the two points.

Compute bias and variance over many random pairs of samples:

Hypothesis set                 bias    var
$\mathcal{H}_0$ (constant)     0.50    0.25
$\mathcal{H}_1$ (line)         0.21    1.69

The constant model wins ($0.50 + 0.25 = 0.75$ vs $0.21 + 1.69 = 1.90$) despite having higher bias — it's so much more stable that the variance saving wipes out the bias penalty.
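Those figures are straightforward to reproduce by Monte Carlo. The sketch below is my own illustration (the published numbers may have been computed with a slightly different protocol): it fits each model to many random two-point datasets, forms the average hypothesis, and measures bias and variance against $f(x) = \sin(\pi x)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(np.pi * x)

def bias_variance(fit, n_points=2, n_datasets=10_000, n_test=201):
    """Monte Carlo estimate of bias and variance for a fitting rule, target sin(pi x) on [-1, 1]."""
    x_test = np.linspace(-1, 1, n_test)
    preds = np.empty((n_datasets, n_test))
    for i in range(n_datasets):
        x = rng.uniform(-1, 1, n_points)
        preds[i] = fit(x, f(x), x_test)
    g_bar = preds.mean(axis=0)                        # the average hypothesis g-bar
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean((preds - g_bar) ** 2)
    return bias, var

def fit_constant(x, y, x_test):                       # H0: h(x) = b
    return np.full_like(x_test, y.mean())

def fit_line(x, y, x_test):                           # H1: h(x) = a x + b
    a, b = np.polyfit(x, y, 1)
    return a * x_test + b

for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    bias, var = bias_variance(fit)
    print(f"{name:8s}: bias ~ {bias:.2f}, var ~ {var:.2f}, total ~ {bias + var:.2f}")
```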

TIP — Choose model complexity by sample size

The above example reverses if you sample more points per training set. In the large-sample limit:

  • $\mathcal{H}_0$: bias stays near $0.50$, var shrinks toward $0$, total $\to \approx 0.50$.
  • $\mathcal{H}_1$: bias stays near $0.21$, var shrinks toward $0$, total $\to \approx 0.21$.

The line now wins. Optimal model complexity scales with $N$. Tiny datasets → simple models; abundant data → complex models. This generalises beyond the toy example: more data shrinks variance, eventually unmasking bias as the dominant error and rewarding flexibility.
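Re-running the sketch above with larger training sets (an illustrative choice of 100 points per dataset; fit_line then does a least-squares fit rather than interpolation) shows the reversal directly:

```python
# Variance collapses for both models; the line's lower bias now decides the contest.
for name, fit in [("constant", fit_constant), ("line", fit_line)]:
    bias, var = bias_variance(fit, n_points=100, n_datasets=2_000)
    print(f"{name:8s}: bias ~ {bias:.2f}, var ~ {var:.2f}, total ~ {bias + var:.2f}")
```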

VC vs Bias–Variance: Two Lenses

The two analyses describe the same phenomenon from different angles:

                       VC analysis                                    Bias–variance
Loss                   0–1                                            Squared
Over                   Worst case                                     Average over $\mathcal{D}$
Output                 Bound on $E_{\mathrm{out}} - E_{\mathrm{in}}$  Decomposition of $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}]$
Distribution-free?     Yes                                            No

VC tells you whether learning generalises uniformly across distributions; bias–variance tells you what kind of error is left, on average, with a specific algorithm. They live on top of the same learning curves — VC shades the gap between $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ as the worst-case generalisation gap; bias–variance splits the curves' contents into bias (the noise floor's structural component) and variance (the gap above it).

Learning Curves

Plotting expected $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ against $N$ — a learning curve — visualises everything above. Universal qualitative behaviour:

  • $E_{\mathrm{in}}$ rises as $N$ grows (the model can no longer interpolate);
  • $E_{\mathrm{out}}$ falls as $N$ grows (more data, better generalisation);
  • They converge to the same asymptote (the bias-plus-noise floor of the model class).

For linear regression with Gaussian noise of variance $\sigma^2$ and $d + 1$ parameters, the curves have an exact closed form:

$$\mathbb{E}\big[E_{\mathrm{in}}\big] = \sigma^2\Big(1 - \tfrac{d+1}{N}\Big), \qquad \mathbb{E}\big[E_{\mathrm{out}}\big] = \sigma^2\Big(1 + \tfrac{d+1}{N}\Big)$$

The gap decays as $2\sigma^2\tfrac{d+1}{N}$ — every parameter "costs" about $\tfrac{2\sigma^2}{N}$ of generalisation gap. As $N \to \infty$, both curves sit at the noise floor $\sigma^2$.
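Plugging a few sample sizes into that closed form (illustrative choices $\sigma^2 = 0.25$, $d = 5$; this is just the formula above, not a simulation) shows the $1/N$ gap decay and the shared asymptote:

```python
sigma2, d = 0.25, 5
for N in (10, 20, 50, 100, 1000):
    e_in = sigma2 * (1 - (d + 1) / N)
    e_out = sigma2 * (1 + (d + 1) / N)
    print(f"N={N:5d}: E_in ~ {e_in:.3f}, E_out ~ {e_out:.3f}, gap ~ {e_out - e_in:.3f}")
# Both columns approach the noise floor sigma^2 = 0.25 from opposite sides.
```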

Empirical learning curves (with held-out validation error as a proxy for $E_{\mathrm{out}}$) are one of the most useful debugging tools in ML — they tell you whether more data, more capacity, or stronger regularisation is the right next step.
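A sketch of how such a curve can be generated for the linear-regression case (my own illustration; the dimension, noise level, and sample sizes are arbitrary), using a fresh held-out set as the proxy for $E_{\mathrm{out}}$:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_learning_curve(d=5, sigma=0.5, n_trials=200, sizes=(10, 20, 50, 100, 500)):
    """Average training / held-out MSE of least-squares regression vs training-set size."""
    w_true = rng.normal(size=d + 1)
    for N in sizes:
        e_in, e_val = [], []
        for _ in range(n_trials):
            X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
            y = X @ w_true + sigma * rng.normal(size=N)
            w = np.linalg.lstsq(X, y, rcond=None)[0]
            e_in.append(np.mean((X @ w - y) ** 2))
            X_val = np.hstack([np.ones((1000, 1)), rng.normal(size=(1000, d))])
            y_val = X_val @ w_true + sigma * rng.normal(size=1000)
            e_val.append(np.mean((X_val @ w - y_val) ** 2))
        print(f"N={N:4d}: E_in ~ {np.mean(e_in):.3f}, E_val ~ {np.mean(e_val):.3f}")

empirical_learning_curve()   # the two columns converge to the noise floor sigma^2 = 0.25
```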


Concepts Introduced This Week

  • dichotomy — labelling pattern that a hypothesis produces on a finite set of inputs; the bridge from an infinite $\mathcal{H}$ to a finite count.
  • growth-function — $m_{\mathcal{H}}(N)$, the maximum number of dichotomies on $N$ inputs; bounded by $2^N$, often polynomial for structured $\mathcal{H}$.
  • break-point — smallest $k$ for which $m_{\mathcal{H}}(k) < 2^k$; equals $d_{\mathrm{VC}} + 1$.
  • vc-dimension — largest $N$ for which $\mathcal{H}$ shatters some $N$-point input set; for the perceptron in $\mathbb{R}^d$, $d_{\mathrm{VC}} = d + 1$. Finite $d_{\mathrm{VC}}$ certifies feasibility of learning.
  • bias-variance-decomposition — $\mathbb{E}_{\mathcal{D}}[E_{\mathrm{out}}] = \sigma^2 + \mathrm{bias} + \mathrm{var}$ for squared loss; complements VC's worst-case analysis.
  • learning-curve — expected $E_{\mathrm{in}}$ and $E_{\mathrm{out}}$ vs $N$; the shape distinguishes underfitting, overfitting, and the noise floor.

Connections

  • Builds on week-08: the generalisation bound ended with the question "what to do when $|\mathcal{H}| = \infty$?" This week answers it. Hoeffding + union bound generalises, with $M$ replaced by $m_{\mathcal{H}}(2N)$ — the dichotomy-counting generalisation.
  • Builds on non-linear-transformation: a $Q$-th order polynomial transform raises the VC dimension to roughly the dimension of the expanded feature space — a quantitative cost for the non-linearity. Now we have the framework to say why over-aggressive basis expansion overfits.
  • Builds on support-vector-machine: the margin restricts the hypothesis class to fat hyperplanes, lowering effective $d_{\mathrm{VC}}$ to at most $\lceil R^2/\rho^2 \rceil + 1$ — independent of input dimension. This is the formal sense in which SVMs "regularise via margin".
  • Sets up week-10: overfitting and underfitting examined empirically, plus regularisation and cross-validation as the practical tools to navigate the bias–variance trade-off you can't measure directly.

Open Questions

  • Why do deep networks generalise despite enormous $d_{\mathrm{VC}}$? Modern architectures have millions of parameters trained on hundreds of thousands of examples — VC theory predicts catastrophic overfitting that doesn't happen. Answers from "implicit regularisation" of SGD, flat-minima theory, and PAC-Bayes bounds are active research.
  • Does the worst-case nature of VC analysis ever lead practitioners astray? The factor of 1,000 between theoretical sample complexity ($N \approx 10{,}000\, d_{\mathrm{VC}}$) and the practical rule ($N \approx 10\, d_{\mathrm{VC}}$) is uncomfortable — if you trusted the bound, you'd give up on learning long before it becomes possible.
  • How do you actually estimate bias and variance in practice? You can’t from a single training set — bootstrap resampling or cross-validation give approximate estimates. The decomposition is more often used conceptually than measured.
  • What's the right way to think about effective capacity for regularised models? Ridge regression with large $\lambda$ has the same $\mathcal{H}$ as unregularised least squares but very different effective behaviour. "Effective dimension" or "effective $d_{\mathrm{VC}}$" formalisations attempt this; we'll touch on regularisation directly next week.