THE CRUX: Last week's regularisation gave us a family of trained models indexed by $\lambda$, but how do we actually pick the right $\lambda$? More broadly: with the regularisation parameter, model class, kernel, and other knobs all unknown, how do we choose between candidate models without the test set we promised never to touch? And once we've answered that: what general principles, beyond specific algorithms, govern whether a learning experiment was honest in the first place?
The two halves of week 11 each answer one. (1) Validation holds out a portion of the training data, trains on the rest, and uses the held-out part to estimate $E_{\text{out}}$ for each candidate model. The validation error is an unbiased estimate of $E_{\text{out}}$ with variance at most $1/(4K)$ for binary classification; large $K$ shrinks variance but starves the trained model. The standard rule $K = N/5$ splits the difference. After validation picks the winner, retrain on all $N$ examples to produce the final hypothesis. Cross-validation generalises validation to use all the data: leave-one-out trains $N$ times on $N-1$ examples each, $V$-fold partitions $\mathcal{D}$ into $V$ chunks ($V = 10$ is the practical default). (2) Three meta-principles wrap up the module's theory. Occam's Razor: simpler is better, where "simpler" means small VC dimension and "better" means lower $E_{\text{out}}$. Sampling Bias: training data from the wrong distribution → no theorem rescues you. Data Snooping: any influence of the test set on any decision contaminates it. Together they're the discipline that turns the module's machinery into honest learning.
Part 1: How Do You Actually Pick $\lambda$?
The Model Selection Problem
Last week we built regularisation: minimise the augmented error $E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\,\Omega(w)$ (weight decay: $\Omega(w) = w^\top w$) for some choice of $\lambda$. But "for some choice of $\lambda$" is doing heavy lifting. The optimal $\lambda$ depends on the noise level, the target complexity, and the sample size $N$, and the first two are exactly the quantities we don't observe.
More generally: we have $M$ candidate models — different hypothesis classes (linear, quadratic, kernel SVM, …), different regularisers, different $\lambda$ values, different feature transformations. How do we pick?
Three obvious-looking choices, all wrong:
- Pick by smallest $E_{\text{in}}$: always selects the most expressive class, since a richer $\mathcal{H}$ can always fit training data better. This is overfitting through the back door: minimising $E_{\text{in}}$ over the union $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of the union, and the most complex member always wins.
- Pick by smallest $E_{\text{test}}$: would give an unbiased generalisation estimate with $O(1/\sqrt{N_{\text{test}}})$ error, but the test set is locked in the boss's safe; using it for selection would be infeasible and cheating.
- Pick by intuition / cross-paper folklore: invisible bias accumulates; not a procedure.
Validation: A Held-Out Slice of Training Data
The fix is structural: split $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and a validation set $\mathcal{D}_{\text{val}}$ ($K$ examples). Train on $\mathcal{D}_{\text{train}}$ to get $g^-$. Then compute
$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{x_n \in \mathcal{D}_{\text{val}}} e\big(g^-(x_n),\, y_n\big).$$
Three error types now sit on a spectrum:
| Error | Source | Status |
|---|---|---|
| $E_{\text{in}}(g^-)$ | $\mathcal{D}_{\text{train}}$ | Feasible, contaminated (used for training) |
| $E_{\text{val}}(g^-)$ | $\mathcal{D}_{\text{val}}$ | Feasible, clean (held out before training) |
| $E_{\text{test}}$ | Test set | Infeasible, clean (locked away) |
The validation set is the on-hand simulation of the test set. Discipline is critical: once $\mathcal{D}_{\text{val}}$ is used to select something, it's no longer clean for any subsequent decision.
Mean and Variance
Two key statistical properties of $E_{\text{val}}(g^-)$:
Unbiased. $\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = E_{\text{out}}(g^-)$. The validation error is an unbiased estimate of the out-of-sample error. The proof is two lines of linearity of expectation, requiring only that $\mathcal{D}_{\text{val}}$ wasn't used to train $g^-$.
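A sketch of that argument, assuming each validation point is drawn i.i.d. from the data distribution and independently of $\mathcal{D}_{\text{train}}$:
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} \mathbb{E}\big[e\big(g^-(x_n), y_n\big)\big] = \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) = E_{\text{out}}(g^-).$$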
Variance shrinks as $1/K$. For i.i.d. validation samples,
$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\operatorname{Var}\big[e(g^-(x), y)\big] \le \frac{1}{4K},$$
and the bound saturates when per-example errors are Bernoulli(1/2), the worst case for binary classification, since $\operatorname{Var}[\text{Bernoulli}(p)] = p(1-p) \le \tfrac{1}{4}$.
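A quick numerical sanity check of the $1/(4K)$ bound, replacing the classifier with worst-case Bernoulli(1/2) per-example errors (purely illustrative; all values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 100, 20_000          # validation-set size and number of repeated experiments

# Worst case for binary classification: each per-example error is Bernoulli(1/2).
errors = rng.integers(0, 2, size=(trials, K))   # 0/1 error indicators for K validation points
E_val = errors.mean(axis=1)                      # one E_val estimate per trial

print("empirical variance of E_val:", E_val.var())   # close to 0.0025
print("bound 1/(4K):               ", 1 / (4 * K))   # exactly  0.0025
```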
The K Trade-Off
Large $K$ is good for the variance bound but bad for the training set:
- Small $K$: $g^-$ trained on nearly $N$ examples is close to $g$. But $E_{\text{val}}$ has high variance; the estimate is noisy.
- Large $K$: $E_{\text{val}}$ is precise. But $g^-$ trained on only $N-K$ examples is much worse than the $g$ you'd actually deploy.
The expected $E_{\text{out}}$ as a function of $K$ has wide error bars at both extremes, with a sweet spot in the middle. Practical rule of thumb: $K = N/5$.
TIP — Always retrain on all $N$ before reporting
Validation chooses the model (the winning hypothesis class, $\lambda$, etc.). After that decision, retrain on $\mathcal{D}$ (all $N$ examples) to produce the final $g$. More data → better generalisation: $E_{\text{out}}(g) \le E_{\text{out}}(g^-)$. Reporting $g^-$ instead is leaving signal on the table.
Validation for Model Selection
The protocol:
- Split $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}}$.
- For each candidate model $\mathcal{H}_m$ ($m = 1, \dots, M$): train $g^-_m$ on $\mathcal{D}_{\text{train}}$, compute $E_m = E_{\text{val}}(g^-_m)$.
- Pick $m^* = \arg\min_m E_m$.
- Retrain $\mathcal{H}_{m^*}$ on all of $\mathcal{D}$ to get $g_{m^*}$. Report it (see the sketch after this list).
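A minimal sketch of the four steps, assuming a toy 1-D regression task with polynomial candidate models and the $K = N/5$ split (the data, candidate list, and squared-error measure are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)   # toy target plus noise

# Step 1: split D into D_train (N - K) and D_val (K), with K = N/5.
K = N // 5
perm = rng.permutation(N)
val_idx, train_idx = perm[:K], perm[K:]

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns coefficients."""
    return np.polyfit(x, y, degree)

def sq_err(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Step 2: candidate models are polynomial degrees 1..8; train g_m^- on D_train, score on D_val.
candidates = range(1, 9)
E_val = {m: sq_err(fit_poly(x[train_idx], y[train_idx], m), x[val_idx], y[val_idx])
         for m in candidates}

# Steps 3-4: pick the winner by validation error, then retrain it on ALL N examples.
m_star = min(E_val, key=E_val.get)
g_final = fit_poly(x, y, m_star)
print("chosen degree:", m_star, " E_val:", round(E_val[m_star], 4))
```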
The generalisation guarantee:
$$E_{\text{out}}(g_{m^*}) \;\le\; E_{\text{out}}(g^-_{m^*}) \;\le\; E_{\text{val}}(g^-_{m^*}) + O\!\left(\sqrt{\frac{\ln M}{K}}\right).$$
The $\ln M$ rather than $M$ is benign: even a thousand candidates costs only $\ln 1000 \approx 7$ in the numerator.
Why does selecting by $E_{\text{val}}$ work, while selecting by $E_{\text{in}}$ does not?
Selecting by $E_{\text{val}}$ uses an unbiased estimate of $E_{\text{out}}$ for each candidate: comparing apples to apples. Selecting by $E_{\text{in}}$ uses a biased estimate (the data was used to fit $g_m$, so $E_{\text{in}}(g_m)$ is downward-biased). Worse, the bias depends on hypothesis-set complexity: more expressive classes get more shrinkage, so the comparison systematically favours overcomplex models. Validation breaks the cycle by evaluating on data the algorithm hasn't seen.
Validation vs Regularisation
Both attack the same equation:
$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}.$$
- Regularisation estimates the overfit penalty directly (the augmented error $E_{\text{aug}}$ proxies the penalty term).
- Validation estimates $E_{\text{out}}$ directly (held-out data bypasses the penalty entirely).
Two complementary cures, two sides of the same equation. Regularisation gives you a family of trained models indexed by $\lambda$; validation picks the best $\lambda$ from that family. Together they're a complete model-selection pipeline.
Part 2: Cross-Validation — Use All The Data
Single-split validation has a frustrating constraint: data spent on $\mathcal{D}_{\text{val}}$ isn't available for training. Cross-validation removes the trade-off by using every example for both training and validation, just not at the same time.
Leave-One-Out
The extreme case $K = 1$: for each $n = 1, \dots, N$, hold out example $(x_n, y_n)$ alone and train $g^-_n$ on the remaining $N-1$ examples, giving the per-example error
$$e_n = e\big(g^-_n(x_n),\, y_n\big).$$
Average:
$$E_{\text{cv}} = \frac{1}{N} \sum_{n=1}^{N} e_n.$$
Theorem. $E_{\text{cv}}$ is an unbiased estimate of $\bar{E}_{\text{out}}(N-1)$, the expected out-of-sample error when training with $N-1$ examples. The proof factorises the expectation: $\mathbb{E}_{\mathcal{D}}[E_{\text{cv}}] = \frac{1}{N}\sum_n \mathbb{E}_{\mathcal{D}}[e_n] = \frac{1}{N}\sum_n \bar{E}_{\text{out}}(N-1) = \bar{E}_{\text{out}}(N-1)$.
For practical purposes, since $\bar{E}_{\text{out}}(N-1) \approx \bar{E}_{\text{out}}(N)$ for moderate $N$, practitioners say $E_{\text{cv}}$ is "almost unbiased" for $E_{\text{out}}(g)$.
The Computational Catch
LOOCV trains the model $N$ times. For a deep network, infeasible. Special case: linear regression has a closed-form LOOCV via the hat matrix:
$$e_n = \left(\frac{y_n - \hat{y}_n}{1 - H_{nn}}\right)^2,$$
where $H = X(X^\top X + \lambda I)^{-1} X^\top$ and $\hat{y} = Hy$. One ridge fit gives all $N$ leave-one-out errors. Useful trick, but it doesn't generalise beyond linear regression.
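A small numpy check of the shortcut, on synthetic data with an assumed ridge penalty $\lambda$ (dimensions and values are made up); the closed form should agree with a brute-force leave-one-out loop:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 50, 3, 0.1                   # sample size, dimension, ridge parameter (illustrative)
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=N)

# Hat matrix for ridge: H = X (X^T X + lam I)^{-1} X^T, fitted once on all N points.
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
H = X @ A_inv @ X.T
y_hat = H @ y

# Closed-form leave-one-out squared errors: e_n = ((y_n - yhat_n) / (1 - H_nn))^2.
e_closed = ((y - y_hat) / (1 - np.diag(H))) ** 2

# Brute-force check: retrain N times, each time without example n.
e_loop = np.empty(N)
for n in range(N):
    mask = np.arange(N) != n
    w = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(d), X[mask].T @ y[mask])
    e_loop[n] = (y[n] - X[n] @ w) ** 2

print(np.allclose(e_closed, e_loop))     # True: one fit reproduces all N leave-one-out errors
print("E_cv =", e_closed.mean())
```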
V-Fold: The Practical Compromise
Partition $\mathcal{D}$ into $V$ equal-size parts $\mathcal{D}_1, \dots, \mathcal{D}_V$. For each fold $v$, train on the other $V-1$ parts and validate on the $v$-th. Average the per-fold errors:
$$E_{\text{cv}} = \frac{1}{V} \sum_{v=1}^{V} E_{\text{val}}^{(v)}.$$
LOOCV is the special case $V = N$. Practical rule of thumb: $V = 10$.
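A from-scratch sketch of the $V$-fold loop for one fixed model, here a ridge fit with an arbitrary $\lambda$ on synthetic data (all specifics are illustrative):

```python
import numpy as np

def vfold_cv(X, y, V, fit, error):
    """Generic V-fold CV: partition indices into V folds, train on V-1 folds, validate on the rest."""
    idx = np.random.default_rng(0).permutation(len(y))
    folds = np.array_split(idx, V)
    errs = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return np.mean(errs)               # E_cv: average of the V per-fold validation errors

# Example usage with a ridge fit (lambda fixed at 0.1, purely illustrative).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=200)

ridge_fit = lambda X, y: np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ y)
mse = lambda w, X, y: np.mean((X @ w - y) ** 2)
print("10-fold CV estimate:", vfold_cv(X, y, V=10, fit=ridge_fit, error=mse))
```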
ASIDE — Why 10-fold is the de facto standard
10-fold CV gives an estimate of $E_{\text{out}}$ that is statistically nearly identical to LOOCV (each training set has $0.9N$ examples, very close to $N-1$ for large $N$), but with only 10 trainings instead of $N$. The variance is also often lower than LOOCV's, because LOOCV's per-fold errors are highly correlated (any two of its training sets differ in only two examples). 10-fold strikes the bias-variance-compute trade-off well enough that you should usually start there and only deviate with a specific reason.
Cross-Validation for Hyperparameter Selection
The standard recipe for choosing $\lambda$ in ridge or lasso:
- Pick a grid of candidate $\lambda$'s, e.g. a logarithmically spaced grid.
- For each $\lambda$, compute $E_{\text{cv}}(\lambda)$ via 10-fold CV.
- Pick $\lambda^* = \arg\min_\lambda E_{\text{cv}}(\lambda)$.
- Retrain on all of $\mathcal{D}$ with $\lambda^*$. Report it (see the sketch after this list).
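One way to run this recipe in practice, sketched with scikit-learn's Ridge, KFold, and cross_val_score (the library choice, the grid, the synthetic data, and the squared-error scoring are assumptions of this sketch, not the lecture's):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

lambdas = np.logspace(-4, 2, 13)              # step 1: candidate grid (illustrative)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Step 2: E_cv(lambda) via 10-fold CV; sklearn reports negated MSE, so flip the sign.
E_cv = {lam: -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()
        for lam in lambdas}

lam_star = min(E_cv, key=E_cv.get)            # step 3: pick the minimiser
final_model = Ridge(alpha=lam_star).fit(X, y) # step 4: retrain on all of D with lambda*
print("lambda* =", lam_star, " E_cv =", round(E_cv[lam_star], 4))
```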
This is the practical answer to "how do you pick $\lambda$?", and it's been waiting since week 8, when ridge was first introduced.
Part 3: Three Principles for Honest Learning
The module’s theoretical machinery — VC bound, bias–variance, regularisation, validation — all rests on certain assumptions. The Tuesday lecture wraps up by formalising three meta-principles whose violation breaks every theorem.
Occam’s Razor — The Simplest Model That Fits Wins
The principle, paraphrased from William of Occam (1287–1347): entities must not be multiplied beyond necessity.
In ML terms: among hypotheses that fit the data, prefer the simplest. Two senses of “simple”:
- Simple hypothesis $h$: small complexity of the single object (few parameters, short description).
- Simple hypothesis set $\mathcal{H}$: small complexity of the class (small VC dimension, few effective hypotheses).
The two are related: a hypothesis drawn from a low-complexity set is automatically a low-complexity hypothesis. Both senses point toward the same practical advice: start linear, then ask whether the data is being over-modelled before adding capacity.
“Better” in this context means better $E_{\text{out}}$, not aesthetic elegance. The VC bound's complexity penalty is smaller for simpler classes, so when $E_{\text{in}}$ is comparable, simpler wins.
Sampling Bias — Wrong Distribution, No Recovery
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
The VC bound's i.i.d. assumption is load-bearing: training and test data must come from the same joint distribution. Violate it and the bound is silent — there's no theorem that says $E_{\text{out}}$ on the deployment distribution is close to the error measured on a different one.
The lecture's philosophical version: studying hard for a maths exam and then being tested on English gives no guarantee about the English mark. The classic real-world example: the 1936 Literary Digest poll, drawn largely from telephone and car owners (a wealthier-than-average slice of the electorate), predicted Landon over Roosevelt in a landslide; Roosevelt won in a landslide. The mismatch between the polled distribution and the voting distribution made the prediction useless, regardless of how "well-trained" the model was on the (biased) sample.
Data Snooping — The Insidious One
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
The strongest of the three. Any influence of the test set on any decision — preprocessing, feature engineering, hyperparameter choice — counts. Once the test set has been "snooped," it's no longer an unbiased estimator of $E_{\text{out}}$.
The lecture's vivid example: a financial trading strategy whose preprocessing snooped the test period (test-period statistics used to normalise the data) shows an impressive cumulative profit in back-testing. The same strategy with proper normalisation (training-period statistics only) loses money. The "model" learned the test set's statistics, not signal.
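A sketch of where the leak happens, using a plain standardisation step on synthetic data (the pipeline and numbers are made up; the only point is which rows the mean and standard deviation come from):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 8))   # stand-in for a feature matrix over time
X_train, X_test = X[:800], X[800:]                    # chronological split: last 200 rows are the test period

# SNOOPING: statistics computed on ALL rows, including the test period.
mu_all, sd_all = X.mean(axis=0), X.std(axis=0)
X_test_snooped = (X_test - mu_all) / sd_all           # test data has leaked into preprocessing

# CLEAN: statistics computed on the training rows only, then applied to the test rows.
mu_tr, sd_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_train_clean = (X_train - mu_tr) / sd_tr
X_test_clean = (X_test - mu_tr) / sd_tr               # the test set never influences any statistic
```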
The clean workflow:
- Split off $\mathcal{D}_{\text{test}}$ at the very start.
- Lock $\mathcal{D}_{\text{test}}$ in the safe. Don't normalise with it. Don't peek at it.
- Use $\mathcal{D}_{\text{train}}$ for everything (with internal validation/CV splits as needed).
- One evaluation on $\mathcal{D}_{\text{test}}$ at the end.
If you decide to “try one more thing” after that single test evaluation, the test set is now contaminated — for the next round, you’d need a fresh held-out set.
COMMON MISCONCEPTION — Validation = test
Validation and test sets play different roles. Validation is part of the training pipeline; you use it to select hyperparameters, and once you've done so, $E_{\text{val}}$ (or $E_{\text{cv}}$) is no longer a clean estimate of $E_{\text{out}}$. Testing is the final, untouched evaluation. If you've cross-validated to pick $\lambda$ and the cross-validation error was 0.05, that's the selection criterion, not an honest estimate of $E_{\text{out}}$ for the chosen $\lambda$. To honestly report $E_{\text{out}}$, you'd need a fresh held-out test set never used for selection. Many published papers conflate these.
How the Three Interact
| Violation | Effect |
|---|---|
| Occam's Razor (model too complex) | Overfitting: $E_{\text{out}} \gg E_{\text{in}}$, bound is loose |
| Sampling Bias (wrong distribution) | $E_{\text{out}}$ on deployment $\ne$ $E_{\text{out}}$ on the training distribution; bound silent |
| Data Snooping (test set used) | Reported $E_{\text{test}}$ is optimistic; true $E_{\text{out}}$ is worse than reported |
All three corrupt the inference chain from training to deployment. The discipline of the three principles is what turns the module’s machinery — VC, bias-variance, regularisation, validation — into honest learning.
Concepts Introduced This Week
- validation — held-out subset of $\mathcal{D}$ for estimating $E_{\text{out}}$ unbiasedly. Variance at most $1/(4K)$ for binary classification. The standard tool for picking hyperparameters.
- cross-validation — generalises validation to use all the data: LOOCV with $N$ folds (almost unbiased; $N$ trainings), $V$-fold with $V$ folds ($V = 10$ is the practical default).
- learning-principles — three meta-principles: Occam's Razor (simpler wins on $E_{\text{out}}$), Sampling Bias (wrong distribution kills every guarantee), Data Snooping (any test-set influence contaminates).
Connections
- Builds on week-10: regularisation produces a family of trained models indexed by $\lambda$; this week's validation/CV is how you actually choose $\lambda$. The two together complete the practical model-selection pipeline.
- Builds on week-09: the VC bound’s three assumptions (bounded complexity, i.i.d. data, independent test set) become the three learning principles. The principles are the practical-discipline counterpart to the bound’s mathematical hypotheses.
- Builds on generalization-bound: the validation generalisation bound has the same Hoeffding-union-bound structure but with a small $M$ (a finite candidate list), so the $\ln M$ term is benign — far tighter than the structural VC bound.
- Closes the module: with validation, regularisation, and the three principles, we now have everything needed to take a problem from “data on disk” to “trained model deployed responsibly”.
Open Questions
- The 1-SE rule: when the cross-validation curve has multiple $\lambda$'s within one standard error of the minimum, which to pick? The "1-SE rule" picks the largest $\lambda$ within 1 SE of the min, preferring slightly more regularisation for robustness. Heuristic, common in practice.
- Nested cross-validation: when you both select $\lambda$ via CV and evaluate via CV, the outer CV's error is not quite an unbiased estimate of $E_{\text{out}}$ for the chosen model. Nested CV (an outer loop for evaluation, an inner loop for selection) fixes this rigorously, but at a multiplicative training cost (outer folds times inner folds times grid size).
- Distribution shift in deployment: the i.i.d. assumption is the sampling-bias principle’s mathematical avatar, and it’s violated almost everywhere in production ML. Active research area: distributionally robust optimisation, domain adaptation, online learning.
- Implicit data snooping in benchmark culture: if many researchers test on the same benchmark and publish only the wins, the community-level reported error numbers are biased downward (optimistic), even though no individual experiment snooped. How to honestly evaluate a field's progress is genuinely hard.