THE CRUX: Last week's regularisation gave us a family of trained models indexed by $\lambda$, but how do we actually pick the right $\lambda$? More broadly: with the regularisation parameter, model class, kernel, and other knobs all unknown, how do we choose between candidate models without the test set we promised never to touch? And once we've answered that: what general principles, beyond specific algorithms, govern whether a learning experiment was honest in the first place?
The two halves of week 11 each answer one. (1) Validation holds out a portion of the training data, trains on the rest, and uses the held-out part to estimate $E_{\text{out}}$ for each candidate model. The validation error is an unbiased estimate of $E_{\text{out}}$ with variance at most $1/(4K)$ for binary classification; large $K$ shrinks variance but starves the trained model. The standard rule $K = N/5$ splits the difference. After validation picks the winner, retrain on all $N$ examples to produce the final hypothesis. Cross-validation generalises validation to use all the data: leave-one-out trains $N$ times on $N-1$ examples each, $V$-fold partitions $\mathcal{D}$ into $V$ chunks ($V = 10$ is the practical default). (2) Three meta-principles wrap up the module's theory. Occam's Razor: simpler is better, where "simpler" means small VC dimension and "better" means lower $E_{\text{out}}$. Sampling Bias: training data from the wrong distribution → no theorem rescues you. Data Snooping: any influence of the test set on any decision contaminates it. Together they're the discipline that turns the module's machinery into honest learning.
Part 1: How Do You Actually Pick $\lambda$?
The Model Selection Problem
Last week we built regularisation: minimise the augmented error $E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\,\Omega(w)$ (weight decay: $\Omega(w) = w^\top w$) for some choice of $\lambda$. But "for some choice of $\lambda$" is doing heavy lifting. The optimal $\lambda$ depends on the noise level, the target complexity, and the sample size $N$, and the first two are exactly the quantities we don't observe.
More generally: we have $M$ candidate models — different hypothesis classes (linear, quadratic, kernel SVM, …), different regularisers, different $\lambda$ values, different feature transformations. How do we pick?
Three obvious-looking choices, all wrong:
- Pick by smallest $E_{\text{in}}$: always selects the most expressive class, since a richer $\mathcal{H}$ can always fit training data better. This is overfitting through the back door: minimising $E_{\text{in}}$ over the union $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of the union, and the most complex member always wins.
- Pick by smallest $E_{\text{test}}$: would give an unbiased generalisation estimate with $O(1/\sqrt{N_{\text{test}}})$ error, but the test set is locked in the boss's safe; using it for selection would be infeasible and cheating.
- Pick by intuition / cross-paper folklore: invisible bias accumulates; not a procedure.
Validation: A Held-Out Slice of Training Data
The fix is structural: split $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and a validation set $\mathcal{D}_{\text{val}}$ ($K$ examples). Train on $\mathcal{D}_{\text{train}}$ to get $g^-$. Then compute
$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{x_n \in \mathcal{D}_{\text{val}}} e\big(g^-(x_n),\, y_n\big).$$
Three error types now sit on a spectrum:
| Error | Source | Status |
|---|---|---|
| $E_{\text{in}}(g^-)$ | $\mathcal{D}_{\text{train}}$ | Feasible, contaminated (used for training) |
| $E_{\text{val}}(g^-)$ | $\mathcal{D}_{\text{val}}$ | Feasible, clean (held out before training) |
| $E_{\text{test}}$ | Test set | Infeasible, clean (locked away) |
The validation set is the on-hand simulation of the test set. Discipline is critical: once $\mathcal{D}_{\text{val}}$ is used to select something, it's no longer clean for any subsequent decision.
Mean and Variance
Two key statistical properties of $E_{\text{val}}(g^-)$:
Unbiased. $\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = E_{\text{out}}(g^-)$. The validation error is an unbiased estimate of the out-of-sample error. The proof is two lines of linearity of expectation, requiring only that $\mathcal{D}_{\text{val}}$ wasn't used to train $g^-$.
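A sketch of that argument, assuming each validation point is drawn i.i.d. from the data distribution and independently of $\mathcal{D}_{\text{train}}$:
$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} \mathbb{E}\big[e\big(g^-(x_n), y_n\big)\big] = \frac{1}{K}\sum_{x_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) = E_{\text{out}}(g^-).$$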
Variance shrinks as $1/K$. For i.i.d. validation samples,
$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\operatorname{Var}\big[e(g^-(x), y)\big] \le \frac{1}{4K},$$
and the bound saturates when per-example errors are Bernoulli(1/2), the worst case for binary classification, since $\operatorname{Var}[\text{Bernoulli}(p)] = p(1-p) \le \tfrac{1}{4}$.
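A quick numerical sanity check of the $1/(4K)$ bound, replacing the classifier with worst-case Bernoulli(1/2) per-example errors (purely illustrative; all values are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 100, 20_000          # validation-set size and number of repeated experiments

# Worst case for binary classification: each per-example error is Bernoulli(1/2).
errors = rng.integers(0, 2, size=(trials, K))   # 0/1 error indicators for K validation points
E_val = errors.mean(axis=1)                      # one E_val estimate per trial

print("empirical variance of E_val:", E_val.var())   # close to 0.0025
print("bound 1/(4K):               ", 1 / (4 * K))   # exactly  0.0025
```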
The K Trade-Off
Large $K$ is good for the variance bound but bad for the training set:
- Small $K$: $g^-$ trained on nearly $N$ examples is close to $g$. But $E_{\text{val}}$ has high variance; the estimate is noisy.
- Large $K$: $E_{\text{val}}$ is precise. But $g^-$ trained on only $N-K$ examples is much worse than the $g$ you'd actually deploy.
The expected $E_{\text{out}}$ as a function of $K$ has wide error bars at both extremes, with a sweet spot in the middle. Practical rule of thumb: $K = N/5$.
TIP — Always retrain on all $N$ before reporting
Validation chooses the model (the winning hypothesis class, $\lambda$, etc.). After that decision, retrain on $\mathcal{D}$ (all $N$ examples) to produce the final $g$. More data → better generalisation: $E_{\text{out}}(g) \le E_{\text{out}}(g^-)$. Reporting $g^-$ instead is leaving signal on the table.
Validation for Model Selection
The protocol:
- Split $\mathcal{D} = \mathcal{D}_{\text{train}} \cup \mathcal{D}_{\text{val}}$.
- For each candidate model $\mathcal{H}_m$ ($m = 1, \dots, M$): train $g^-_m$ on $\mathcal{D}_{\text{train}}$, compute $E_m = E_{\text{val}}(g^-_m)$.
- Pick $m^* = \arg\min_m E_m$.
- Retrain $\mathcal{H}_{m^*}$ on all of $\mathcal{D}$ to get $g_{m^*}$. Report it (see the sketch after this list).
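A minimal sketch of the four steps, assuming a toy 1-D regression task with polynomial candidate models and the $K = N/5$ split (the data, candidate list, and squared-error measure are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
x = rng.uniform(-1, 1, size=N)
y = np.sin(np.pi * x) + 0.2 * rng.normal(size=N)   # toy target plus noise

# Step 1: split D into D_train (N - K) and D_val (K), with K = N/5.
K = N // 5
perm = rng.permutation(N)
val_idx, train_idx = perm[:K], perm[K:]

def fit_poly(x, y, degree):
    """Least-squares polynomial fit; returns coefficients."""
    return np.polyfit(x, y, degree)

def sq_err(coeffs, x, y):
    return np.mean((np.polyval(coeffs, x) - y) ** 2)

# Step 2: candidate models are polynomial degrees 1..8; train g_m^- on D_train, score on D_val.
candidates = range(1, 9)
E_val = {m: sq_err(fit_poly(x[train_idx], y[train_idx], m), x[val_idx], y[val_idx])
         for m in candidates}

# Steps 3-4: pick the winner by validation error, then retrain it on ALL N examples.
m_star = min(E_val, key=E_val.get)
g_final = fit_poly(x, y, m_star)
print("chosen degree:", m_star, " E_val:", round(E_val[m_star], 4))
```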
The generalisation guarantee:
$$E_{\text{out}}(g_{m^*}) \;\le\; E_{\text{out}}(g^-_{m^*}) \;\le\; E_{\text{val}}(g^-_{m^*}) + O\!\left(\sqrt{\frac{\ln M}{K}}\right).$$
The $\ln M$ rather than $M$ is benign: even a thousand candidates costs only $\ln 1000 \approx 7$ in the numerator.
Why does selecting by $E_{\text{val}}$ work, while selecting by $E_{\text{in}}$ does not?
Selecting by $E_{\text{val}}$ uses an unbiased estimate of $E_{\text{out}}$ for each candidate: comparing apples to apples. Selecting by $E_{\text{in}}$ uses a biased estimate (the data was used to fit $g_m$, so $E_{\text{in}}(g_m)$ is downward-biased). Worse, the bias depends on hypothesis-set complexity: more expressive classes get more shrinkage, so the comparison systematically favours overcomplex models. Validation breaks the cycle by evaluating on data the algorithm hasn't seen.
Validation vs Regularisation
Both attack the same equation:
$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}.$$
- Regularisation estimates the overfit penalty directly (the augmented error $E_{\text{aug}}$ proxies the penalty term).
- Validation estimates $E_{\text{out}}$ directly (held-out data bypasses the penalty entirely).
Two complementary cures, two sides of the same equation. Regularisation gives you a family of trained models indexed by $\lambda$; validation picks the best $\lambda$ from that family. Together they're a complete model-selection pipeline.
Part 2: Cross-Validation — Use All The Data
Single-split validation has a frustrating constraint: data spent on $\mathcal{D}_{\text{val}}$ isn't available for training. Cross-validation removes the trade-off by using every example for both training and validation, just not at the same time.
Leave-One-Out
The extreme case $K = 1$: for each $n = 1, \dots, N$, hold out example $(x_n, y_n)$ alone and train $g^-_n$ on the remaining $N-1$ examples, giving the per-example error
$$e_n = e\big(g^-_n(x_n),\, y_n\big).$$
Average:
$$E_{\text{cv}} = \frac{1}{N} \sum_{n=1}^{N} e_n.$$
Theorem. $E_{\text{cv}}$ is an unbiased estimate of $\bar{E}_{\text{out}}(N-1)$, the expected out-of-sample error when training with $N-1$ examples. The proof factorises the expectation: $\mathbb{E}_{\mathcal{D}}[E_{\text{cv}}] = \frac{1}{N}\sum_n \mathbb{E}_{\mathcal{D}}[e_n] = \frac{1}{N}\sum_n \bar{E}_{\text{out}}(N-1) = \bar{E}_{\text{out}}(N-1)$.
For practical purposes, since $\bar{E}_{\text{out}}(N-1) \approx \bar{E}_{\text{out}}(N)$ for moderate $N$, practitioners say $E_{\text{cv}}$ is "almost unbiased" for $E_{\text{out}}(g)$.
The Computational Catch
LOOCV trains the model $N$ times. For a deep network, infeasible. Special case: linear regression has a closed-form LOOCV via the hat matrix:
$$e_n = \left(\frac{y_n - \hat{y}_n}{1 - H_{nn}}\right)^2,$$
where $H = X(X^\top X + \lambda I)^{-1} X^\top$ and $\hat{y} = Hy$. One ridge fit gives all $N$ leave-one-out errors. Useful trick, but it doesn't generalise beyond linear regression.
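A small numpy check of the shortcut, on synthetic data with an assumed ridge penalty $\lambda$ (dimensions and values are made up); the closed form should agree with a brute-force leave-one-out loop:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, lam = 50, 3, 0.1                   # sample size, dimension, ridge parameter (illustrative)
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.3 * rng.normal(size=N)

# Hat matrix for ridge: H = X (X^T X + lam I)^{-1} X^T, fitted once on all N points.
A_inv = np.linalg.inv(X.T @ X + lam * np.eye(d))
H = X @ A_inv @ X.T
y_hat = H @ y

# Closed-form leave-one-out squared errors: e_n = ((y_n - yhat_n) / (1 - H_nn))^2.
e_closed = ((y - y_hat) / (1 - np.diag(H))) ** 2

# Brute-force check: retrain N times, each time without example n.
e_loop = np.empty(N)
for n in range(N):
    mask = np.arange(N) != n
    w = np.linalg.solve(X[mask].T @ X[mask] + lam * np.eye(d), X[mask].T @ y[mask])
    e_loop[n] = (y[n] - X[n] @ w) ** 2

print(np.allclose(e_closed, e_loop))     # True: one fit reproduces all N leave-one-out errors
print("E_cv =", e_closed.mean())
```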
V-Fold: The Practical Compromise
Partition $\mathcal{D}$ into $V$ equal-size parts $\mathcal{D}_1, \dots, \mathcal{D}_V$. For each fold $v$, train on the other $V-1$ parts and validate on the $v$-th. Average the per-fold errors:
$$E_{\text{cv}} = \frac{1}{V} \sum_{v=1}^{V} E_{\text{val}}^{(v)}.$$
LOOCV is the special case $V = N$. Practical rule of thumb: $V = 10$.
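A from-scratch sketch of the $V$-fold loop for one fixed model, here a ridge fit with an arbitrary $\lambda$ on synthetic data (all specifics are illustrative):

```python
import numpy as np

def vfold_cv(X, y, V, fit, error):
    """Generic V-fold CV: partition indices into V folds, train on V-1 folds, validate on the rest."""
    idx = np.random.default_rng(0).permutation(len(y))
    folds = np.array_split(idx, V)
    errs = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        model = fit(X[train], y[train])
        errs.append(error(model, X[val], y[val]))
    return np.mean(errs)               # E_cv: average of the V per-fold validation errors

# Example usage with a ridge fit (lambda fixed at 0.1, purely illustrative).
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.3 * rng.normal(size=200)

ridge_fit = lambda X, y: np.linalg.solve(X.T @ X + 0.1 * np.eye(X.shape[1]), X.T @ y)
mse = lambda w, X, y: np.mean((X @ w - y) ** 2)
print("10-fold CV estimate:", vfold_cv(X, y, V=10, fit=ridge_fit, error=mse))
```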
ASIDE — Why 10-fold is the de facto standard
10-fold CV gives an estimate of $E_{\text{out}}$ that is statistically nearly identical to LOOCV (each training set has $0.9N$ examples, very close to $N-1$ for large $N$), but with only 10 trainings instead of $N$. The variance is also often lower than LOOCV's, because LOOCV's per-fold errors are highly correlated (any two of its training sets differ in only two examples). 10-fold strikes the bias-variance-compute trade-off well enough that you should usually start there and only deviate with a specific reason.
Cross-Validation for Hyperparameter Selection
The standard recipe for choosing $\lambda$ in ridge or lasso:
- Pick a grid of candidate $\lambda$'s, e.g. a logarithmically spaced grid.
- For each $\lambda$, compute $E_{\text{cv}}(\lambda)$ via 10-fold CV.
- Pick $\lambda^* = \arg\min_\lambda E_{\text{cv}}(\lambda)$.
- Retrain on all of $\mathcal{D}$ with $\lambda^*$. Report it (see the sketch after this list).
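One way to run this recipe in practice, sketched with scikit-learn's Ridge, KFold, and cross_val_score (the library choice, the grid, the synthetic data, and the squared-error scoring are assumptions of this sketch, not the lecture's):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=300)

lambdas = np.logspace(-4, 2, 13)              # step 1: candidate grid (illustrative)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Step 2: E_cv(lambda) via 10-fold CV; sklearn reports negated MSE, so flip the sign.
E_cv = {lam: -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                              scoring="neg_mean_squared_error").mean()
        for lam in lambdas}

lam_star = min(E_cv, key=E_cv.get)            # step 3: pick the minimiser
final_model = Ridge(alpha=lam_star).fit(X, y) # step 4: retrain on all of D with lambda*
print("lambda* =", lam_star, " E_cv =", round(E_cv[lam_star], 4))
```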
This is the practical answer to "how do you pick $\lambda$?", and it's been waiting since week 8, when ridge was first introduced.
Part 3: Three Principles for Honest Learning
The module’s theoretical machinery — VC bound, bias–variance, regularisation, validation — all rests on certain assumptions. The Tuesday lecture wraps up by formalising three meta-principles whose violation breaks every theorem.
Occam’s Razor — The Simplest Model That Fits Wins
The principle, paraphrased from William of Occam (1287–1347): entities must not be multiplied beyond necessity.
In ML terms: among hypotheses that fit the data, prefer the simplest. Two senses of “simple”:
- Simple hypothesis $h$: small complexity of the single object (few parameters, short description).
- Simple hypothesis set $\mathcal{H}$: small complexity of the class (small VC dimension, few effective hypotheses).
The two are related: a hypothesis drawn from a low-complexity set is automatically a low-complexity hypothesis. Both senses point toward the same practical advice: start linear, then ask whether the data is being over-modelled before adding capacity.
“Better” in this context means better $E_{\text{out}}$, not aesthetic elegance. The VC bound's complexity penalty is smaller for simpler classes, so when $E_{\text{in}}$ is comparable, simpler wins.
Sampling Bias — Wrong Distribution, No Recovery
If the data is sampled in a biased way, learning will produce a similarly biased outcome.
The VC bound's i.i.d. assumption is load-bearing: training and test data must come from the same joint distribution. Violate it and the bound is silent — there's no theorem that says $E_{\text{out}}$ on the deployment distribution is close to the error measured on a different one.
The lecture's philosophical version: studying hard for a maths exam and then being tested on English gives no guarantee about the English mark. The classic real-world example: the 1936 Literary Digest poll, drawn largely from telephone and car owners (a wealthier-than-average slice of the electorate), predicted Landon over Roosevelt in a landslide; Roosevelt won in a landslide. The mismatch between the polled distribution and the voting distribution made the prediction useless, regardless of how "well-trained" the model was on the (biased) sample.
Data Snooping — The Insidious One
If a data set has affected any step in the learning process, its ability to assess the outcome has been compromised.
The strongest of the three. Any influence of the test set on any decision — preprocessing, feature engineering, hyperparameter choice — counts. Once the test set has been "snooped," it's no longer an unbiased estimator of $E_{\text{out}}$.
The lecture's vivid example: a financial trading strategy whose preprocessing snooped the test period (test-period statistics used to normalise the data) shows an impressive cumulative profit in back-testing. The same strategy with proper normalisation (training-period statistics only) loses money. The "model" learned the test set's statistics, not signal.
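A sketch of where the leak happens, using a plain standardisation step on synthetic data (the pipeline and numbers are made up; the only point is which rows the mean and standard deviation come from):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 8))   # stand-in for a feature matrix over time
X_train, X_test = X[:800], X[800:]                    # chronological split: last 200 rows are the test period

# SNOOPING: statistics computed on ALL rows, including the test period.
mu_all, sd_all = X.mean(axis=0), X.std(axis=0)
X_test_snooped = (X_test - mu_all) / sd_all           # test data has leaked into preprocessing

# CLEAN: statistics computed on the training rows only, then applied to the test rows.
mu_tr, sd_tr = X_train.mean(axis=0), X_train.std(axis=0)
X_train_clean = (X_train - mu_tr) / sd_tr
X_test_clean = (X_test - mu_tr) / sd_tr               # the test set never influences any statistic
```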
The clean workflow:
- Split off $\mathcal{D}_{\text{test}}$ at the very start.
- Lock $\mathcal{D}_{\text{test}}$ in the safe. Don't normalise with it. Don't peek at it.
- Use $\mathcal{D}_{\text{train}}$ for everything (with internal validation/CV splits as needed).
- One evaluation on $\mathcal{D}_{\text{test}}$ at the end.
If you decide to “try one more thing” after that single test evaluation, the test set is now contaminated — for the next round, you’d need a fresh held-out set.
COMMON MISCONCEPTION — Validation = test
Validation and test sets play different roles. Validation is part of the training pipeline; you use it to select hyperparameters, and once you've done so, $E_{\text{val}}$ (or $E_{\text{cv}}$) is no longer a clean estimate of $E_{\text{out}}$. Testing is the final, untouched evaluation. If you've cross-validated to pick $\lambda$ and the cross-validation error was 0.05, that's the selection criterion, not an honest estimate of $E_{\text{out}}$ for the chosen $\lambda$. To honestly report $E_{\text{out}}$, you'd need a fresh held-out test set never used for selection. Many published papers conflate these.
How the Three Interact
| Violation | Effect |
|---|---|
| Occam's Razor (model too complex) | Overfitting: $E_{\text{out}} \gg E_{\text{in}}$, bound is loose |
| Sampling Bias (wrong distribution) | $E_{\text{out}}$ on deployment $\ne$ $E_{\text{out}}$ on the training distribution; bound silent |
| Data Snooping (test set used) | Reported $E_{\text{test}}$ is optimistic; true $E_{\text{out}}$ is worse than reported |
All three corrupt the inference chain from training to deployment. The discipline of the three principles is what turns the module’s machinery — VC, bias-variance, regularisation, validation — into honest learning.
Concepts Introduced This Week
- validation — held-out subset of $\mathcal{D}$ for estimating $E_{\text{out}}$ unbiasedly. Variance at most $1/(4K)$ for binary classification. The standard tool for picking hyperparameters.
- cross-validation — generalises validation to use all the data: LOOCV with $N$ folds (almost unbiased; $N$ trainings), $V$-fold with $V$ folds ($V = 10$ is the practical default).
- learning-principles — three meta-principles: Occam's Razor (simpler wins on $E_{\text{out}}$), Sampling Bias (wrong distribution kills every guarantee), Data Snooping (any test-set influence contaminates).
Connections
- Builds on week-10: regularisation produces a family of trained models indexed by $\lambda$; this week's validation/CV is how you actually choose $\lambda$. The two together complete the practical model-selection pipeline.
- Builds on week-09: the VC bound’s three assumptions (bounded complexity, i.i.d. data, independent test set) become the three learning principles. The principles are the practical-discipline counterpart to the bound’s mathematical hypotheses.
- Builds on generalization-bound: the validation generalisation bound has the same Hoeffding-union-bound structure but with a small $M$ (a finite candidate list), so the $\ln M$ term is benign — far tighter than the structural VC bound.
- Closes the module: with validation, regularisation, and the three principles, we now have everything needed to take a problem from “data on disk” to “trained model deployed responsibly”.
Open Questions
- The 1-SE rule: when the cross-validation curve has multiple $\lambda$'s within one standard error of the minimum, which to pick? The "1-SE rule" picks the largest $\lambda$ within 1 SE of the min, preferring slightly more regularisation for robustness. Heuristic, common in practice.
- Nested cross-validation: when you both select $\lambda$ via CV and evaluate via CV, the outer CV's error is not quite an unbiased estimate of $E_{\text{out}}$ for the chosen model. Nested CV (an outer loop for evaluation, an inner loop for selection) fixes this rigorously, but at a multiplicative training cost (outer folds times inner folds times grid size).
- Distribution shift in deployment: the i.i.d. assumption is the sampling-bias principle’s mathematical avatar, and it’s violated almost everywhere in production ML. Active research area: distributionally robust optimisation, domain adaptation, online learning.
- Implicit data snooping in benchmark culture: if many researchers test on the same benchmark and publish only the wins, the community-level reported error numbers are biased downward (optimistic), even though no individual experiment snooped. How to honestly evaluate a field's progress is genuinely hard.