Splitting the dataset $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and a validation set $\mathcal{D}_{\text{val}}$ ($K$ examples). Train on $\mathcal{D}_{\text{train}}$ to get $g^-$, then compute $E_{\text{val}}(g^-)$ — an unbiased estimate of $E_{\text{out}}(g^-)$ with variance at most $\frac{1}{4K}$ for binary classification. Used for model selection by training $M$ candidates, evaluating each on $\mathcal{D}_{\text{val}}$, and picking the minimum $E_{\text{val}}$.
Where Validation Sits
Three errors, three roles:
| Error | Computed from | Status |
|---|---|---|
| In-sample $E_{\text{in}}(g)$ | $\mathcal{D}$ (the training data) | Feasible (on hand), but contaminated — the algorithm already used it to select $g$. |
| Test $E_{\text{test}}(g)$ | a held-out test set | Unbiased but infeasible — locked away, never touched. |
| Validation $E_{\text{val}}(g^-)$ | $\mathcal{D}_{\text{val}}$ ($K$ held-out examples) | Feasible and clean — if $\mathcal{D}_{\text{val}}$ was held out before training. |
The validation set is the on-hand simulation of the test set. The discipline is strict: feed only $\mathcal{D}_{\text{train}}$ to the learning algorithm, never $\mathcal{D}_{\text{val}}$. Once $\mathcal{D}_{\text{val}}$ has been used to select hyperparameters, it’s no longer “clean” for any subsequent decision.
Mean and Variance of $E_{\text{val}}$
Validation error for a hypothesis $g^-$ trained on $\mathcal{D}_{\text{train}}$:

$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\big(g^-(\mathbf{x}_n), y_n\big)$$

where $e\big(g^-(\mathbf{x}_n), y_n\big)$ is the pointwise error: $[\![\, g^-(\mathbf{x}_n) \neq y_n \,]\!]$ for classification, $\big(g^-(\mathbf{x}_n) - y_n\big)^2$ for regression.

Unbiasedness. $E_{\text{val}}(g^-)$ is an unbiased estimate of $E_{\text{out}}(g^-)$:

$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = E_{\text{out}}(g^-)$$

Proof: by linearity, $\mathbb{E}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} \mathbb{E}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) = E_{\text{out}}(g^-)$.

Variance. With $K$ i.i.d. validation samples,

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K^2}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} \operatorname{Var}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{\sigma^2}{K}$$

For binary classification, the per-example error is Bernoulli with $p = E_{\text{out}}(g^-)$, so $\sigma^2 = p(1-p) \le \frac{1}{4}$, giving

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] \le \frac{1}{4K}$$

As $K \to \infty$, the variance goes to zero and $E_{\text{val}}(g^-)$ converges to $E_{\text{out}}(g^-)$.
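A minimal simulation of the two facts above, assuming the per-example errors really are i.i.d. Bernoulli draws with $p = E_{\text{out}}(g^-) = 0.3$ (the hypothesis itself is abstracted away; $p$, $K$, and the trial count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_out = 0.3          # assumed E_out(g^-) for the simulation
K = 50               # validation set size
trials = 100_000     # number of independent validation sets

# Each row is one validation set; E_val is the mean of its K Bernoulli errors.
errors = rng.random((trials, K)) < p_out
e_val = errors.mean(axis=1)

print("mean of E_val:", e_val.mean())   # ~ 0.300  (unbiased)
print("var  of E_val:", e_val.var())    # ~ p(1-p)/K = 0.0042
print("bound 1/(4K) :", 1 / (4 * K))    # 0.005, the worst-case bound
```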
The K Trade-Off
Bigger $K$ is good for the variance bound but bad for the trained model. The chain of approximations is:

$$E_{\text{out}}(g) \;\underset{\text{small } K}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{large } K}{\approx}\; E_{\text{val}}(g^-)$$

- Large $K$ (lots of validation data, little training): $E_{\text{val}}(g^-) \approx E_{\text{out}}(g^-)$ tightly, but $g^-$ is much worse than $g$ trained on all $N$ — the validation estimate is accurate but for a poor hypothesis.
- Small $K$: $E_{\text{out}}(g^-)$ is close to $E_{\text{out}}(g)$, but $E_{\text{val}}(g^-)$ has high variance — the estimate of $E_{\text{out}}$ is noisy.

The expected $E_{\text{out}}$ as a function of $K$ has wide error bars at both extremes, with a sweet spot in the middle. Practical rule of thumb: $K = N/5$ (an 80/20 split).
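A rough sketch of the trade-off on a toy problem, not from the notes: the target (a noisy sine), the noise level, $N = 100$, the degree-3 fit, and the 200 random splits per $K$ are all illustrative assumptions. Small $K$ gives a good $g^-$ but a noisy estimate; large $K$ gives a tight estimate of a worse $g^-$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

def e_out(coefs, n_test=20_000):
    # A large fresh sample approximates the true out-of-sample error.
    x, y = sample(n_test)
    return np.mean((np.polyval(coefs, x) - y) ** 2)

for K in (5, 20, 50, 80, 95):
    outs, gaps = [], []
    for _ in range(200):                             # average over random splits
        x, y = sample(N)
        g_minus = np.polyfit(x[K:], y[K:], deg=3)    # train on the N-K remaining points
        e_val = np.mean((np.polyval(g_minus, x[:K]) - y[:K]) ** 2)
        out = e_out(g_minus)
        outs.append(out)
        gaps.append(abs(e_val - out))
    print(f"K={K:2d}  E_out(g^-)={np.mean(outs):.3f}  "
          f"mean |E_val - E_out|={np.mean(gaps):.3f}")
```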
TIP — Why report $g$ (trained on all $N$) rather than $g^-$

After model selection picks the winner via validation, retrain on the full $N$ examples to produce $g$. The selected hypothesis class is fixed; using all the data gives a better fit. The bound $E_{\text{out}}(g) \le E_{\text{out}}(g^-)$ usually holds because more training data means lower expected $E_{\text{out}}$. Reporting $g^-$ is leaving signal on the table.
Validation for Model Selection
Given $M$ candidate hypothesis sets $\mathcal{H}_1, \dots, \mathcal{H}_M$ (different model classes, different $\lambda$ values, different kernels, etc.):

- Split $\mathcal{D}$ into $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and $\mathcal{D}_{\text{val}}$ ($K$ examples).
- For each $m = 1, \dots, M$: train $g_m^-$ on $\mathcal{D}_{\text{train}}$ and compute $E_{\text{val}}(g_m^-)$.
- Pick the winner: $m^* = \arg\min_m E_{\text{val}}(g_m^-)$.
- Retrain $\mathcal{H}_{m^*}$ on the full dataset to get $g_{m^*}$ and report it.

The generalisation guarantee for the selected model:

$$E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right)$$

The $O\big(\sqrt{\ln M / K}\big)$ term comes from a finite-bin Hoeffding union bound over the $M$ models — it’s analogous to the $\ln M$ dependence in week 8’s generalisation bound, but here $M$ is small (a finite list of candidates), so the term is benign.
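A hedged sketch of the recipe above, not taken from the notes: the candidate classes are polynomial fits of different degrees, the data is a noisy sine, and everything is plain numpy. The degrees, $N$, $K$, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 100, 20                                  # 80/20 split
degrees = [1, 2, 3, 5, 8, 10]                   # the M = 6 candidate classes

x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)
x_tr, y_tr = x[:-K], y[:-K]                     # D_train: N - K examples
x_va, y_va = x[-K:], y[-K:]                     # D_val:   K examples

# Train each g_m^- on D_train only and score it on D_val.
e_val = {}
for d in degrees:
    g_minus = np.polyfit(x_tr, y_tr, deg=d)
    e_val[d] = np.mean((np.polyval(g_minus, x_va) - y_va) ** 2)

d_star = min(e_val, key=e_val.get)              # winner: argmin over E_val
g_final = np.polyfit(x, y, deg=d_star)          # retrain on all N examples, report g

print({d: round(e, 3) for d, e in e_val.items()})
print("selected degree:", d_star)
```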
Why Selection by $E_{\text{in}}$ Fails

If you select the model with the smallest $E_{\text{in}}$, you always pick the most expressive class:

- a larger hypothesis set is always preferred over a smaller one nested inside it (more flexibility);
- $\lambda = 0$ is always preferred over $\lambda > 0$ (no regularisation).

Reason: minimising $E_{\text{in}}$ over the union $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of that union, and the more expressive class wins. Selection by $E_{\text{in}}$ is overfitting through the back door — the algorithm will always reach for capacity it shouldn’t have.
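A small numeric illustration (setup assumed, not from the notes): with nested polynomial models, in-sample squared error can only go down as the degree grows, so selecting by the smallest $E_{\text{in}}$ always lands on the most expressive candidate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 30)

for d in (1, 3, 6, 10):
    coefs = np.polyfit(x, y, deg=d)
    e_in = np.mean((np.polyval(coefs, x) - y) ** 2)
    print(f"degree {d:2d}: E_in = {e_in:.4f}")   # non-increasing in d
```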
Why Selection by $E_{\text{test}}$ Is Infeasible (and “Cheating”)

The test set is locked away. If you peek at it during selection — even once — it’s no longer a test set; it’s now another validation set, and you’d need a fresh test set to assess the final model. Using $E_{\text{test}}$ for model selection is a form of data snooping that breaks every generalisation guarantee that depended on the test set being independent.
Validation vs Regularisation
Both regularisation and validation address the same equation:

$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$$

- Regularisation estimates the overfit penalty directly via the augmented error $E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h)$ — proxying the term that would otherwise be unobservable.
- Validation estimates $E_{\text{out}}(h)$ directly via held-out data — bypassing the penalty term entirely.

Two different sides of the same equation. They’re complementary: regularisation gives you a family of models indexed by $\lambda$; validation picks the best $\lambda$ from that family.
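A sketch of that complementarity under assumed choices (ridge-style weight decay on a degree-8 polynomial feature map, a noisy sine target, and a hand-picked grid of $\lambda$ values): regularisation supplies the family, validation picks its $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, degree = 60, 12, 8

x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)
Z = np.vander(x, degree + 1)                   # polynomial features, one row per example

Z_tr, y_tr = Z[:-K], y[:-K]                    # D_train: N - K rows
Z_va, y_va = Z[-K:], y[-K:]                    # D_val:   K rows

def ridge(Z, y, lam):
    # Minimiser of the augmented error: w = (Z^T Z + lam * I)^{-1} Z^T y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]   # the family indexed by lambda
e_val = {lam: np.mean((Z_va @ ridge(Z_tr, y_tr, lam) - y_va) ** 2)
         for lam in lambdas}

best = min(e_val, key=e_val.get)               # validation picks lambda
print({lam: round(v, 3) for lam, v in e_val.items()})
print("selected lambda:", best)
```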
Related
- cross-validation — generalises validation to use all the data for both training and validation.
- regularization — the other cure for overfitting; validation picks its hyperparameters.
- generalization-bound — the worst-case bound that motivates the need for validation.
- overfitting — the disease validation diagnoses.
- data-snooping — what you must not do with the validation/test sets.
Active Recall
Why is $E_{\text{val}}(g^-)$ an unbiased estimate of $E_{\text{out}}(g^-)$, and what assumption does the proof require?

By definition $E_{\text{val}}(g^-) = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\big(g^-(\mathbf{x}_n), y_n\big)$. Taking expectation over the randomness of $\mathcal{D}_{\text{val}}$ and using linearity: $\mathbb{E}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_n \mathbb{E}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{1}{K}\sum_n E_{\text{out}}(g^-) = E_{\text{out}}(g^-)$. The proof requires $\mathcal{D}_{\text{val}}$ to be drawn i.i.d. from the same distribution as the test data, and that $\mathcal{D}_{\text{val}}$ was not used to train $g^-$ (otherwise $g^-$ depends on $\mathcal{D}_{\text{val}}$ and the expectation factorisation breaks).
Show that the variance of $E_{\text{val}}(g^-)$ for binary classification satisfies $\operatorname{Var}\big[E_{\text{val}}(g^-)\big] \le \frac{1}{4K}$, and explain what this implies for choosing $K$.

The per-example error is Bernoulli with parameter $p = E_{\text{out}}(g^-)$, so its variance is $p(1-p)$. By independence of the $K$ i.i.d. samples, $\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{p(1-p)}{K} \le \frac{1}{4K}$ (the maximum of $p(1-p)$ on $[0,1]$ is $\frac{1}{4}$, achieved at $p = \frac{1}{2}$). Implication: increasing $K$ shrinks the variance of the validation estimate at rate $1/K$, but also shrinks the training set, hurting the trained $g^-$. The trade-off has a sweet spot — the rule of thumb is $K = N/5$.
After validation picks the best model class $\mathcal{H}_{m^*}$, why should you retrain on the full $N$ examples rather than reporting $g_{m^*}^-$?

$g_{m^*}^-$ was trained on only $N-K$ examples; $g_{m^*}$ uses all $N$. More data typically means better generalisation, so $E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-)$ — reporting $g_{m^*}^-$ leaves signal on the table. The validation set’s job was to choose which hypothesis class to use; once that decision is locked in, there’s no reason not to use the full data for the final fit.
A student picks the model with the smallest $E_{\text{in}}$ rather than the smallest $E_{\text{val}}$. Why does this systematically pick overcomplicated models?

Minimising $E_{\text{in}}$ over $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of the union (which is at least as expressive as its most complex member). The most expressive class always achieves the lowest $E_{\text{in}}$ — for example, $\lambda = 0$ beats any $\lambda > 0$ on training error, and a degree-1126 polynomial beats a degree-1. The student’s procedure is structurally biased toward overfitting. Validation breaks the cycle by evaluating each candidate on data the algorithm never saw — providing an unbiased estimate of $E_{\text{out}}$ for each.
The generalisation bound for validation-based selection is $E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\big(\sqrt{\ln M / K}\big)$. Why does only $\ln M$ (not $M$) appear, and why is this comforting in practice?

The $\ln M$ comes from a Hoeffding union bound over the $M$ candidates evaluated on the validation set: $\mathbb{P}\big[\exists\, m:\ |E_{\text{val}}(g_m^-) - E_{\text{out}}(g_m^-)| > \epsilon\big] \le 2M e^{-2\epsilon^2 K}$. Setting this equal to $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}}$. The logarithm is comforting because it grows very slowly: doubling the candidate list adds only $\ln 2$ to the numerator. Even with thousands of candidates, the validation penalty is small compared to the validation error itself.
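A quick numeric check of how slowly that penalty grows, assuming $\delta = 0.05$ and $K = 200$ (both illustrative choices):

```python
import math

delta, K = 0.05, 200                     # assumed confidence level and validation size
for M in (1, 10, 100, 1_000, 10_000):
    eps = math.sqrt(math.log(2 * M / delta) / (2 * K))
    print(f"M = {M:6d}: validation penalty ~ {eps:.3f}")
```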