Splitting the dataset $\mathcal{D}$ of $N$ examples into a training set $\mathcal{D}_{\text{train}}$ ($N - K$ examples) and a validation set $\mathcal{D}_{\text{val}}$ ($K$ examples). Train on $\mathcal{D}_{\text{train}}$ to get $g^-$, then compute $E_{\text{val}}(g^-)$ — an unbiased estimate of $E_{\text{out}}(g^-)$ with variance at most $1/(4K)$ for binary classification. Used for model selection by training $M$ candidates, evaluating each on $\mathcal{D}_{\text{val}}$, and picking the minimum.

Where Validation Sits

Three errors, three roles:

| Error | Computed from | Status |
| --- | --- | --- |
| In-sample $E_{\text{in}}(g)$ | $\mathcal{D}$ | Feasible on hand, but contaminated — the algorithm already used it to select $g$. |
| Test $E_{\text{test}}(g)$ | held-out test set | Unbiased but infeasible — locked away, never touched. |
| Validation $E_{\text{val}}(g^-)$ | $\mathcal{D}_{\text{val}}$ | Feasible and clean — if $\mathcal{D}_{\text{val}}$ was held out before training. |

The validation set is the on-hand simulation of the test set. The discipline is strict: feed only $\mathcal{D}_{\text{train}}$ to the learning algorithm, never $\mathcal{D}_{\text{val}}$. Once $\mathcal{D}_{\text{val}}$ has been used to select hyperparameters, it’s no longer “clean” for any subsequent decision.

Mean and Variance of $E_{\text{val}}$

Validation error for a hypothesis $g^-$ trained on $\mathcal{D}_{\text{train}}$:

$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\big(g^-(\mathbf{x}_n),\, y_n\big)$$

where $e$ is the pointwise error: $e = \big[\!\big[\, g^-(\mathbf{x}_n) \neq y_n \,\big]\!\big]$ for classification, $e = \big(g^-(\mathbf{x}_n) - y_n\big)^2$ for regression.
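The definition above is a plain average of pointwise errors over the held-out points. A minimal sketch (function names are mine, not from the note):

```python
import numpy as np

def e_classification(y_pred, y):
    # pointwise binary error: 1 if the prediction disagrees with the label
    return (y_pred != y).astype(float)

def e_regression(y_pred, y):
    # pointwise squared error: (g^-(x) - y)^2
    return (y_pred - y) ** 2

def E_val(y_pred, y, e):
    # E_val(g^-) = (1/K) * sum of pointwise errors over the K validation points
    return float(np.mean(e(y_pred, y)))

# toy check: K = 4 validation points, one misclassified
y_true = np.array([1, -1, 1, 1])
y_hat = np.array([1, -1, -1, 1])
print(E_val(y_hat, y_true, e_classification))  # 0.25
```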

Unbiasedness. $E_{\text{val}}(g^-)$ is an unbiased estimate of $E_{\text{out}}(g^-)$:

$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\!\big[E_{\text{val}}(g^-)\big] = E_{\text{out}}(g^-)$$

Proof: by linearity, $\mathbb{E}\big[\tfrac{1}{K}\sum_{n} e(g^-(\mathbf{x}_n), y_n)\big] = \tfrac{1}{K}\sum_{n} \mathbb{E}\big[e(g^-(\mathbf{x}_n), y_n)\big] = \tfrac{1}{K}\sum_{n} E_{\text{out}}(g^-) = E_{\text{out}}(g^-)$.

Variance. With $K$ i.i.d. validation samples,

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{\sigma^2\big(e(g^-(\mathbf{x}), y)\big)}{K}$$

For binary classification, the per-example error is Bernoulli with $p = E_{\text{out}}(g^-)$, so $\sigma^2 = p(1-p) \le \tfrac{1}{4}$, giving

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] \le \frac{1}{4K}$$

As $K \to \infty$, the variance $\to 0$ and $E_{\text{val}}(g^-)$ converges to $E_{\text{out}}(g^-)$.
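Both claims are easy to check by simulation: model each pointwise error as a Bernoulli($p$) draw with $p = E_{\text{out}}(g^-)$ (the value of $p$ below is an arbitrary choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3          # assumed E_out(g^-): each validation point errs with prob p
K = 100          # validation set size
trials = 20000   # number of independent validation sets to simulate

# each row is one validation set of Bernoulli(p) pointwise errors
errors = rng.binomial(1, p, size=(trials, K))
E_vals = errors.mean(axis=1)          # one E_val per simulated set

print(abs(E_vals.mean() - p) < 0.01)  # True: E_val is unbiased for E_out
print(E_vals.var() <= 1 / (4 * K))    # True: variance respects the 1/(4K) bound
```

The empirical variance lands near $p(1-p)/K = 0.0021$, comfortably under the distribution-free bound $1/(4K) = 0.0025$.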

The K Trade-Off

Bigger $K$ is good for the variance bound but bad for the trained model. The chain of approximations is:

$$E_{\text{out}}(g) \;\approx\; E_{\text{out}}(g^-) \;\approx\; E_{\text{val}}(g^-)$$

where the first $\approx$ needs small $K$ (so $g^-$ trains on nearly all the data) and the second needs large $K$ (so the estimate has low variance).

  • Large $K$ (lots of validation data, little training): $E_{\text{val}}(g^-) \approx E_{\text{out}}(g^-)$ tightly, but $g^-$ is much worse than $g$ trained on all $N$ — the validation estimate is accurate but for a poor hypothesis.
  • Small $K$: $g^-$ is close to $g$, but $E_{\text{val}}(g^-)$ has high variance — the estimate of $E_{\text{out}}$ is noisy.

The expected $E_{\text{out}}$ as a function of $K$ has wide error bars at both extremes, with a sweet spot in the middle. Practical rule of thumb: $K = N/5$ (an 80/20 split).
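The rule of thumb is one line of index bookkeeping (helper name is mine):

```python
import numpy as np

def split(N, rng):
    # rule of thumb: K = N/5, i.e. hold out 20% for validation
    K = N // 5
    idx = rng.permutation(N)
    return idx[K:], idx[:K]   # (N - K) training indices, K validation indices

train_idx, val_idx = split(100, np.random.default_rng(1))
print(len(train_idx), len(val_idx))  # 80 20
```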

TIP — Why "report $g$ trained on all $N$" rather than $g^-$

After model selection picks $\mathcal{H}_{m^*}$ via validation, retrain on the full $N$ examples to produce $g_{m^*}$. The selected hypothesis class is fixed; using all the data gives a better fit. The bound $E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g^-_{m^*})$ usually holds because more training data means lower expected $E_{\text{out}}$. Reporting $g^-_{m^*}$ is leaving signal on the table.

Validation for Model Selection

Given $M$ candidate hypothesis sets $\mathcal{H}_1, \dots, \mathcal{H}_M$ (different model classes, different $\lambda$ values, different kernels, etc.):

  1. Split $\mathcal{D}$ into $\mathcal{D}_{\text{train}}$ ($N - K$ examples) and $\mathcal{D}_{\text{val}}$ ($K$ examples).
  2. For each $m$: train $g^-_m$ on $\mathcal{D}_{\text{train}}$ and compute $E_m = E_{\text{val}}(g^-_m)$.
  3. Pick the winner: $m^* = \arg\min_m E_m$.
  4. Retrain $g_{m^*}$ on the full dataset $\mathcal{D}$ and report it.
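The four steps above can be sketched end to end. This is an illustrative toy (polynomial degrees as the candidate hypothesis sets, a noisy quadratic target, all data invented here), not the note's own experiment:

```python
import numpy as np

rng = np.random.default_rng(2)

# toy data: noisy quadratic target
N = 100
x = rng.uniform(-1, 1, N)
y = 1 - 2 * x + 3 * x**2 + rng.normal(0, 0.3, N)

# 1. split D into D_train (N - K) and D_val (K), with K = N/5
K = N // 5
idx = rng.permutation(N)
tr, va = idx[K:], idx[:K]

# 2. for each candidate hypothesis set (here: polynomial degree), train
#    g_m^- on D_train and compute E_m = E_val(g_m^-)
degrees = [1, 2, 3, 5, 8]
E = {}
for d in degrees:
    g_minus = np.polyfit(x[tr], y[tr], d)
    E[d] = np.mean((np.polyval(g_minus, x[va]) - y[va]) ** 2)

# 3. pick the winner m* = argmin_m E_m
d_star = min(E, key=E.get)

# 4. retrain on the full dataset and report g_{m*}
g_final = np.polyfit(x, y, d_star)
print(d_star, round(E[d_star], 3))
```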

The generalisation guarantee for the selected model:

$$E_{\text{out}}(g^-_{m^*}) \le E_{\text{val}}(g^-_{m^*}) + O\!\left(\sqrt{\frac{\ln M}{K}}\right)$$

The $\ln M$ term comes from a finite-bin Hoeffding union bound over the $M$ models — it’s analogous to the $M$-dependence in week 8’s generalisation bound, but here $M$ is small (a finite list of candidates), so the term is benign.
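To see how benign the term is, plug in numbers. One standard instantiation of the Hoeffding union bound gives slack $\sqrt{\ln(2M/\delta)/(2K)}$ at confidence $1-\delta$; the explicit $2M/\delta$ constant is my assumption, not from the note:

```python
import numpy as np

def selection_slack(M, K, delta=0.05):
    # Hoeffding + union bound over M candidates, confidence 1 - delta:
    # E_out(g_{m*}^-) <= E_val(g_{m*}^-) + sqrt(ln(2M/delta) / (2K))
    return np.sqrt(np.log(2 * M / delta) / (2 * K))

for M in (1, 10, 100):
    print(M, round(selection_slack(M, K=200), 3))
# slack grows only logarithmically in M: 0.096, 0.122, 0.144
```

Going from 1 candidate to 100 costs only about five percentage points of slack, which is why a short finite list of models is cheap to select over.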

Why Selection by $E_{\text{in}}$ Fails

If you select the model with smallest $E_{\text{in}}$, you always pick the most expressive class:

  • the larger hypothesis set is always preferred over the smaller one (more flexibility);
  • $\lambda = 0$ is always preferred over $\lambda > 0$ (no regularisation).

Reason: minimising $E_{\text{in}}$ over the union $\mathcal{H}_1 \cup \cdots \cup \mathcal{H}_M$ pays the VC cost of the union, and the more expressive class wins. Selection by $E_{\text{in}}$ is overfitting through the back door — the algorithm will always reach for capacity it shouldn’t have.

Why Selection by $E_{\text{test}}$ Is Infeasible (and “Cheating”)

The test set is locked away. If you peek at it during selection — even once — it’s no longer a test set; it’s now another validation set, and you’d need a fresh test set to assess the final model. Using $E_{\text{test}}$ for model selection is a form of data snooping that breaks every generalisation guarantee that depended on the test set being independent.

Validation vs Regularisation

Both regularisation and validation address the same equation:

$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$$

  • Regularisation estimates the overfit penalty directly via the augmented error $E_{\text{aug}}$ — proxying the term that would otherwise be unobservable.
  • Validation estimates $E_{\text{out}}$ directly via held-out data — bypassing the penalty term entirely.

Two different sides of the same equation. They’re complementary: regularisation gives you a family of models indexed by $\lambda$; validation picks the best from that family.

  • cross-validation — generalises validation to use all the data for both training and validation.
  • regularization — the other cure for overfitting; validation picks its hyperparameters.
  • generalization-bound — the worst-case bound that motivates the need for validation.
  • overfitting — the disease validation diagnoses.
  • data-snooping — what you must not do with the validation/test sets.

Active Recall