Splitting the dataset $\mathcal{D}$ into a training set $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and a validation set $\mathcal{D}_{\text{val}}$ ($K$ examples). Train on $\mathcal{D}_{\text{train}}$ to get $g^-$, then compute $E_{\text{val}}(g^-)$ — an unbiased estimate of $E_{\text{out}}(g^-)$ with variance at most $\frac{1}{4K}$ for binary classification. Used for model selection by training $M$ candidates, evaluating each on $\mathcal{D}_{\text{val}}$, and picking the minimum $E_{\text{val}}$.
Where Validation Sits
Three errors, three roles:
| Error | Computed from | Status |
|---|---|---|
| In-sample $E_{\text{in}}(g)$ | $\mathcal{D}$ (the training data) | Feasible (on hand), but contaminated — the algorithm already used it to select $g$. |
| Test $E_{\text{test}}(g)$ | a held-out test set | Unbiased but infeasible — locked away, never touched. |
| Validation $E_{\text{val}}(g^-)$ | $\mathcal{D}_{\text{val}}$ ($K$ held-out examples) | Feasible and clean — if $\mathcal{D}_{\text{val}}$ was held out before training. |
The validation set is the on-hand simulation of the test set. The discipline is strict: feed only $\mathcal{D}_{\text{train}}$ to the learning algorithm, never $\mathcal{D}_{\text{val}}$. Once $\mathcal{D}_{\text{val}}$ has been used to select hyperparameters, it’s no longer “clean” for any subsequent decision.
Mean and Variance of $E_{\text{val}}$
Validation error for a hypothesis $g^-$ trained on $\mathcal{D}_{\text{train}}$:

$$E_{\text{val}}(g^-) = \frac{1}{K} \sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\big(g^-(\mathbf{x}_n), y_n\big)$$

where $e\big(g^-(\mathbf{x}_n), y_n\big)$ is the pointwise error: $[\![\, g^-(\mathbf{x}_n) \neq y_n \,]\!]$ for classification, $\big(g^-(\mathbf{x}_n) - y_n\big)^2$ for regression.

Unbiasedness. $E_{\text{val}}(g^-)$ is an unbiased estimate of $E_{\text{out}}(g^-)$:

$$\mathbb{E}_{\mathcal{D}_{\text{val}}}\big[E_{\text{val}}(g^-)\big] = E_{\text{out}}(g^-)$$

Proof: by linearity, $\mathbb{E}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} \mathbb{E}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} E_{\text{out}}(g^-) = E_{\text{out}}(g^-)$.

Variance. With $K$ i.i.d. validation samples,

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K^2}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} \operatorname{Var}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{\sigma^2}{K}$$

For binary classification, the per-example error is Bernoulli with $p = E_{\text{out}}(g^-)$, so $\sigma^2 = p(1-p) \le \frac{1}{4}$, giving

$$\operatorname{Var}\big[E_{\text{val}}(g^-)\big] \le \frac{1}{4K}$$

As $K \to \infty$, the variance goes to zero and $E_{\text{val}}(g^-)$ converges to $E_{\text{out}}(g^-)$.
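A minimal simulation of the two facts above, assuming the per-example errors really are i.i.d. Bernoulli draws with $p = E_{\text{out}}(g^-) = 0.3$ (the hypothesis itself is abstracted away; $p$, $K$, and the trial count are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
p_out = 0.3          # assumed E_out(g^-) for the simulation
K = 50               # validation set size
trials = 100_000     # number of independent validation sets

# Each row is one validation set; E_val is the mean of its K Bernoulli errors.
errors = rng.random((trials, K)) < p_out
e_val = errors.mean(axis=1)

print("mean of E_val:", e_val.mean())   # ~ 0.300  (unbiased)
print("var  of E_val:", e_val.var())    # ~ p(1-p)/K = 0.0042
print("bound 1/(4K) :", 1 / (4 * K))    # 0.005, the worst-case bound
```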
The K Trade-Off
Bigger $K$ is good for the variance bound but bad for the trained model. The chain of approximations is:

$$E_{\text{out}}(g) \;\underset{\text{small } K}{\approx}\; E_{\text{out}}(g^-) \;\underset{\text{large } K}{\approx}\; E_{\text{val}}(g^-)$$

- Large $K$ (lots of validation data, little training): $E_{\text{val}}(g^-) \approx E_{\text{out}}(g^-)$ tightly, but $g^-$ is much worse than $g$ trained on all $N$ — the validation estimate is accurate but for a poor hypothesis.
- Small $K$: $E_{\text{out}}(g^-)$ is close to $E_{\text{out}}(g)$, but $E_{\text{val}}(g^-)$ has high variance — the estimate of $E_{\text{out}}$ is noisy.

The expected $E_{\text{out}}$ as a function of $K$ has wide error bars at both extremes, with a sweet spot in the middle. Practical rule of thumb: $K = N/5$ (an 80/20 split).
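A rough sketch of the trade-off on a toy problem, not from the notes: the target (a noisy sine), the noise level, $N = 100$, the degree-3 fit, and the 200 random splits per $K$ are all illustrative assumptions. Small $K$ gives a good $g^-$ but a noisy estimate; large $K$ gives a tight estimate of a worse $g^-$.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100

def sample(n):
    x = rng.uniform(-1, 1, n)
    y = np.sin(np.pi * x) + rng.normal(0, 0.3, n)
    return x, y

def e_out(coefs, n_test=20_000):
    # A large fresh sample approximates the true out-of-sample error.
    x, y = sample(n_test)
    return np.mean((np.polyval(coefs, x) - y) ** 2)

for K in (5, 20, 50, 80, 95):
    outs, gaps = [], []
    for _ in range(200):                             # average over random splits
        x, y = sample(N)
        g_minus = np.polyfit(x[K:], y[K:], deg=3)    # train on the N-K remaining points
        e_val = np.mean((np.polyval(g_minus, x[:K]) - y[:K]) ** 2)
        out = e_out(g_minus)
        outs.append(out)
        gaps.append(abs(e_val - out))
    print(f"K={K:2d}  E_out(g^-)={np.mean(outs):.3f}  "
          f"mean |E_val - E_out|={np.mean(gaps):.3f}")
```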
TIP — Why report $g$ (trained on all $N$) rather than $g^-$

After model selection picks the winner via validation, retrain on the full $N$ examples to produce $g$. The selected hypothesis class is fixed; using all the data gives a better fit. The bound $E_{\text{out}}(g) \le E_{\text{out}}(g^-)$ usually holds because more training data means lower expected $E_{\text{out}}$. Reporting $g^-$ is leaving signal on the table.
Validation for Model Selection
Given $M$ candidate hypothesis sets $\mathcal{H}_1, \dots, \mathcal{H}_M$ (different model classes, different $\lambda$ values, different kernels, etc.):

- Split $\mathcal{D}$ into $\mathcal{D}_{\text{train}}$ ($N-K$ examples) and $\mathcal{D}_{\text{val}}$ ($K$ examples).
- For each $m = 1, \dots, M$: train $g_m^-$ on $\mathcal{D}_{\text{train}}$ and compute $E_{\text{val}}(g_m^-)$.
- Pick the winner: $m^* = \arg\min_m E_{\text{val}}(g_m^-)$.
- Retrain $\mathcal{H}_{m^*}$ on the full dataset to get $g_{m^*}$ and report it.

The generalisation guarantee for the selected model:

$$E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\!\left(\sqrt{\frac{\ln M}{K}}\right)$$

The $O\big(\sqrt{\ln M / K}\big)$ term comes from a finite-bin Hoeffding union bound over the $M$ models — it’s analogous to the $\ln M$ dependence in week 8’s generalisation bound, but here $M$ is small (a finite list of candidates), so the term is benign.
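A hedged sketch of the recipe above, not taken from the notes: the candidate classes are polynomial fits of different degrees, the data is a noisy sine, and everything is plain numpy. The degrees, $N$, $K$, and the noise level are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
N, K = 100, 20                                  # 80/20 split
degrees = [1, 2, 3, 5, 8, 10]                   # the M = 6 candidate classes

x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)
x_tr, y_tr = x[:-K], y[:-K]                     # D_train: N - K examples
x_va, y_va = x[-K:], y[-K:]                     # D_val:   K examples

# Train each g_m^- on D_train only and score it on D_val.
e_val = {}
for d in degrees:
    g_minus = np.polyfit(x_tr, y_tr, deg=d)
    e_val[d] = np.mean((np.polyval(g_minus, x_va) - y_va) ** 2)

d_star = min(e_val, key=e_val.get)              # winner: argmin over E_val
g_final = np.polyfit(x, y, deg=d_star)          # retrain on all N examples, report g

print({d: round(e, 3) for d, e in e_val.items()})
print("selected degree:", d_star)
```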
Why Selection by $E_{\text{in}}$ Fails

If you select the model with the smallest $E_{\text{in}}$, you always pick the most expressive class:

- a larger hypothesis set is always preferred over a smaller one nested inside it (more flexibility);
- $\lambda = 0$ is always preferred over $\lambda > 0$ (no regularisation).

Reason: minimising $E_{\text{in}}$ over the union $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of that union, and the more expressive class wins. Selection by $E_{\text{in}}$ is overfitting through the back door — the algorithm will always reach for capacity it shouldn’t have.
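A small numeric illustration (setup assumed, not from the notes): with nested polynomial models, in-sample squared error can only go down as the degree grows, so selecting by the smallest $E_{\text{in}}$ always lands on the most expressive candidate.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 30)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, 30)

for d in (1, 3, 6, 10):
    coefs = np.polyfit(x, y, deg=d)
    e_in = np.mean((np.polyval(coefs, x) - y) ** 2)
    print(f"degree {d:2d}: E_in = {e_in:.4f}")   # non-increasing in d
```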
Why Selection by $E_{\text{test}}$ Is Infeasible (and “Cheating”)

The test set is locked away. If you peek at it during selection — even once — it’s no longer a test set; it’s now another validation set, and you’d need a fresh test set to assess the final model. Using $E_{\text{test}}$ for model selection is a form of data snooping that breaks every generalisation guarantee that depended on the test set being independent.
Validation vs Regularisation
Both regularisation and validation address the same equation:

$$E_{\text{out}}(h) = E_{\text{in}}(h) + \text{overfit penalty}$$

- Regularisation estimates the overfit penalty directly via the augmented error $E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h)$ — proxying the term that would otherwise be unobservable.
- Validation estimates $E_{\text{out}}(h)$ directly via held-out data — bypassing the penalty term entirely.

Two different sides of the same equation. They’re complementary: regularisation gives you a family of models indexed by $\lambda$; validation picks the best $\lambda$ from that family.
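A sketch of that complementarity under assumed choices (ridge-style weight decay on a degree-8 polynomial feature map, a noisy sine target, and a hand-picked grid of $\lambda$ values): regularisation supplies the family, validation picks its $\lambda$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, K, degree = 60, 12, 8

x = rng.uniform(-1, 1, N)
y = np.sin(np.pi * x) + rng.normal(0, 0.3, N)
Z = np.vander(x, degree + 1)                   # polynomial features, one row per example

Z_tr, y_tr = Z[:-K], y[:-K]                    # D_train: N - K rows
Z_va, y_va = Z[-K:], y[-K:]                    # D_val:   K rows

def ridge(Z, y, lam):
    # Minimiser of the augmented error: w = (Z^T Z + lam * I)^{-1} Z^T y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

lambdas = [0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0]   # the family indexed by lambda
e_val = {lam: np.mean((Z_va @ ridge(Z_tr, y_tr, lam) - y_va) ** 2)
         for lam in lambdas}

best = min(e_val, key=e_val.get)               # validation picks lambda
print({lam: round(v, 3) for lam, v in e_val.items()})
print("selected lambda:", best)
```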
Related
- cross-validation — generalises validation to use all the data for both training and validation.
- regularization — the other cure for overfitting; validation picks its hyperparameters.
- generalization-bound — the worst-case bound that motivates the need for validation.
- overfitting — the disease validation diagnoses.
- data-snooping — what you must not do with the validation/test sets.
Active Recall
Why is $E_{\text{val}}(g^-)$ an unbiased estimate of $E_{\text{out}}(g^-)$, and what assumption does the proof require?

By definition $E_{\text{val}}(g^-) = \frac{1}{K}\sum_{\mathbf{x}_n \in \mathcal{D}_{\text{val}}} e\big(g^-(\mathbf{x}_n), y_n\big)$. Taking expectation over the randomness of $\mathcal{D}_{\text{val}}$ and using linearity: $\mathbb{E}\big[E_{\text{val}}(g^-)\big] = \frac{1}{K}\sum_n \mathbb{E}\big[e\big(g^-(\mathbf{x}_n), y_n\big)\big] = \frac{1}{K}\sum_n E_{\text{out}}(g^-) = E_{\text{out}}(g^-)$. The proof requires $\mathcal{D}_{\text{val}}$ to be drawn i.i.d. from the same distribution as the test data, and that $\mathcal{D}_{\text{val}}$ was not used to train $g^-$ (otherwise $g^-$ depends on $\mathcal{D}_{\text{val}}$ and the expectation factorisation breaks).
Show that the variance of $E_{\text{val}}(g^-)$ for binary classification satisfies $\operatorname{Var}\big[E_{\text{val}}(g^-)\big] \le \frac{1}{4K}$, and explain what this implies for choosing $K$.

The per-example error is Bernoulli with parameter $p = E_{\text{out}}(g^-)$, so its variance is $p(1-p)$. By independence of the $K$ i.i.d. samples, $\operatorname{Var}\big[E_{\text{val}}(g^-)\big] = \frac{p(1-p)}{K} \le \frac{1}{4K}$ (the maximum of $p(1-p)$ on $[0,1]$ is $\frac{1}{4}$, achieved at $p = \frac{1}{2}$). Implication: increasing $K$ shrinks the variance of the validation estimate at rate $1/K$, but also shrinks the training set, hurting the trained $g^-$. The trade-off has a sweet spot — the rule of thumb is $K = N/5$.
After validation picks the best model class $\mathcal{H}_{m^*}$, why should you retrain on the full $N$ examples rather than reporting $g_{m^*}^-$?

$g_{m^*}^-$ was trained on only $N-K$ examples; $g_{m^*}$ uses all $N$. More data typically means better generalisation, so $E_{\text{out}}(g_{m^*}) \le E_{\text{out}}(g_{m^*}^-)$ — reporting $g_{m^*}^-$ leaves signal on the table. The validation set’s job was to choose which hypothesis class to use; once that decision is locked in, there’s no reason not to use the full data for the final fit.
A student picks the model with the smallest $E_{\text{in}}$ rather than the smallest $E_{\text{val}}$. Why does this systematically pick overcomplicated models?

Minimising $E_{\text{in}}$ over $\mathcal{H}_1 \cup \dots \cup \mathcal{H}_M$ pays the VC cost of the union (which is at least as expressive as its most complex member). The most expressive class always achieves the lowest $E_{\text{in}}$ — for example, $\lambda = 0$ beats any $\lambda > 0$ on training error, and a degree-1126 polynomial beats a degree-1. The student’s procedure is structurally biased toward overfitting. Validation breaks the cycle by evaluating each candidate on data the algorithm never saw — providing an unbiased estimate of $E_{\text{out}}$ for each.
The generalisation bound for validation-based selection is $E_{\text{out}}(g_{m^*}^-) \le E_{\text{val}}(g_{m^*}^-) + O\big(\sqrt{\ln M / K}\big)$. Why does only $\ln M$ (not $M$) appear, and why is this comforting in practice?

The $\ln M$ comes from a Hoeffding union bound over the $M$ candidates evaluated on the validation set: $\mathbb{P}\big[\exists\, m:\ |E_{\text{val}}(g_m^-) - E_{\text{out}}(g_m^-)| > \epsilon\big] \le 2M e^{-2\epsilon^2 K}$. Setting this equal to $\delta$ and solving for $\epsilon$ gives $\epsilon = \sqrt{\frac{1}{2K}\ln\frac{2M}{\delta}}$. The logarithm is comforting because it grows very slowly: doubling the candidate list adds only $\ln 2$ to the numerator. Even with thousands of candidates, the validation penalty is small compared to the validation error itself.
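A quick numeric check of how slowly that penalty grows, assuming $\delta = 0.05$ and $K = 200$ (both illustrative choices):

```python
import math

delta, K = 0.05, 200                     # assumed confidence level and validation size
for M in (1, 10, 100, 1_000, 10_000):
    eps = math.sqrt(math.log(2 * M / delta) / (2 * K))
    print(f"M = {M:6d}: validation penalty ~ {eps:.3f}")
```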