Repeating the validation procedure across multiple held-out subsets (“folds”) and averaging the per-fold validation errors. Leave-one-out CV (LOOCV) holds out one example at a time, training on $n-1$ examples and evaluating on 1, repeated $n$ times: $\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n e_i$. V-fold CV partitions the data into $V$ equal parts, trains on $V-1$ of them, validates on the remaining one, repeats $V$ times. Typical $V = 10$. $\hat e_{\text{cv}}$ is an (almost) unbiased estimate of $\bar E_{\text{out}}$, and dramatically lower-variance than a single train/val split.
The Motivation
Single-split validation forces a trade-off: a large validation set means a precise validation estimate but a poorly-trained $h$; a small one means a well-trained $h$ but a noisy validation estimate. Cross-validation circumvents the trade-off by using every example for both training and validation, just not at the same time.
The cost: more computation (training $V$ or $n$ times instead of once) — but one estimate of $E_{\text{out}}$ that’s both (almost) unbiased and low-variance.
Leave-One-Out Cross-Validation (LOOCV)
For each $i = 1, \dots, n$:
- Define $S_{-i} = S \setminus \{(x_i, y_i)\}$ — all data except example $i$.
- Train: $h_{-i} = \mathcal{A}(S_{-i})$.
- Compute the single-point error: $e_i = \ell(h_{-i}(x_i), y_i)$.
Average over all $n$ runs: $\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n e_i$.
Each $e_i$ is the validation error of a model trained on $n-1$ examples, evaluated on the single held-out one. LOOCV uses every example as both training (in $n-1$ runs) and validation (in 1 run).
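A minimal sketch of the loop, assuming a generic learning algorithm; `fit` and `loss` are placeholder callables, not anything defined in this note:

```python
import numpy as np

def loocv_error(X, y, fit, loss):
    """Leave-one-out CV error (sketch).

    fit(X, y)  -> fitted model exposing .predict(X)
    loss(t, p) -> scalar loss of prediction p against target t
    """
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i               # S_{-i}: all data except example i
        model = fit(X[mask], y[mask])          # h_{-i}, trained on n - 1 examples
        y_hat = model.predict(X[i:i + 1])[0]   # predict the single held-out point
        errors[i] = loss(y[i], y_hat)          # e_i
    return errors.mean()                       # ê_loocv = (1/n) Σ e_i
```

Any estimator following the usual fit/predict convention (e.g. scikit-learn's) can be plugged in, with `loss` a per-example loss such as squared error.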
Unbiasedness
Theorem. $\mathbb{E}[\hat e_{\text{loocv}}] = \bar E_{\text{out}}(n-1)$ — the expected out-of-sample error when training with $n-1$ points.
Proof. Linearity of expectation gives $\mathbb{E}[\hat e_{\text{loocv}}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[e_i] = \mathbb{E}[e_1]$ (by symmetry, all $\mathbb{E}[e_i]$ are equal). Decompose: $\mathbb{E}[e_i] = \mathbb{E}_{S_{-i}}\big[\mathbb{E}_{(x_i, y_i)}[\ell(h_{-i}(x_i), y_i)]\big] = \mathbb{E}_{S_{-i}}[E_{\text{out}}(h_{-i})] = \bar E_{\text{out}}(n-1)$. The inner expectation is exactly the unbiasedness of the validation error for $h_{-i}$ (since $S_{-i}$ doesn’t contain example $i$, the held-out point is independent of $h_{-i}$); the outer expectation averages over the random training set of size $n-1$.
Almost unbiased for $\bar E_{\text{out}}(n)$: practitioners say $\hat e_{\text{loocv}}$ is “almost unbiased” for $\bar E_{\text{out}}(n)$ — strictly it estimates $\bar E_{\text{out}}(n-1)$, but for moderate $n$ the gap between $\bar E_{\text{out}}(n-1)$ and $\bar E_{\text{out}}(n)$ is tiny.
Variance
The variance of $\hat e_{\text{loocv}}$ is harder to analyse: the training sets overlap (each pair $S_{-i}, S_{-j}$ shares $n-2$ examples), so the $e_i$’s aren’t independent. In practice $\hat e_{\text{loocv}}$ has low variance — comparable to a much larger validation set — but it is not provably of order $1/n$, as it would be if the $n$ held-out errors were independent.
Disadvantage: $n$ Trainings
LOOCV requires training the model $n$ times. For a deep network with hours-long training, this is infeasible. Special case: linear regression has a closed-form LOOCV that costs the same as one fit. The hat matrix $H = X(X^\top X)^{-1}X^\top$ gives:
$\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - H_{ii}}\right)^2$ — evaluating LOOCV by re-using the original fit on all $n$ points, scaled by the per-point leverage $H_{ii}$. This trick works for ridge regression but rarely generalises.
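A quick numerical check of the hat-matrix shortcut against the brute-force definition; the synthetic data and dimensions below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])       # design matrix with intercept
y = X @ rng.normal(size=d + 1) + 0.5 * rng.normal(size=n)

# One fit on all n points, plus the leverages H_ii.
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
H_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)  # H_ii = x_i^T (X^T X)^{-1} x_i

closed_form = np.mean(((y - y_hat) / (1 - H_diag)) ** 2)

# Brute force: n separate fits, each leaving one example out.
errs = []
for i in range(n):
    m = np.arange(n) != i
    b = np.linalg.solve(X[m].T @ X[m], X[m].T @ y[m])
    errs.append((y[i] - X[i] @ b) ** 2)

assert np.isclose(closed_form, np.mean(errs))   # the two agree to numerical precision
```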
V-Fold Cross-Validation
Partition $S$ into $V$ equal-size parts $S_1, \dots, S_V$. For each fold $v = 1, \dots, V$:
- Train: $h_{-v} = \mathcal{A}(S \setminus S_v)$.
- Validate: compute $\hat e_v$, the average loss of $h_{-v}$ on $S_v$.
Average over folds: $\hat e_{\text{cv}} = \frac{1}{V}\sum_{v=1}^V \hat e_v$.
LOOCV is the special case $V = n$. Practical rule of thumb: $V = 10$ — a good balance between estimate quality and computational cost.
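A sketch of the V-fold loop in the same spirit as the LOOCV one above; again `fit` and `loss` are placeholders for whatever learning algorithm and batch-averaged loss you use:

```python
import numpy as np

def vfold_cv_error(X, y, fit, loss, V=10, seed=0):
    """V-fold CV error (sketch).

    fit(X, y)             -> fitted model exposing .predict(X)
    loss(y_true, y_pred)  -> average loss over a batch of predictions
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)    # shuffle before splitting
    folds = np.array_split(idx, V)                      # S_1, ..., S_V (near-equal sizes)
    fold_errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)          # S \ S_v
        model = fit(X[train_idx], y[train_idx])         # h_{-v}
        fold_errors.append(loss(y[val_idx], model.predict(X[val_idx])))  # ê_v
    return np.mean(fold_errors)                         # ê_cv = (1/V) Σ ê_v
```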
Why V-Fold Is Usually Preferred Over LOOCV
| Aspect | LOOCV ($V = n$) | V-fold ($V \approx 10$) |
|---|---|---|
| Trainings | $n$ — often infeasible | $V$ — manageable |
| Training set size | $n - 1$ | $n(V-1)/V$ |
| Validation set size per fold | $1$ — high per-fold variance | $n/V$ — moderate |
| Bias as estimate of $\bar E_{\text{out}}(n)$ | Smaller (each fold uses $n-1$ examples) | Slightly larger (each fold uses $n(V-1)/V$ examples) |
| Variance | Often high (correlated $e_i$’s) | Moderate |
For most practical problems, $V = 10$ gives essentially the same error estimate as LOOCV at a fraction of the cost. Don’t trade $V$-fold for LOOCV unless you have a reason to.
Cross-Validation for Model Selection
Same protocol as single-split validation, but each candidate is scored by $\hat e_{\text{cv}}$ instead of $\hat e_{\text{val}}$:
- For each candidate model $h_m$ (or each hyperparameter value $\lambda_m$), compute $\hat e_{\text{cv}}(m)$.
- Pick $\hat m = \arg\min_m \hat e_{\text{cv}}(m)$.
- Retrain the chosen candidate on all $n$ examples and report it.
This is how you actually choose the regularisation parameter $\lambda$ for ridge or lasso — try a grid of values, score each via cross-validation, pick the minimum.
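A sketch of that grid search using scikit-learn (assumed available); the grid, fold count, and synthetic `X, y` are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=200)   # synthetic data for illustration

lambdas = np.logspace(-3, 3, 13)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Score each candidate lambda by 10-fold CV (negate "neg MSE" to get an error).
cv_errors = [
    -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                     scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]

best_lambda = lambdas[int(np.argmin(cv_errors))]    # argmin of ê_cv over the grid
final_model = Ridge(alpha=best_lambda).fit(X, y)    # retrain on all n examples
```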
Choosing $V$
| $V$ | Use |
|---|---|
| Small (e.g. $V = 5$) | Quick checks, large datasets |
| $V = 10$ | Default — strikes the balance |
| $V = n$ (LOOCV) | Small $n$, or when LOOCV has a closed form (e.g., linear regression) |
| Large $V < n$ | Specific computational constraints |
The choice rarely matters much for moderate $n$; the standard advice is to start with 10-fold and only deviate with reason.
Related
- validation — the single-split version that cross-validation generalises.
- regularization — the typical hyperparameter family that cross-validation chooses among.
- ridge-regression — the special case where LOOCV has a closed form via the hat matrix.
- generalization-bound — the worst-case theory that cross-validation refines empirically.
Active Recall
Define the leave-one-out cross-validation error and explain why it requires $n$ separate training runs.
$\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n \ell(h_{-i}(x_i), y_i)$, where $h_{-i}$ is the model trained on $S_{-i}$ — all data except example $i$. Each $h_{-i}$ is a different fit (different training set), so producing all $n$ values requires running the learning algorithm $n$ times. This is computationally expensive for any model whose training takes more than a few seconds. The exception is linear regression, where the LOOCV error has a closed-form expression in terms of the hat matrix that re-uses the single fit on all $n$ points.
Why is LOOCV said to be "almost unbiased" for $\bar E_{\text{out}}(n)$ rather than exactly unbiased?
Strictly, the theorem says $\mathbb{E}[\hat e_{\text{loocv}}] = \bar E_{\text{out}}(n-1)$ — the expected out-of-sample error when training with $n-1$ examples. The hypothesis we actually report, $h$, is trained on all $n$ examples, so its expected error is $\bar E_{\text{out}}(n)$, which is slightly smaller than $\bar E_{\text{out}}(n-1)$ (more data → less error on average). The gap is small for moderate $n$ — typically the difference is dominated by the noise in $\hat e_{\text{loocv}}$ itself — so practitioners treat LOOCV as essentially unbiased for $\bar E_{\text{out}}(n)$.
For 10-fold cross-validation on a dataset of $n = 1000$ examples, how many examples are used to train each fold's model, how many to validate, and how many total trainings are required?
$n/V = 1000/10 = 100$, so each fold has $100$ validation examples and $900$ training examples. The full procedure trains the model $10$ times, once per fold, each time on a different 900-example subset. The final $\hat e_{\text{cv}}$ is the average of 10 per-fold validation errors. After model selection picks the best candidate, you’d typically retrain once more on all 1000 examples to produce the final hypothesis.
Why is $V$-fold cross-validation usually preferred over LOOCV in practice?
Three reasons:
- Cost: V-fold requires $V$ trainings (typically 10), LOOCV requires $n$ (often hundreds to millions). For non-trivial models, LOOCV is infeasible.
- Variance: LOOCV’s per-point errors $e_i$ are highly correlated (each pair of training sets differs in only 2 examples), so its variance can paradoxically be higher than V-fold’s for some problems.
- Bias: V-fold’s training sets ($n(V-1)/V$ examples) are slightly smaller than LOOCV’s ($n-1$), but for moderate $n$ the bias gap is negligible.
The default recommendation is $V = 10$ — only switch to LOOCV when you have a closed-form trick (linear regression) or a specific reason.