Repeating the validation procedure across multiple held-out subsets (“folds”) and averaging the per-fold validation errors. Leave-one-out CV (LOOCV) holds out one example at a time, training on the other $N-1$ and evaluating on 1, repeated $N$ times: $E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N} e_n$. V-fold CV partitions $\mathcal{D}$ into $V$ equal parts, trains on $V-1$ of them, validates on the remaining one, and repeats $V$ times. Typical $V = 10$. $E_{\text{cv}}$ is an (almost) unbiased estimate of $E_{\text{out}}$, and dramatically lower-variance than a single train/val split.

The Motivation

Single-split validation forces a trade-off: a large validation set $K$ means a precise validation estimate but a poorly-trained $g^-$ (only $N-K$ training examples); a small $K$ means a well-trained $g^-$ but a noisy validation estimate. Cross-validation circumvents the trade-off by using every example for both training and validation, just not at the same time.

The cost: more computation (training $N$ or $V$ times instead of once), in exchange for one estimate of $E_{\text{out}}$ that is both (almost) unbiased and low-variance.

Leave-One-Out Cross-Validation (LOOCV)

For each $n = 1, \dots, N$:

  1. Define $\mathcal{D}_n = \mathcal{D} \setminus \{(\mathbf{x}_n, y_n)\}$: all data except example $n$.
  2. Train $g_n^-$ on $\mathcal{D}_n$.
  3. Compute the single-point error: $e_n = e\big(g_n^-(\mathbf{x}_n), y_n\big)$.

Average over all $N$ runs:

$$E_{\text{cv}} = \frac{1}{N} \sum_{n=1}^{N} e_n.$$

Each $e_n$ is the validation error of a model trained on $N-1$ examples, evaluated on the single held-out one. LOOCV uses every example as both training data (in $N-1$ runs) and validation data (in 1 run).
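As a concrete illustration, here is a minimal sketch of the LOOCV loop, assuming a least-squares linear model and squared error; the function name `loocv_error` and the toy data are illustrative, not from the source.

```python
import numpy as np

def loocv_error(X, y):
    """Plain LOOCV for least-squares regression with squared error.

    For each n: fit on all rows except n, evaluate on row n,
    then average the N single-point errors to get E_cv.
    """
    N = len(y)
    errors = np.empty(N)
    for n in range(N):
        mask = np.arange(N) != n                 # D_n = D minus example n
        w, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
        errors[n] = (X[n] @ w - y[n]) ** 2       # e_n on the held-out point
    return errors.mean()                         # E_cv = (1/N) * sum of e_n

# Toy usage: a noisy line, with a constant column for the intercept.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=30)
print("E_cv:", loocv_error(X, y))
```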

Unbiasedness

Theorem. $E_{\text{cv}}$ is an unbiased estimate of $\bar{E}_{\text{out}}(N-1)$: the expected out-of-sample error $E_{\text{out}}$ when training with $N-1$ points.

Proof. Linearity of expectation gives $\mathbb{E}_{\mathcal{D}}[E_{\text{cv}}] = \frac{1}{N}\sum_{n=1}^{N}\mathbb{E}_{\mathcal{D}}[e_n]$ (by symmetry, all the $\mathbb{E}_{\mathcal{D}}[e_n]$ are equal). Decompose: $\mathbb{E}_{\mathcal{D}}[e_n] = \mathbb{E}_{\mathcal{D}_n}\big[\mathbb{E}_{(\mathbf{x}_n, y_n)}\big[e\big(g_n^-(\mathbf{x}_n), y_n\big)\big]\big] = \mathbb{E}_{\mathcal{D}_n}\big[E_{\text{out}}(g_n^-)\big] = \bar{E}_{\text{out}}(N-1)$. The inner expectation is the unbiasedness of the validation error for $g_n^-$ (since $\mathcal{D}_n$ doesn't contain example $n$, the held-out point is a genuinely fresh test point); the outer expectation averages over the random training set of size $N-1$.

Almost unbiased for $E_{\text{out}}(g)$: practitioners say $E_{\text{cv}}$ is “almost unbiased” for the error of the final hypothesis $g$ trained on all $N$ points. Strictly it estimates $\bar{E}_{\text{out}}(N-1)$, but for moderate $N$ the gap between training on $N-1$ points and training on $N$ points is tiny.

Variance

The variance of $E_{\text{cv}}$ is harder to analyse: the training sets overlap (each pair $\mathcal{D}_n$, $\mathcal{D}_m$ shares $N-2$ examples), so the $e_n$'s aren't independent. In practice $E_{\text{cv}}$ has low variance, comparable to that of a much larger validation set, but its variance is not provably $\tfrac{1}{N}\operatorname{Var}[e_n]$, as it would be if the $e_n$'s were independent.
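One way to see the bias and variance claims empirically is a small simulation. The sketch below uses assumed toy data (a noisy line, not from the source) and compares the spread of a single 80/20 split estimate against LOOCV over many independently drawn datasets; the LOOCV mean stays close to the single-split mean while its standard deviation is noticeably smaller.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(N):
    # Toy target: y = 1.5x + 0.3 plus Gaussian noise.
    x = rng.uniform(-1, 1, size=N)
    X = np.column_stack([np.ones_like(x), x])
    y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=N)
    return X, y

def sq_err(Xtr, ytr, Xte, yte):
    # Fit least squares on (Xtr, ytr), return mean squared error on (Xte, yte).
    w, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return float(np.mean((Xte @ w - yte) ** 2))

single_split, loocv = [], []
for _ in range(500):                              # 500 independent datasets of size N = 30
    X, y = make_data(30)
    # (a) one fixed 80/20 split: train on 24 points, validate on 6
    single_split.append(sq_err(X[:24], y[:24], X[24:], y[24:]))
    # (b) LOOCV on the same 30 points
    errs = [sq_err(np.delete(X, n, axis=0), np.delete(y, n), X[[n]], y[[n]])
            for n in range(30)]
    loocv.append(float(np.mean(errs)))

print(f"single split: mean {np.mean(single_split):.4f}  std {np.std(single_split):.4f}")
print(f"LOOCV:        mean {np.mean(loocv):.4f}  std {np.std(loocv):.4f}")
```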

Disadvantage: $N$ Trainings

Computing $E_{\text{cv}}$ requires training the model $N$ times. For a deep network with hours-long training, this is infeasible. Special case: linear regression has a closed-form LOOCV that costs the same as one fit. The hat matrix $H = X(X^\top X)^{-1}X^\top$ gives:

$$E_{\text{cv}} = \frac{1}{N}\sum_{n=1}^{N}\left(\frac{\hat{y}_n - y_n}{1 - H_{nn}}\right)^{2},$$

evaluating LOOCV by re-using the original fit, with each residual rescaled by its per-point leverage $H_{nn}$. The same trick works for ridge regression but rarely generalises.
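A quick numerical check of the shortcut, assuming ordinary least squares on toy data (the data and variable names are illustrative); both routes should produce the same number up to floating-point error.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1, 1, size=40)
X = np.column_stack([np.ones_like(x), x])      # design matrix with an intercept column
y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=40)

# Closed form: one fit on all N points, then rescale residuals by the leverages H_nn.
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
y_hat = H @ y
loocv_closed = np.mean(((y_hat - y) / (1 - np.diag(H))) ** 2)

# Brute force: N separate fits, one per held-out point.
errs = []
for n in range(len(y)):
    Xn, yn = np.delete(X, n, axis=0), np.delete(y, n)
    w, *_ = np.linalg.lstsq(Xn, yn, rcond=None)
    errs.append((X[n] @ w - y[n]) ** 2)

print(loocv_closed, np.mean(errs))             # the two numbers agree
```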

V-Fold Cross-Validation

Partition $\mathcal{D}$ into $V$ equal-size parts $\mathcal{D}_1, \dots, \mathcal{D}_V$ (each of size $N/V$). For each fold $v = 1, \dots, V$:

  1. Train: $g_v^-$ on $\mathcal{D} \setminus \mathcal{D}_v$ (the other $V-1$ folds).
  2. Validate: compute $E_v = E_{\text{val}}(g_v^-)$ on $\mathcal{D}_v$.

Average over folds:

$$E_{\text{cv}} = \frac{1}{V} \sum_{v=1}^{V} E_v.$$

LOOCV is the special case $V = N$. Practical rule of thumb: $V = 10$, a good balance between estimate quality and computational cost.
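A minimal V-fold sketch in the same spirit as the LOOCV code above; `vfold_error`, the shuffling seed, and the toy data are assumptions for illustration.

```python
import numpy as np

def vfold_error(X, y, V=10, seed=0):
    """V-fold CV for least-squares regression with squared error."""
    idx = np.random.default_rng(seed).permutation(len(y))    # shuffle before splitting
    folds = np.array_split(idx, V)                            # V (nearly) equal parts
    fold_errors = []
    for v in range(V):
        val = folds[v]                                        # D_v: the validation fold
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        fold_errors.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(fold_errors))                        # E_cv = average of the V fold errors

# Toy usage on a noisy line; V = N recovers LOOCV.
rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 1.5 * x + 0.3 + rng.normal(scale=0.2, size=50)
print("10-fold E_cv:", vfold_error(X, y, V=10))
print("LOOCV   E_cv:", vfold_error(X, y, V=len(y)))
```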

Why V-Fold Is Usually Preferred Over LOOCV

| Aspect | LOOCV ($V = N$) | V-fold ($V = 10$) |
| --- | --- | --- |
| Trainings | $N$ (often infeasible) | $10$ (manageable) |
| Training set size | $N - 1$ | $0.9N$ |
| Validation set size per fold | $1$ (high per-fold variance) | $N/10$ (moderate) |
| Bias as estimate of $E_{\text{out}}$ | Smaller (each fold trains on $N-1$ points) | Slightly larger (each fold trains on $0.9N$ points) |
| Variance | Often high (correlated $e_n$'s) | Moderate |

For most practical problems, $V = 10$ gives essentially the same error estimate as LOOCV at a fraction of the cost. Don't trade 10-fold for LOOCV unless you have a reason to.

Cross-Validation for Model Selection

Same protocol as single-split validation, but each candidate is scored by $E_{\text{cv}}$ instead of $E_{\text{val}}$:

  1. For each candidate model $\mathcal{H}_m$ (or each hyperparameter value $\lambda_m$), compute $E_{\text{cv}}(m)$.
  2. Pick $m^{*} = \arg\min_{m} E_{\text{cv}}(m)$.
  3. Retrain the chosen model on all $N$ examples and report it.

This is how you actually choose the regularisation parameter $\lambda$ for ridge or lasso: try a grid of values, score each via cross-validation, and pick the minimum.
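A sketch of that protocol for ridge regression, assuming 10-fold CV, squared error, a hypothetical log-spaced grid of $\lambda$ values, and toy polynomial data; the helper names are illustrative.

```python
import numpy as np

def ridge_fit(X, y, lam):
    # Ridge solution: w = (X'X + lam*I)^{-1} X'y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def cv_score(X, y, lam, V=10, seed=0):
    """10-fold cross-validation estimate of squared error for one lambda."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(y)), V)
    errs = []
    for v in range(V):
        val = folds[v]
        train = np.concatenate([folds[u] for u in range(V) if u != v])
        w = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((X[val] @ w - y[val]) ** 2))
    return float(np.mean(errs))

# Toy data: degree-8 polynomial features of a noisy sine, so regularisation matters.
rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, size=60)
X = np.vander(x, 9, increasing=True)
y = np.sin(np.pi * x) + rng.normal(scale=0.3, size=60)

# Same seed for every lambda, so all candidates are scored on identical folds.
lambdas = 10.0 ** np.arange(-5, 3)
scores = [cv_score(X, y, lam) for lam in lambdas]
best = lambdas[int(np.argmin(scores))]
w_final = ridge_fit(X, y, best)                  # retrain on all N points with the chosen lambda
print("chosen lambda:", best, " E_cv:", min(scores))
```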

Choosing $V$

| $V$ | Use |
| --- | --- |
| $3$–$5$ | Quick checks, large datasets |
| $10$ | Default: strikes the balance between estimate quality and cost |
| $N$ (LOOCV) | Small $N$, or when LOOCV has a closed form (e.g., linear regression) |
| Large, but $< N$ | Specific computational constraints |

The choice rarely matters much for moderate $N$; the standard advice is to start with 10-fold and only deviate with good reason.

  • validation — the single-split version that cross-validation generalises.
  • regularization — the typical hyperparameter family that cross-validation chooses among.
  • ridge-regression — the special case where LOOCV has a closed form via the hat matrix.
  • generalization-bound — the worst-case theory that cross-validation refines empirically.

Active Recall