Repeating the validation procedure across multiple held-out subsets (“folds”) and averaging the per-fold validation errors. Leave-one-out CV (LOOCV) holds out one example at a time, training on $n-1$ examples and evaluating on 1, repeated $n$ times: $\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n e_i$. V-fold CV partitions the data into $V$ equal parts, trains on $V-1$ of them, validates on the remaining one, repeats $V$ times. Typical $V = 10$. $\hat e_{\text{cv}}$ is an (almost) unbiased estimate of $\bar E_{\text{out}}$, and dramatically lower-variance than a single train/val split.
The Motivation
Single-split validation forces a trade-off: a large validation set means a precise validation estimate but a poorly-trained $h$; a small one means a well-trained $h$ but a noisy validation estimate. Cross-validation circumvents the trade-off by using every example for both training and validation, just not at the same time.
The cost: more computation (training $V$ or $n$ times instead of once) — but one estimate of $E_{\text{out}}$ that’s both (almost) unbiased and low-variance.
Leave-One-Out Cross-Validation (LOOCV)
For each $i = 1, \dots, n$:
- Define $S_{-i} = S \setminus \{(x_i, y_i)\}$ — all data except example $i$.
- Train: $h_{-i} = \mathcal{A}(S_{-i})$.
- Compute the single-point error: $e_i = \ell(h_{-i}(x_i), y_i)$.
Average over all $n$ runs: $\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n e_i$.
Each $e_i$ is the validation error of a model trained on $n-1$ examples, evaluated on the single held-out one. LOOCV uses every example as both training (in $n-1$ runs) and validation (in 1 run).
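A minimal sketch of the loop, assuming a generic learning algorithm; `fit` and `loss` are placeholder callables, not anything defined in this note:

```python
import numpy as np

def loocv_error(X, y, fit, loss):
    """Leave-one-out CV error (sketch).

    fit(X, y)  -> fitted model exposing .predict(X)
    loss(t, p) -> scalar loss of prediction p against target t
    """
    n = len(y)
    errors = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i               # S_{-i}: all data except example i
        model = fit(X[mask], y[mask])          # h_{-i}, trained on n - 1 examples
        y_hat = model.predict(X[i:i + 1])[0]   # predict the single held-out point
        errors[i] = loss(y[i], y_hat)          # e_i
    return errors.mean()                       # ê_loocv = (1/n) Σ e_i
```

Any estimator following the usual fit/predict convention (e.g. scikit-learn's) can be plugged in, with `loss` a per-example loss such as squared error.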
Unbiasedness
Theorem. $\mathbb{E}[\hat e_{\text{loocv}}] = \bar E_{\text{out}}(n-1)$ — the expected out-of-sample error when training with $n-1$ points.
Proof. Linearity of expectation gives $\mathbb{E}[\hat e_{\text{loocv}}] = \frac{1}{n}\sum_{i=1}^n \mathbb{E}[e_i] = \mathbb{E}[e_1]$ (by symmetry, all $\mathbb{E}[e_i]$ are equal). Decompose: $\mathbb{E}[e_i] = \mathbb{E}_{S_{-i}}\big[\mathbb{E}_{(x_i, y_i)}[\ell(h_{-i}(x_i), y_i)]\big] = \mathbb{E}_{S_{-i}}[E_{\text{out}}(h_{-i})] = \bar E_{\text{out}}(n-1)$. The inner expectation is exactly the unbiasedness of the validation error for $h_{-i}$ (since $S_{-i}$ doesn’t contain example $i$, the held-out point is independent of $h_{-i}$); the outer expectation averages over the random training set of size $n-1$.
Almost unbiased for $\bar E_{\text{out}}(n)$: practitioners say $\hat e_{\text{loocv}}$ is “almost unbiased” for $\bar E_{\text{out}}(n)$ — strictly it estimates $\bar E_{\text{out}}(n-1)$, but for moderate $n$ the gap between $\bar E_{\text{out}}(n-1)$ and $\bar E_{\text{out}}(n)$ is tiny.
Variance
The variance of $\hat e_{\text{loocv}}$ is harder to analyse: the training sets overlap (each pair $S_{-i}, S_{-j}$ shares $n-2$ examples), so the $e_i$’s aren’t independent. In practice $\hat e_{\text{loocv}}$ has low variance — comparable to a much larger validation set — but it is not provably of order $1/n$, as it would be if the $n$ held-out errors were independent.
Disadvantage: $n$ Trainings
LOOCV requires training the model $n$ times. For a deep network with hours-long training, this is infeasible. Special case: linear regression has a closed-form LOOCV that costs the same as one fit. The hat matrix $H = X(X^\top X)^{-1}X^\top$ gives:
$\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{y_i - \hat y_i}{1 - H_{ii}}\right)^2$ — evaluating LOOCV by re-using the original fit on all $n$ points, scaled by the per-point leverage $H_{ii}$. This trick works for ridge regression but rarely generalises.
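A quick numerical check of the hat-matrix shortcut against the brute-force definition; the synthetic data and dimensions below are made up purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])       # design matrix with intercept
y = X @ rng.normal(size=d + 1) + 0.5 * rng.normal(size=n)

# One fit on all n points, plus the leverages H_ii.
beta = np.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta
H_diag = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)  # H_ii = x_i^T (X^T X)^{-1} x_i

closed_form = np.mean(((y - y_hat) / (1 - H_diag)) ** 2)

# Brute force: n separate fits, each leaving one example out.
errs = []
for i in range(n):
    m = np.arange(n) != i
    b = np.linalg.solve(X[m].T @ X[m], X[m].T @ y[m])
    errs.append((y[i] - X[i] @ b) ** 2)

assert np.isclose(closed_form, np.mean(errs))   # the two agree to numerical precision
```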
V-Fold Cross-Validation
Partition $S$ into $V$ equal-size parts $S_1, \dots, S_V$. For each fold $v = 1, \dots, V$:
- Train: $h_{-v} = \mathcal{A}(S \setminus S_v)$.
- Validate: compute $\hat e_v$, the average loss of $h_{-v}$ on $S_v$.
Average over folds: $\hat e_{\text{cv}} = \frac{1}{V}\sum_{v=1}^V \hat e_v$.
LOOCV is the special case $V = n$. Practical rule of thumb: $V = 10$ — a good balance between estimate quality and computational cost.
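A sketch of the V-fold loop in the same spirit as the LOOCV one above; again `fit` and `loss` are placeholders for whatever learning algorithm and batch-averaged loss you use:

```python
import numpy as np

def vfold_cv_error(X, y, fit, loss, V=10, seed=0):
    """V-fold CV error (sketch).

    fit(X, y)             -> fitted model exposing .predict(X)
    loss(y_true, y_pred)  -> average loss over a batch of predictions
    """
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)    # shuffle before splitting
    folds = np.array_split(idx, V)                      # S_1, ..., S_V (near-equal sizes)
    fold_errors = []
    for val_idx in folds:
        train_idx = np.setdiff1d(idx, val_idx)          # S \ S_v
        model = fit(X[train_idx], y[train_idx])         # h_{-v}
        fold_errors.append(loss(y[val_idx], model.predict(X[val_idx])))  # ê_v
    return np.mean(fold_errors)                         # ê_cv = (1/V) Σ ê_v
```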
Why V-Fold Is Usually Preferred Over LOOCV
| Aspect | LOOCV ($V = n$) | V-fold ($V \approx 10$) |
|---|---|---|
| Trainings | $n$ — often infeasible | $V$ — manageable |
| Training set size | $n - 1$ | $n(V-1)/V$ |
| Validation set size per fold | $1$ — high per-fold variance | $n/V$ — moderate |
| Bias as estimate of $\bar E_{\text{out}}(n)$ | Smaller (each fold uses $n-1$ examples) | Slightly larger (each fold uses $n(V-1)/V$ examples) |
| Variance | Often high (correlated $e_i$’s) | Moderate |
For most practical problems, $V = 10$ gives essentially the same error estimate as LOOCV at a fraction of the cost. Don’t trade $V$-fold for LOOCV unless you have a reason to.
Cross-Validation for Model Selection
Same protocol as single-split validation, but each candidate is scored by $\hat e_{\text{cv}}$ instead of $\hat e_{\text{val}}$:
- For each candidate model $h_m$ (or each hyperparameter value $\lambda_m$), compute $\hat e_{\text{cv}}(m)$.
- Pick $\hat m = \arg\min_m \hat e_{\text{cv}}(m)$.
- Retrain the chosen candidate on all $n$ examples and report it.
This is how you actually choose the regularisation parameter $\lambda$ for ridge or lasso — try a grid of values, score each via cross-validation, pick the minimum.
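A sketch of that grid search using scikit-learn (assumed available); the grid, fold count, and synthetic `X, y` are illustrative choices, not prescriptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
y = X[:, :3].sum(axis=1) + 0.5 * rng.normal(size=200)   # synthetic data for illustration

lambdas = np.logspace(-3, 3, 13)
cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Score each candidate lambda by 10-fold CV (negate "neg MSE" to get an error).
cv_errors = [
    -cross_val_score(Ridge(alpha=lam), X, y, cv=cv,
                     scoring="neg_mean_squared_error").mean()
    for lam in lambdas
]

best_lambda = lambdas[int(np.argmin(cv_errors))]    # argmin of ê_cv over the grid
final_model = Ridge(alpha=best_lambda).fit(X, y)    # retrain on all n examples
```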
Choosing $V$
| $V$ | Use |
|---|---|
| Small (e.g. $V = 5$) | Quick checks, large datasets |
| $V = 10$ | Default — strikes the balance |
| $V = n$ (LOOCV) | Small $n$, or when LOOCV has a closed form (e.g., linear regression) |
| Large $V < n$ | Specific computational constraints |
The choice rarely matters much for moderate $n$; the standard advice is to start with 10-fold and only deviate with reason.
Related
- validation — the single-split version that cross-validation generalises.
- regularization — the typical hyperparameter family that cross-validation chooses among.
- ridge-regression — the special case where LOOCV has a closed form via the hat matrix.
- generalization-bound — the worst-case theory that cross-validation refines empirically.
Active Recall
Define the leave-one-out cross-validation error and explain why it requires $n$ separate training runs.
$\hat e_{\text{loocv}} = \frac{1}{n}\sum_{i=1}^n \ell(h_{-i}(x_i), y_i)$, where $h_{-i}$ is the model trained on $S_{-i}$ — all data except example $i$. Each $h_{-i}$ is a different fit (different training set), so producing all $n$ values requires running the learning algorithm $n$ times. This is computationally expensive for any model whose training takes more than a few seconds. The exception is linear regression, where the LOOCV error has a closed-form expression in terms of the hat matrix that re-uses the single fit on all $n$ points.
Why is LOOCV said to be "almost unbiased" for $\bar E_{\text{out}}(n)$ rather than exactly unbiased?
Strictly, the theorem says $\mathbb{E}[\hat e_{\text{loocv}}] = \bar E_{\text{out}}(n-1)$ — the expected out-of-sample error when training with $n-1$ examples. The hypothesis we actually report, $h$, is trained on all $n$ examples, so its expected error is $\bar E_{\text{out}}(n)$, which is slightly smaller than $\bar E_{\text{out}}(n-1)$ (more data → less error on average). The gap is small for moderate $n$ — typically the difference is dominated by the noise in $\hat e_{\text{loocv}}$ itself — so practitioners treat LOOCV as essentially unbiased for $\bar E_{\text{out}}(n)$.
For 10-fold cross-validation on a dataset of $n = 1000$ examples, how many examples are used to train each fold's model, how many to validate, and how many total trainings are required?
$n/V = 1000/10 = 100$, so each fold has $100$ validation examples and $900$ training examples. The full procedure trains the model $10$ times, once per fold, each time on a different 900-example subset. The final $\hat e_{\text{cv}}$ is the average of 10 per-fold validation errors. After model selection picks the best candidate, you’d typically retrain once more on all 1000 examples to produce the final hypothesis.
Why is $V$-fold cross-validation usually preferred over LOOCV in practice?
Three reasons:
- Cost: V-fold requires $V$ trainings (typically 10), LOOCV requires $n$ (often hundreds to millions). For non-trivial models, LOOCV is infeasible.
- Variance: LOOCV’s per-point errors $e_i$ are highly correlated (each pair of training sets differs in only 2 examples), so its variance can paradoxically be higher than V-fold’s for some problems.
- Bias: V-fold’s training sets ($n(V-1)/V$ examples) are slightly smaller than LOOCV’s ($n-1$), but for moderate $n$ the bias gap is negligible.
The default recommendation is $V = 10$ — only switch to LOOCV when you have a closed-form trick (linear regression) or a specific reason.