For squared-error regression, the expected out-of-sample error decomposes as $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}(g^{(\mathcal{D})})] = \text{bias} + \text{var} + \sigma^2$, where bias measures how far the average hypothesis $\bar{g}$ deviates from the truth $f$, variance measures how much individual hypotheses $g^{(\mathcal{D})}$ fluctuate around $\bar{g}$, and $\sigma^2$ is irreducible noise. The model class $\mathcal{H}$ controls bias and variance in opposite directions — the source of the underfitting/overfitting trade-off.
The Decomposition
Let $g^{(\mathcal{D})}$ be the hypothesis fit to a particular training set $\mathcal{D}$ and the average hypothesis be
$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x)\big]$ — what you’d get if you averaged the trained models across infinitely many independently-drawn training sets. (Operationally: $\bar{g}(x) \approx \frac{1}{K}\sum_{k=1}^{K} g^{(\mathcal{D}_k)}(x)$ for $K$ datasets.)
Starting from the squared-error out-of-sample error $E_{\text{out}}(g^{(\mathcal{D})}) = \mathbb{E}_x\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]$ and adding/subtracting $\bar{g}(x)$ inside the square:

$$\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big] = \mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big] + \big(\bar{g}(x) - f(x)\big)^2 + 2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\big(\bar{g}(x) - f(x)\big)$$

The cross term vanishes: $\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big] = 0$ by definition of $\bar{g}$, and $\big(\bar{g}(x) - f(x)\big)$ doesn’t depend on $\mathcal{D}$.
Taking the expectation over $x$ as well:
$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g^{(\mathcal{D})})\big] = \text{bias} + \text{var}$$
with $\text{bias} = \mathbb{E}_x\big[(\bar{g}(x) - f(x))^2\big]$ and $\text{var} = \mathbb{E}_x\Big[\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big]\Big]$.
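A minimal numeric check, not from the source: the toy target $f(x) = x^2$, the least-squares line, and $N = 5$ points per dataset are arbitrary choices for illustration. The script estimates $\bar{g}$ operationally as above, then confirms that bias plus var matches the direct estimate of $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}]$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
N, K = 5, 10_000                    # points per dataset, number of datasets
grid = np.linspace(-1, 1, 201)      # test points approximating E_x[.]

preds = np.empty((K, grid.size))
for k in range(K):
    x = rng.uniform(-1, 1, N)                # one training set D_k
    a, b = np.polyfit(x, f(x), deg=1)        # g^(D_k): least-squares line
    preds[k] = a * grid + b

g_bar = preds.mean(axis=0)                   # g_bar ~ (1/K) sum_k g^(D_k)
bias = np.mean((g_bar - f(grid)) ** 2)       # E_x[(g_bar - f)^2]
var = np.mean(preds.var(axis=0))             # E_x[E_D[(g - g_bar)^2]]
e_out = np.mean((preds - f(grid)) ** 2)      # E_D[E_x[(g - f)^2]]
print(f"bias={bias:.4f} var={var:.4f} bias+var={bias+var:.4f} E_out={e_out:.4f}")
```

With the empirical $\bar{g}$ the two sides agree exactly, by the same algebra as above: the cross term sums to zero.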
Interpretation
Bias asks: how far from the truth is the typical hypothesis we’d learn? Low bias means $\mathcal{H}$ is rich enough that some hypothesis in $\mathcal{H}$ can closely approximate $f$, and the average of trained hypotheses is close to that approximation. High bias means $\mathcal{H}$ is structurally too simple — even the best hypothesis is far from $f$.
Variance asks: how much do individual hypotheses fluctuate from one training set to another? Low variance means the algorithm is stable: similar datasets produce similar hypotheses. High variance means the algorithm is fitting random structure in each dataset and producing wildly different hypotheses.
The dart-board picture summarises the four regimes:
| | Low variance | High variance |
|---|---|---|
| Low bias | Tight cluster on bullseye (ideal) | Scattered around bullseye (right model class but unstable) |
| High bias | Tight cluster off bullseye (consistent but wrong) | Scattered off bullseye (worst case) |
With Noise
If the target itself is noisy — $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2$ — the same algebra produces an extra term:
$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}\big] = \text{bias} + \text{var} + \sigma^2$$
The third term is irreducible noise: even if you knew $f$ exactly, predictions would still be wrong by $\sigma^2$ on average, because the labels themselves are noisy. No amount of data, model complexity, or algorithmic cleverness reduces it.
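A tiny sanity check of the floor (the target and $\sigma$ are arbitrary choices, not from the source): scoring the true $f$ itself against noisy labels gives a mean squared error of about $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma = 0.3
x = rng.uniform(-1, 1, 200_000)
y = f(x) + rng.normal(0.0, sigma, x.size)    # y = f(x) + eps, Var(eps) = sigma^2
print(np.mean((f(x) - y) ** 2), sigma ** 2)  # both ~0.09: the perfect model still pays sigma^2
```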
The Trade-Off
Bias and variance pull in opposite directions as $\mathcal{H}$ changes:
- More complex $\mathcal{H}$: better chance of approximating $f$ closely → bias down. But more parameters fit more random training noise → variance up.
- Simpler $\mathcal{H}$: fewer ways to fit noise → variance down. But the best hypothesis in $\mathcal{H}$ may be a poor approximation to $f$ → bias up.
A vivid example: target $f(x) = \sin(\pi x)$, $N = 2$ samples, two hypothesis sets:
- $\mathcal{H}_0$ (constant lines $h(x) = b$): bias $\approx 0.50$, var $\approx 0.25$, total $\approx 0.75$.
- $\mathcal{H}_1$ (sloped lines $h(x) = ax + b$): bias $\approx 0.21$, var $\approx 1.69$, total $\approx 1.90$.
The simpler model wins despite having higher bias — its variance is so much lower with only 2 samples that the total is smaller. With larger $N$ the picture flips: the biases are unchanged, but the variance of $\mathcal{H}_1$ shrinks as its slope estimate stabilises, so the totals approach the biases ($\approx 0.21$ vs $\approx 0.50$) and $\mathcal{H}_1$ wins.
The lesson: what wins depends on $N$. A complex model that overfits with little data may be the right choice with abundant data.
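A sketch of this experiment, assuming the standard setup ($x$ drawn uniformly from $[-1, 1]$, $\mathcal{H}_0$ fitting the best constant to the two points, $\mathcal{H}_1$ fitting the line through them); the Monte Carlo estimates should land near the quoted 0.50/0.25 and 0.21/1.69.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
grid = np.linspace(-1, 1, 201)   # test points approximating E_x[.]
K = 20_000                       # independent two-point training sets

h0 = np.empty((K, grid.size))
h1 = np.empty((K, grid.size))
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    h0[k] = y.mean()                          # H0: best constant for two points
    a = (y[1] - y[0]) / (x[1] - x[0])         # H1: slope of the interpolating line
    h1[k] = a * (grid - x[0]) + y[0]

for name, preds in (("H0", h0), ("H1", h1)):
    g_bar = preds.mean(axis=0)                # average hypothesis
    bias = np.mean((g_bar - f(grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"{name}: bias~{bias:.2f} var~{var:.2f} total~{bias + var:.2f}")
```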
Bias-Variance vs VC
Both frameworks analyse out-of-sample error, but they differ in:
| | VC analysis | Bias–variance analysis |
|---|---|---|
| Loss | 0–1 (binary error) | Squared error |
| Over what | Worst-case over $\mathcal{H}$ | Expectation over $\mathcal{D}$ |
| Bound | $E_{\text{out}} \le E_{\text{in}} + \Omega(N, \mathcal{H}, \delta)$, a high-probability inequality | $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = \text{bias} + \text{var}$, an exact equality |
| Distribution-free? | Yes | No (needs the data distribution) |
VC gives a uniform bound that holds with high probability; bias–variance gives a clean decomposition of the typical generalisation error and a conceptual picture of overfitting. The two complement rather than compete: VC dimension says whether learning generalises, bias–variance says what kind of error is left.
Practical Note
You can’t compute bias and variance from a single training set — they require expectation over $\mathcal{D}$, which means access to many datasets from the same distribution. So bias–variance is mostly a conceptual tool for designing algorithms (telling you whether more data, more capacity, or stronger regularisation is the right move), not a quantity you measure directly.
Diagnostics:
- High bias (underfitting): training and test error both high and similar.
- High variance (overfitting): training error low, test error much higher.
- Both: both errors high, with a large gap between them.
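The three patterns, encoded as a toy rule of thumb; the thresholds below are arbitrary, problem-dependent stand-ins, not part of the source.

```python
def diagnose(train_err: float, test_err: float,
             high: float = 0.15, gap: float = 0.05) -> str:
    """Map the (train error, test error) pattern to a bias/variance diagnosis."""
    if train_err > high and (test_err - train_err) <= gap:
        return "high bias: both errors high and similar"
    if train_err <= high and (test_err - train_err) > gap:
        return "high variance: large train/test gap"
    if train_err > high and (test_err - train_err) > gap:
        return "both: high errors and a large gap"
    return "no flag raised: near the achievable error for this setup"

print(diagnose(train_err=0.02, test_err=0.30))   # -> high variance
```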
Remedies:
| Diagnosis | Move |
|---|---|
| High bias | More features, more layers, less regularisation |
| High variance | More data, simpler model, more regularisation, ensembling |
| Irreducible noise floor | Accept it — the noise is the noise |
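To make the ensembling row concrete, here is a sketch reusing the sloped-line toy, in the idealised case where each ensemble member trains on its own fresh dataset (real bagging resamples one dataset, so the reduction is smaller): averaging $M$ members leaves bias alone and cuts variance roughly by $1/M$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
grid = np.linspace(-1, 1, 201)

def fit_line(rng):
    """One H1 hypothesis: the line through two fresh random points."""
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    a = (y[1] - y[0]) / (x[1] - x[0])
    return a * (grid - x[0]) + y[0]

for M in (1, 4, 16):                       # ensemble size
    preds = np.array([np.mean([fit_line(rng) for _ in range(M)], axis=0)
                      for _ in range(4_000)])
    g_bar = preds.mean(axis=0)
    bias = np.mean((g_bar - f(grid)) ** 2)  # stays ~0.21 for every M
    var = np.mean(preds.var(axis=0))        # shrinks roughly like 1/M
    print(f"M={M:2d}: bias~{bias:.2f} var~{var:.2f}")
```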
Related
- generalization-bound — the worst-case complement to this average-case analysis.
- vc-dimension — controls how bias and variance can simultaneously be small.
- ridge-regression — directly trades bias up for variance down via L2 penalty.
- learning-curve — visualises how bias and variance evolve with $N$.
Active Recall
Derive the bias–variance decomposition by inserting $\bar{g}(x)$ into $\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]$ and explain why the cross term vanishes.
Write $g^{(\mathcal{D})}(x) - f(x) = \big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big) + \big(\bar{g}(x) - f(x)\big)$. Squaring: the diagonal terms give variance and bias respectively. The cross term is $2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\,\big(\bar{g}(x) - f(x)\big)$, which is zero because $\mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(x)] = \bar{g}(x)$ by definition. The factor $\big(\bar{g}(x) - f(x)\big)$ pulls outside the expectation because it doesn’t depend on $\mathcal{D}$.
For target $f(x) = \sin(\pi x)$ with $N = 2$ samples, the constant-line hypothesis gives bias 0.5, variance 0.25; the sloped-line hypothesis gives bias 0.21, variance 1.69. Which has lower expected $E_{\text{out}}$ and what's the lesson?
$\mathcal{H}_0$: $0.5 + 0.25 = 0.75$; $\mathcal{H}_1$: $0.21 + 1.69 = 1.90$. The constant-line model wins despite having higher bias — its variance is so much smaller that the total error is lower. The lesson: with few samples, prefer simpler models even if they have higher bias. As $N$ grows, the variance of $\mathcal{H}_1$ shrinks (its slope estimate stabilises) and the more flexible model eventually wins. Optimal model complexity depends on $N$.
An algorithm has bias 0.05 and variance 0.30 on noisy data with $\sigma^2 = 0.10$. What's the expected $E_{\text{out}}$, and which term should you focus on reducing first?
$\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = 0.05 + 0.30 + 0.10 = 0.45$. Variance is the dominant reducible term, so focus on variance reduction: more data, simpler model, regularisation, or averaging multiple models (ensembling). Bias is small enough that simplifying further would only hurt. Noise is a floor — there’s no way to push $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}]$ below 0.10 without different (less noisy) data.
Why can you not, in general, measure bias and variance from a single training set, and what does this imply about how the decomposition is used in practice?
Bias and variance are defined as expectations over the dataset distribution — they require averaging the trained hypothesis $g^{(\mathcal{D})}$ over many independent draws of $\mathcal{D}$. With one fixed dataset, you have one $g^{(\mathcal{D})}$ and no way to estimate its expected behaviour. In practice this makes bias–variance a conceptual tool: you reason about whether your error pattern looks like high bias (under-flexible model, both errors high) or high variance (over-flexible, training error much lower than test) and choose the corresponding remedy. The actual bias and variance numbers stay implicit.