For squared-error regression, the expected out-of-sample error decomposes as $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}(g^{(\mathcal{D})})] = \text{bias} + \text{var} + \sigma^2$, where bias measures how far the average hypothesis $\bar{g}$ deviates from the truth $f$, variance measures how much individual hypotheses $g^{(\mathcal{D})}$ fluctuate around $\bar{g}$, and $\sigma^2$ is irreducible noise. The model class $\mathcal{H}$ controls bias and variance in opposite directions — the source of the underfitting/overfitting trade-off.
The Decomposition
Let $g^{(\mathcal{D})}$ be the hypothesis fit to a particular training set $\mathcal{D}$ and the average hypothesis be
$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x)\big]$ — what you’d get if you averaged the trained models across infinitely many independently-drawn training sets. (Operationally: $\bar{g}(x) \approx \frac{1}{K}\sum_{k=1}^{K} g^{(\mathcal{D}_k)}(x)$ for $K$ datasets.)
Starting from the squared-error out-of-sample error $E_{\text{out}}(g^{(\mathcal{D})}) = \mathbb{E}_x\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]$ and adding/subtracting $\bar{g}(x)$ inside the square:

$$\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big] = \mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big] + \big(\bar{g}(x) - f(x)\big)^2 + 2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\big(\bar{g}(x) - f(x)\big)$$

The cross term vanishes: $\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big] = 0$ by definition of $\bar{g}$, and $\big(\bar{g}(x) - f(x)\big)$ doesn’t depend on $\mathcal{D}$.
Taking the expectation over $x$ as well:
$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g^{(\mathcal{D})})\big] = \text{bias} + \text{var}$$
with $\text{bias} = \mathbb{E}_x\big[(\bar{g}(x) - f(x))^2\big]$ and $\text{var} = \mathbb{E}_x\Big[\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big]\Big]$.
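A minimal numeric check, not from the source: the toy target $f(x) = x^2$, the least-squares line, and $N = 5$ points per dataset are arbitrary choices for illustration. The script estimates $\bar{g}$ operationally as above, then confirms that bias plus var matches the direct estimate of $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}]$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
N, K = 5, 10_000                    # points per dataset, number of datasets
grid = np.linspace(-1, 1, 201)      # test points approximating E_x[.]

preds = np.empty((K, grid.size))
for k in range(K):
    x = rng.uniform(-1, 1, N)                # one training set D_k
    a, b = np.polyfit(x, f(x), deg=1)        # g^(D_k): least-squares line
    preds[k] = a * grid + b

g_bar = preds.mean(axis=0)                   # g_bar ~ (1/K) sum_k g^(D_k)
bias = np.mean((g_bar - f(grid)) ** 2)       # E_x[(g_bar - f)^2]
var = np.mean(preds.var(axis=0))             # E_x[E_D[(g - g_bar)^2]]
e_out = np.mean((preds - f(grid)) ** 2)      # E_D[E_x[(g - f)^2]]
print(f"bias={bias:.4f} var={var:.4f} bias+var={bias+var:.4f} E_out={e_out:.4f}")
```

With the empirical $\bar{g}$ the two sides agree exactly, by the same algebra as above: the cross term sums to zero.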
Interpretation
Bias asks: how far from the truth is the typical hypothesis we’d learn? Low bias means $\mathcal{H}$ is rich enough that some hypothesis in $\mathcal{H}$ can closely approximate $f$, and the average of trained hypotheses is close to that approximation. High bias means $\mathcal{H}$ is structurally too simple — even the best hypothesis is far from $f$.
Variance asks: how much do individual hypotheses fluctuate from one training set to another? Low variance means the algorithm is stable: similar datasets produce similar hypotheses. High variance means the algorithm is fitting random structure in each dataset and producing wildly different hypotheses.
The dart-board picture summarises the four regimes:
| | Low variance | High variance |
|---|---|---|
| Low bias | Tight cluster on bullseye (ideal) | Scattered around bullseye (right model class but unstable) |
| High bias | Tight cluster off bullseye (consistent but wrong) | Scattered off bullseye (worst case) |
With Noise
If the target itself is noisy — $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$, $\text{Var}(\epsilon) = \sigma^2$ — the same algebra produces an extra term:
$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}\big] = \text{bias} + \text{var} + \sigma^2$$
The third term is irreducible noise: even if you knew $f$ exactly, predictions would still be wrong by $\sigma^2$ on average, because the labels themselves are noisy. No amount of data, model complexity, or algorithmic cleverness reduces it.
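A tiny sanity check of the floor (the target and $\sigma$ are arbitrary choices, not from the source): scoring the true $f$ itself against noisy labels gives a mean squared error of about $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
sigma = 0.3
x = rng.uniform(-1, 1, 200_000)
y = f(x) + rng.normal(0.0, sigma, x.size)    # y = f(x) + eps, Var(eps) = sigma^2
print(np.mean((f(x) - y) ** 2), sigma ** 2)  # both ~0.09: the perfect model still pays sigma^2
```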
The Trade-Off
Bias and variance pull in opposite directions as $\mathcal{H}$ changes:
- More complex $\mathcal{H}$: better chance of approximating $f$ closely → bias down. But more parameters fit more random training noise → variance up.
- Simpler $\mathcal{H}$: fewer ways to fit noise → variance down. But the best hypothesis in $\mathcal{H}$ may be a poor approximation to $f$ → bias up.
A vivid example: target $f(x) = \sin(\pi x)$, $N = 2$ samples, two hypothesis sets:
- $\mathcal{H}_0$ (constant lines $h(x) = b$): bias $\approx 0.50$, var $\approx 0.25$, total $\approx 0.75$.
- $\mathcal{H}_1$ (sloped lines $h(x) = ax + b$): bias $\approx 0.21$, var $\approx 1.69$, total $\approx 1.90$.
The simpler model wins despite having higher bias — its variance is so much lower with only 2 samples that the total is smaller. With larger $N$ the picture flips: the biases are unchanged, but the variance of $\mathcal{H}_1$ shrinks as its slope estimate stabilises, so the totals approach the biases ($\approx 0.21$ vs $\approx 0.50$) and $\mathcal{H}_1$ wins.
The lesson: what wins depends on $N$. A complex model that overfits with little data may be the right choice with abundant data.
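A sketch of this experiment, assuming the standard setup ($x$ drawn uniformly from $[-1, 1]$, $\mathcal{H}_0$ fitting the best constant to the two points, $\mathcal{H}_1$ fitting the line through them); the Monte Carlo estimates should land near the quoted 0.50/0.25 and 0.21/1.69.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
grid = np.linspace(-1, 1, 201)   # test points approximating E_x[.]
K = 20_000                       # independent two-point training sets

h0 = np.empty((K, grid.size))
h1 = np.empty((K, grid.size))
for k in range(K):
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    h0[k] = y.mean()                          # H0: best constant for two points
    a = (y[1] - y[0]) / (x[1] - x[0])         # H1: slope of the interpolating line
    h1[k] = a * (grid - x[0]) + y[0]

for name, preds in (("H0", h0), ("H1", h1)):
    g_bar = preds.mean(axis=0)                # average hypothesis
    bias = np.mean((g_bar - f(grid)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"{name}: bias~{bias:.2f} var~{var:.2f} total~{bias + var:.2f}")
```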
Bias-Variance vs VC
Both frameworks analyse out-of-sample error, but they differ in:
| | VC analysis | Bias–variance analysis |
|---|---|---|
| Loss | 0–1 (binary error) | Squared error |
| Over what | Worst-case over $\mathcal{H}$ | Expectation over $\mathcal{D}$ |
| Bound | $E_{\text{out}} \le E_{\text{in}} + \Omega(N, \mathcal{H}, \delta)$, a high-probability inequality | $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = \text{bias} + \text{var}$, an exact equality |
| Distribution-free? | Yes | No (needs the data distribution) |
VC gives a uniform bound that holds with high probability; bias–variance gives a clean decomposition of the typical generalisation error and a conceptual picture of overfitting. The two complement rather than compete: VC dimension says whether learning generalises, bias–variance says what kind of error is left.
Practical Note
You can’t compute bias and variance from a single training set — they require expectation over $\mathcal{D}$, which means access to many datasets from the same distribution. So bias–variance is mostly a conceptual tool for designing algorithms (telling you whether more data, more capacity, or stronger regularisation is the right move), not a quantity you measure directly.
Diagnostics:
- High bias (underfitting): training and test error both high and similar.
- High variance (overfitting): training error low, test error much higher.
- Both: both errors high, with a large gap between them.
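The three patterns, encoded as a toy rule of thumb; the thresholds below are arbitrary, problem-dependent stand-ins, not part of the source.

```python
def diagnose(train_err: float, test_err: float,
             high: float = 0.15, gap: float = 0.05) -> str:
    """Map the (train error, test error) pattern to a bias/variance diagnosis."""
    if train_err > high and (test_err - train_err) <= gap:
        return "high bias: both errors high and similar"
    if train_err <= high and (test_err - train_err) > gap:
        return "high variance: large train/test gap"
    if train_err > high and (test_err - train_err) > gap:
        return "both: high errors and a large gap"
    return "no flag raised: near the achievable error for this setup"

print(diagnose(train_err=0.02, test_err=0.30))   # -> high variance
```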
Remedies:
| Diagnosis | Move |
|---|---|
| High bias | More features, more layers, less regularisation |
| High variance | More data, simpler model, more regularisation, ensembling |
| Irreducible noise floor | Accept it — the noise is the noise |
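To make the ensembling row concrete, here is a sketch reusing the sloped-line toy, in the idealised case where each ensemble member trains on its own fresh dataset (real bagging resamples one dataset, so the reduction is smaller): averaging $M$ members leaves bias alone and cuts variance roughly by $1/M$.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
grid = np.linspace(-1, 1, 201)

def fit_line(rng):
    """One H1 hypothesis: the line through two fresh random points."""
    x = rng.uniform(-1, 1, 2)
    y = f(x)
    a = (y[1] - y[0]) / (x[1] - x[0])
    return a * (grid - x[0]) + y[0]

for M in (1, 4, 16):                       # ensemble size
    preds = np.array([np.mean([fit_line(rng) for _ in range(M)], axis=0)
                      for _ in range(4_000)])
    g_bar = preds.mean(axis=0)
    bias = np.mean((g_bar - f(grid)) ** 2)  # stays ~0.21 for every M
    var = np.mean(preds.var(axis=0))        # shrinks roughly like 1/M
    print(f"M={M:2d}: bias~{bias:.2f} var~{var:.2f}")
```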
Related
- generalization-bound — the worst-case complement to this average-case analysis.
- vc-dimension — controls how bias and variance can simultaneously be small.
- ridge-regression — directly trades bias up for variance down via L2 penalty.
- learning-curve — visualises how bias and variance evolve with $N$.
Active Recall
Derive the bias–variance decomposition by inserting $\bar{g}(x)$ into $\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]$ and explain why the cross term vanishes.
Write $g^{(\mathcal{D})}(x) - f(x) = \big(g^{(\mathcal{D})}(x) - \bar{g}(x)\big) + \big(\bar{g}(x) - f(x)\big)$. Squaring: the diagonal terms give variance and bias respectively. The cross term is $2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\,\big(\bar{g}(x) - f(x)\big)$, which is zero because $\mathbb{E}_{\mathcal{D}}[g^{(\mathcal{D})}(x)] = \bar{g}(x)$ by definition. The factor $\big(\bar{g}(x) - f(x)\big)$ pulls outside the expectation because it doesn’t depend on $\mathcal{D}$.
For target $f(x) = \sin(\pi x)$ with $N = 2$ samples, the constant-line hypothesis gives bias 0.5, variance 0.25; the sloped-line hypothesis gives bias 0.21, variance 1.69. Which has lower expected $E_{\text{out}}$ and what's the lesson?
$\mathcal{H}_0$: $0.5 + 0.25 = 0.75$; $\mathcal{H}_1$: $0.21 + 1.69 = 1.90$. The constant-line model wins despite having higher bias — its variance is so much smaller that the total error is lower. The lesson: with few samples, prefer simpler models even if they have higher bias. As $N$ grows, the variance of $\mathcal{H}_1$ shrinks (its slope estimate stabilises) and the more flexible model eventually wins. Optimal model complexity depends on $N$.
An algorithm has bias 0.05 and variance 0.30 on noisy data with $\sigma^2 = 0.10$. What's the expected $E_{\text{out}}$, and which term should you focus on reducing first?
$\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = 0.05 + 0.30 + 0.10 = 0.45$. Variance is the dominant reducible term, so focus on variance reduction: more data, simpler model, regularisation, or averaging multiple models (ensembling). Bias is small enough that simplifying further would only hurt. Noise is a floor — there’s no way to push $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}]$ below 0.10 without different (less noisy) data.
Why can you not, in general, measure bias and variance from a single training set, and what does this imply about how the decomposition is used in practice?
Bias and variance are defined as expectations over the dataset distribution — they require averaging the trained hypothesis $g^{(\mathcal{D})}$ over many independent draws of $\mathcal{D}$. With one fixed dataset, you have one $g^{(\mathcal{D})}$ and no way to estimate its expected behaviour. In practice this makes bias–variance a conceptual tool: you reason about whether your error pattern looks like high bias (under-flexible model, both errors high) or high variance (over-flexible, training error much lower than test) and choose the corresponding remedy. The actual bias and variance numbers stay implicit.