For squared-error regression, the expected out-of-sample error decomposes as $\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g^{(\mathcal{D})})\big] = \text{bias} + \text{var} + \sigma^2$, where bias measures how far the average hypothesis $\bar{g}$ deviates from the truth $f$, variance measures how much individual hypotheses $g^{(\mathcal{D})}$ fluctuate around $\bar{g}$, and $\sigma^2$ is irreducible noise. The model class $\mathcal{H}$ controls bias and variance in opposite directions; this is the source of the underfitting/overfitting trade-off.

The Decomposition

Let $g^{(\mathcal{D})}$ be the hypothesis fit to a particular training set $\mathcal{D}$ and the average hypothesis be

$$\bar{g}(x) = \mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x)\big],$$

which is what you’d get if you averaged the trained models across infinitely many independently-drawn training sets. (Operationally: $\bar{g}(x) \approx \tfrac{1}{K}\sum_{k=1}^{K} g^{(\mathcal{D}_k)}(x)$ for $K$ datasets $\mathcal{D}_1, \dots, \mathcal{D}_K$.)

Starting from the squared-error out-of-sample error $E_{\text{out}}(g^{(\mathcal{D})}) = \mathbb{E}_x\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]$, fix $x$, take the expectation over $\mathcal{D}$, and add/subtract $\bar{g}(x)$ inside the square:

$$\mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - f(x))^2\big]
= \mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x) + \bar{g}(x) - f(x))^2\big]
= \mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big] + \big(\bar{g}(x) - f(x)\big)^2.$$

The cross term $2\,\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big]\,\big(\bar{g}(x) - f(x)\big)$ vanishes: $\mathbb{E}_{\mathcal{D}}\big[g^{(\mathcal{D})}(x) - \bar{g}(x)\big] = 0$ by definition of $\bar{g}$, and $\bar{g}(x) - f(x)$ doesn’t depend on $\mathcal{D}$.

Taking the expectation over $x$ as well:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g^{(\mathcal{D})})\big] = \mathbb{E}_x\big[\text{bias}(x) + \text{var}(x)\big] = \text{bias} + \text{var}$$

with

$$\text{bias}(x) = \big(\bar{g}(x) - f(x)\big)^2, \qquad \text{var}(x) = \mathbb{E}_{\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar{g}(x))^2\big].$$
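These expectations can be estimated by brute force: simulate the whole learning process on many independently drawn datasets and average. A minimal sketch, assuming a toy $\sin(\pi x)$ target, inputs uniform on $[-1, 1]$, and a learner passed in as a function (all of these choices are illustrative, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)                     # assumed toy target

def estimate_bias_var(fit_predict, N, K=5_000, n_test=201):
    """Monte Carlo estimate of (bias, var) for one learning algorithm.

    fit_predict(x_train, y_train, x_test) -> predictions on x_test.
    Simulates K independent training sets of N points each.
    """
    x_test = np.linspace(-1, 1, n_test)
    preds = np.empty((K, n_test))
    for k in range(K):
        x = rng.uniform(-1, 1, size=N)              # fresh dataset D_k
        preds[k] = fit_predict(x, f(x), x_test)     # g^(D_k) evaluated on test points
    g_bar = preds.mean(axis=0)                      # average hypothesis g_bar(x)
    bias = np.mean((g_bar - f(x_test)) ** 2)        # E_x[(g_bar(x) - f(x))^2]
    var = np.mean(preds.var(axis=0))                # E_x[E_D[(g(x) - g_bar(x))^2]]
    return bias, var

# Usage: a learner that fits a constant by least squares (the mean of y).
constant_fit = lambda x, y, x_test: np.full_like(x_test, y.mean())
print(estimate_bias_var(constant_fit, N=2))
```

The same scaffold works for any learner; only `fit_predict` changes.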

Interpretation

Bias asks: how far from the truth is the typical hypothesis we’d learn? Low bias means $\mathcal{H}$ is rich enough that some hypothesis in $\mathcal{H}$ can closely approximate $f$, and the average $\bar{g}$ of trained hypotheses is close to that approximation. High bias means $\mathcal{H}$ is structurally too simple: even the best hypothesis in $\mathcal{H}$ is far from $f$.

Variance asks: how much do individual hypotheses fluctuate from one training set to another? Low variance means the algorithm is stable: similar datasets produce similar hypotheses. High variance means the algorithm is fitting random structure in each dataset and producing wildly different hypotheses.

The dart-board picture summarises the four regimes:

|  | Low variance | High variance |
| --- | --- | --- |
| Low bias | Tight cluster on bullseye (ideal) | Scattered around bullseye (right model class but unstable) |
| High bias | Tight cluster off bullseye (consistent but wrong) | Scattered off bullseye (worst case) |

With Noise

If the target itself is noisy, $y = f(x) + \epsilon$ with $\mathbb{E}[\epsilon] = 0$ and $\operatorname{Var}(\epsilon) = \sigma^2$, the same algebra produces an extra term:

$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}(g^{(\mathcal{D})})\big] = \text{bias} + \text{var} + \sigma^2.$$

The third term is irreducible noise: even if you knew $f$ exactly, predictions would still be off by $\sigma^2$ on average in squared error, because the labels themselves are noisy. No amount of data, model complexity, or algorithmic cleverness reduces it.
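A quick numerical sanity check of this floor, under illustrative assumptions (a $\sin(\pi x)$ target and Gaussian noise with $\sigma = 0.3$): even predicting with the true $f$ leaves a mean squared error of about $\sigma^2$.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3
x = rng.uniform(-1, 1, size=1_000_000)
y = np.sin(np.pi * x) + sigma * rng.normal(size=x.size)   # noisy labels y = f(x) + eps

mse_of_truth = np.mean((np.sin(np.pi * x) - y) ** 2)       # error of the perfect predictor f
print(f"MSE of f itself: {mse_of_truth:.3f}  (sigma^2 = {sigma**2:.3f})")
```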

The Trade-Off

Bias and variance pull in opposite directions as $\mathcal{H}$ changes:

  • More complex $\mathcal{H}$: better chance of approximating $f$ closely → bias down. But more parameters fit more random training noise → variance up.
  • Simpler $\mathcal{H}$: fewer ways to fit noise → variance down. But the best hypothesis in $\mathcal{H}$ may be a poor approximation to $f$ → bias up.

A vivid example: target $f(x) = \sin(\pi x)$, $N = 2$ samples, two hypothesis sets:

  • $\mathcal{H}_0$ (constant lines, $h(x) = b$): bias $\approx 0.50$, var $\approx 0.25$, total $\approx 0.75$.
  • $\mathcal{H}_1$ (sloped lines, $h(x) = ax + b$): bias $\approx 0.21$, var $\approx 1.69$, total $\approx 1.90$.

The simpler model wins despite having higher bias: its variance is so much lower with only 2 samples that the total is smaller. With many more samples the picture flips: the biases are unchanged, but the variance of $\mathcal{H}_1$ drops sharply as $N$ grows, and $\mathcal{H}_1$ wins because its lower bias now dominates the total.
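A compact simulation of this experiment, assuming the two sample points are drawn uniformly from $[-1, 1]$, the labels are noiseless, and both hypothesis sets are fit by least squares (the usual presentation of this example); the printed numbers should land near the bias/variance figures quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(np.pi * x)
x_test = np.linspace(-1, 1, 201)
K, N = 20_000, 2                                    # many tiny training sets

h0 = np.empty((K, x_test.size))                     # H0: constants h(x) = b
h1 = np.empty((K, x_test.size))                     # H1: lines     h(x) = a*x + b

for k in range(K):
    x = rng.uniform(-1, 1, size=N)
    y = f(x)
    h0[k] = y.mean()                                # best constant for the 2 points
    a, b = np.polyfit(x, y, deg=1)                  # line through the 2 points
    h1[k] = a * x_test + b

for name, preds in (("H0 (constant)", h0), ("H1 (line)", h1)):
    g_bar = preds.mean(axis=0)                      # average hypothesis for this H
    bias = np.mean((g_bar - f(x_test)) ** 2)
    var = np.mean(preds.var(axis=0))
    print(f"{name}: bias ≈ {bias:.2f}  var ≈ {var:.2f}  total ≈ {bias + var:.2f}")
```

Re-running with a larger `N` shows the flip described above: both biases barely move while the variance of the line model collapses.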

The lesson: which model wins depends on the amount of data $N$. A complex model that overfits with little data may be the right choice with abundant data.

Bias-Variance vs VC

Both analyses get a handle on the out-of-sample error $E_{\text{out}}$, but they differ in:

|  | VC analysis | Bias–variance analysis |
| --- | --- | --- |
| Loss | 0–1 (binary error) | Squared error |
| Over what | Worst-case over $h \in \mathcal{H}$ | Expectation over $\mathcal{D}$ |
| Bound | $E_{\text{out}} \le E_{\text{in}} + \Omega(N, \mathcal{H}, \delta)$, with high probability | $\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = \text{bias} + \text{var}\ (+\,\sigma^2)$, exact in expectation |
| Distribution-free? | Yes | No (depends on $f$ and the input distribution) |

VC gives a uniform bound that holds with high probability; bias–variance gives a clean decomposition of the typical generalisation error and a conceptual picture of overfitting. The two complement rather than compete: VC dimension says whether learning generalises, bias–variance says what kind of error is left.

Practical Note

You can’t compute bias and variance from a single training set: they require an expectation over $\mathcal{D}$, which means access to many datasets from the same distribution. So bias–variance is mostly a conceptual tool for designing algorithms (telling you whether more data, more capacity, or stronger regularisation is the right move), not a quantity you measure directly.

Diagnostics:

  • High bias (underfitting): training and test error both high and similar.
  • High variance (overfitting): training error low, test error much higher.
  • Both: training error high and test error substantially higher still.
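A rough illustration of these symptoms; the target, noise level, sample sizes, and polynomial degrees below are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, sigma=0.1):                        # assumed noisy toy target
    x = rng.uniform(-1, 1, size=n)
    return x, np.sin(np.pi * x) + sigma * rng.normal(size=n)

x_tr, y_tr = make_data(20)                          # small training set
x_te, y_te = make_data(5_000)                       # large held-out set

for degree in (0, 3, 15):                           # too rigid, about right, too flexible
    coef = np.polyfit(x_tr, y_tr, deg=degree)
    err_tr = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)
    err_te = np.mean((np.polyval(coef, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train {err_tr:.3f}  test {err_te:.3f}")

# Typical pattern: degree 0  -> both errors high and similar   (high bias);
#                  degree 15 -> train error tiny, test error much larger (high variance).
```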

Remedies:

| Diagnosis | Move |
| --- | --- |
| High bias | More features, more layers, less regularisation |
| High variance | More data, simpler model, more regularisation, ensembling |
| Irreducible noise floor | Accept it: the noise is the noise |

Active Recall