A learning curve plots $E_{\text{in}}(N)$ and $E_{\text{out}}(N)$ as functions of training-set size $N$. The two curves converge as $N \to \infty$ — to the noise floor $\sigma^2$ for the right model class. The shape (gap and slope) reveals whether a model is underfitting, overfitting, or hitting irreducible error.
The Two Curves
For a fixed hypothesis set $\mathcal{H}$ and learning algorithm $\mathcal{A}$:
- $E_{\text{in}}(N)$ — expected training error, averaging over training sets of size $N$.
- $E_{\text{out}}(N)$ — expected test error of the same trained hypothesis on new data.
Plot both as functions of $N$. Universal qualitative behaviour:
- $E_{\text{in}}$ increases with $N$. With few points the model interpolates; adding points forces compromises.
- $E_{\text{out}}$ decreases with $N$. More data means a more representative training set and a hypothesis that generalises better.
- Both curves converge to the same asymptote — the best achievable error within $\mathcal{H}$, which equals $\sigma^2$ (irreducible noise) for a well-specified model.
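A minimal way to estimate these curves empirically is Monte Carlo: fix $\mathcal{H}$, repeatedly draw training sets of size $N$, fit, and average the errors. The sketch below does this for ordinary least squares on a synthetic noisy linear target (all names and constants are illustrative, not from the source):

```python
# Monte Carlo sketch of E_in(N) and E_out(N) for a fixed hypothesis set
# (OLS on a noisy linear target). Constants and names are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d, sigma = 5, 0.5                      # input dimension, noise std
w_star = rng.normal(size=d + 1)        # true weights (bias term included)

def make_data(n):
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    y = X @ w_star + sigma * rng.normal(size=n)
    return X, y

def errors(n, trials=200, n_test=2000):
    e_in, e_out = [], []
    X_test, y_test = make_data(n_test)
    for _ in range(trials):
        X, y = make_data(n)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)   # OLS fit
        e_in.append(np.mean((X @ w - y) ** 2))
        e_out.append(np.mean((X_test @ w - y_test) ** 2))
    return np.mean(e_in), np.mean(e_out)

for n in [6, 10, 20, 50, 200, 1000]:
    ein, eout = errors(n)
    print(f"N={n:5d}  E_in={ein:.3f}  E_out={eout:.3f}")
# E_in rises towards sigma^2 = 0.25, E_out falls towards it, the gap shrinks.
```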
Simple vs Complex Model
The shape of the curves changes qualitatively with model complexity:
Simple model (low VC dimension).
- Curves come together quickly — small gap even for small $N$.
- Both stabilise high above zero error: the model is structurally limited (high bias).
- Asymptote: $\text{bias} + \sigma^2$, with the bias contribution often dominant.
Complex model (high VC dimension).
- Curves are far apart for small $N$ — large generalisation gap. With $N$ less than $d_{\mathrm{vc}}$, $E_{\text{in}}$ can be near zero (model interpolates) while $E_{\text{out}}$ is huge.
- Need $N \gg d_{\mathrm{vc}}$ before they meet.
- Asymptote: lower than the simple model — closer to $\sigma^2$ — because $\mathcal{H}$ is expressive enough to fit $f$.
This visualises the regime distinction:
- Underfitting region: simple model, both curves near asymptote, asymptote is high.
- Overfitting region: complex model with $N$ too small, large gap.
- Sweet spot: complex model with $N$ large enough that the gap has closed (see the sketch below).
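To see the regime distinction concretely, the sketch below prints empirical learning curves for a simple and a complex model class. The setup (a noisy sine target, polynomial fits of degree 1 vs 10 via `np.polyfit`) is illustrative only:

```python
# Sketch: empirical learning curves for a simple vs a complex model class
# (degree-1 vs degree-10 polynomial fits on a noisy sine target).
import numpy as np

rng = np.random.default_rng(1)
sigma = 0.2

def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(np.pi * x) + sigma * rng.normal(size=n)

def curve(degree, sizes, trials=200, n_test=1000):
    x_test, y_test = sample(n_test)
    for n in sizes:
        e_in, e_out = [], []
        for _ in range(trials):
            x, y = sample(n)
            coeffs = np.polyfit(x, y, degree)        # least-squares polynomial fit
            e_in.append(np.mean((np.polyval(coeffs, x) - y) ** 2))
            e_out.append(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
        print(f"deg={degree:2d} N={n:4d}  "
              f"E_in={np.mean(e_in):.3f}  E_out={np.mean(e_out):.3f}")

sizes = [12, 25, 50, 100, 400]
curve(1, sizes)    # simple: curves meet early but flatten well above sigma^2
curve(10, sizes)   # complex: big gap at small N, lower asymptote once N is large
```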
Linear Regression (Closed Form)
For linear regression with a noisy target $y = \mathbf{w}^{*\top}\mathbf{x} + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$, with $d+1$ parameters and noise variance $\sigma^2$, the learning curve has an exact form:

$$\mathbb{E}\big[E_{\text{in}}(N)\big] = \sigma^2\left(1 - \frac{d+1}{N}\right), \qquad \mathbb{E}\big[E_{\text{out}}(N)\big] = \sigma^2\left(1 + \frac{d+1}{N}\right).$$
Reading off:
- $E_{\text{in}}$ starts at $0$ when $N = d+1$ (perfect fit) and rises towards $\sigma^2$.
- $E_{\text{out}}$ starts high above $\sigma^2$ and decays towards $\sigma^2$.
- Generalisation gap: $E_{\text{out}} - E_{\text{in}} = \dfrac{2\sigma^2(d+1)}{N}$, decaying as $1/N$.
Each parameter costs you $2\sigma^2/N$ of generalisation gap. With $N \gg d+1$, the gap is negligible and both errors sit at $\sigma^2$.
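A quick simulation can be checked against this closed form. One caveat worth flagging: the $E_{\text{in}}$ expression is exact in expectation, while the $E_{\text{out}}$ expression is the textbook approximation and is only tight once $N$ is comfortably larger than $d+1$. The setup below (Gaussian inputs, noisy test targets) is an illustrative sketch:

```python
# Check the closed-form linear-regression learning curve against simulation.
import numpy as np

rng = np.random.default_rng(2)
d, sigma = 3, 1.0
w_star = rng.normal(size=d + 1)

def dataset(n):
    X = np.hstack([np.ones((n, 1)), rng.normal(size=(n, d))])
    return X, X @ w_star + sigma * rng.normal(size=n)

for n in [10, 20, 50, 200, 1000]:
    e_in = e_out = 0.0
    trials = 2000
    for _ in range(trials):
        X, y = dataset(n)
        w, *_ = np.linalg.lstsq(X, y, rcond=None)
        Xt, yt = dataset(n)                      # fresh noisy test set
        e_in += np.mean((X @ w - y) ** 2) / trials
        e_out += np.mean((Xt @ w - yt) ** 2) / trials
    pred_in = sigma**2 * (1 - (d + 1) / n)       # exact: sigma^2 (1 - (d+1)/N)
    pred_out = sigma**2 * (1 + (d + 1) / n)      # approximation; loose at small N
    print(f"N={n:4d}  E_in={e_in:.3f} (theory {pred_in:.3f})  "
          f"E_out={e_out:.3f} (theory {pred_out:.3f})")
```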
VC vs Bias–Variance View
Two equivalent ways to draw the same curves:
VC view. Shade the region between the curves as the generalisation error — what the VC bound controls. The bound is a uniform statement about how far the curves can be apart.
Bias–variance view. Draw a horizontal line at the bias level (the error of the best hypothesis in $\mathcal{H}$), separating the area under it (= bias) from the area between that line and the $E_{\text{out}}$ curve (= variance), plus the noise asymptote. Now the picture decomposes the error into the structural minimum (bias) and the data-dependent fluctuation (variance).
Both pictures live on top of the same two curves — they’re complementary lenses, not competing analyses.
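In symbols, assuming the standard forms (the VC penalty $\Omega$ and the average hypothesis $\bar g$ are as usually defined, not taken from this note):

$$E_{\text{out}}(N) \;\le\; E_{\text{in}}(N) + \Omega(N, \mathcal{H}, \delta) \quad \text{with probability} \ge 1 - \delta \qquad \text{(VC view)}$$

$$\mathbb{E}_{\mathcal{D}}\big[E_{\text{out}}\big] \;=\; \underbrace{\mathbb{E}_x\big[(\bar g(x) - f(x))^2\big]}_{\text{bias}} \;+\; \underbrace{\mathbb{E}_{x,\mathcal{D}}\big[(g^{(\mathcal{D})}(x) - \bar g(x))^2\big]}_{\text{variance}} \;+\; \sigma^2 \qquad \text{(bias–variance view)}$$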
Diagnostic Use
Plotting empirical learning curves (using held-out validation error as a proxy for $E_{\text{out}}$) is one of the most useful debugging tools in ML:
- Both curves high, close together: high bias. Try a more expressive model.
- Training curve low, validation curve high: high variance. Get more data, regularise, or simplify.
- Both flatten out at the same value: hitting the noise floor — diminishing returns from more data.
- Curves still descending at the right edge: more data would help.
This is also how people decide whether to invest in collecting more data: if the validation curve has plateaued, more data won’t help; if it’s still falling, it might.
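A sketch of this workflow with scikit-learn's `learning_curve` follows; the data and model are synthetic placeholders, so swap in your own `X`, `y`, and estimator:

```python
# Diagnostic learning curves via scikit-learn (illustrative setup).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic stand-in data so the snippet runs on its own.
rng = np.random.default_rng(3)
X = rng.normal(size=(500, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * rng.normal(size=500)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.1, 1.0, 8),
    cv=5, scoring="neg_mean_squared_error",
)
train_err = -train_scores.mean(axis=1)   # training error (proxy for E_in)
val_err = -val_scores.mean(axis=1)       # validation error (proxy for E_out)

for n, tr, va in zip(sizes, train_err, val_err):
    print(f"N={n:4d}  train={tr:.3f}  val={va:.3f}  gap={va - tr:.3f}")
# Both high & close -> add capacity; big gap -> more data / regularise;
# both flat at the same value -> you are at the noise floor.
```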
Related
- bias-variance-decomposition — the conceptual framework for what learning curves show.
- generalization-bound — the worst-case bound on the gap between curves.
- vc-dimension — controls the size of the gap.
- ridge-regression — regularisation flattens the variance contribution, narrowing the gap.
Active Recall
For linear regression with $d+1$ parameters and noisy target ($\epsilon \sim \mathcal{N}(0, \sigma^2)$), the expected in-sample error is $\sigma^2\left(1 - \frac{d+1}{N}\right)$. Why does it start at zero when $N = d+1$ and rise rather than fall as $N$ increases?
When $N = d+1$, the model has exactly enough parameters to interpolate $N$ points — $E_{\text{in}} = 0$. As $N$ grows, the model can no longer interpolate every point, so the residuals are no longer zero on average. The factor $\left(1 - \frac{d+1}{N}\right)$ captures the fraction of “noise variance left over” after the OLS fit absorbs $d+1$ degrees of freedom. As $N \to \infty$, $E_{\text{in}} \to \sigma^2$ — the irreducible noise floor.
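A tiny check of the interpolation claim (illustrative constants; any full-rank square design works):

```python
# With N = d+1 points, OLS interpolates exactly and E_in = 0.
import numpy as np

rng = np.random.default_rng(4)
d, sigma = 4, 1.0
N = d + 1                                        # as many points as parameters
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, d))])
y = X @ rng.normal(size=d + 1) + sigma * rng.normal(size=N)

w = np.linalg.solve(X.T @ X, X.T @ y)            # normal equations, square system
print(np.mean((X @ w - y) ** 2))                 # ~0: the noise is fit exactly
```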
Why are the in-sample and out-of-sample learning curves of a complex model far apart for small $N$, but close together for large $N$?
A complex model has many parameters; with few training points, it can fit the training data essentially perfectly ($E_{\text{in}}$ near zero) by aligning to noise. The same model on new data would produce predictions inconsistent with the noisy fit, so $E_{\text{out}}$ is large. As $N$ grows past the model’s effective capacity ($N \gg d_{\mathrm{vc}}$), the model can no longer interpolate, $E_{\text{in}}$ rises towards the noise floor, $E_{\text{out}}$ falls towards the same floor, and the gap closes. The generalisation gap formula for linear regression is a clean instance: gap $= \dfrac{2\sigma^2(d+1)}{N}$.
You plot empirical learning curves and find that both training and validation error are about 0.30 and very close together, even at large $N$. Diagnose the issue and suggest a remedy.
Both errors high and close together is the classical high-bias (underfitting) signature. The model class isn’t expressive enough to capture the structure of $f$ — the average hypothesis is far from the truth. Adding more data won’t help (the curves have already converged). Remedies: increase model capacity (more parameters, deeper network, higher-degree polynomial, richer kernel), reduce regularisation, or add useful features. If after these the error is still 0.30, you may be hitting the irreducible noise floor — though 0.30 is high enough that you should suspect bias first.