THE CRUX: We've spent two weeks building learning theory — VC dimension, generalisation bound, bias–variance — that explains why models overfit. Now what do we actually do about it? Specifically: how do we move from theoretical understanding (model complexity should match data) to practical procedure (this matrix, this $\lambda$, this code)? (1) What does overfitting actually look like, and what causes it? (2) How does the constrained-optimisation view of "smaller hypothesis set" become the unconstrained augmented error $E_{\text{aug}}(w)$, and how does that recover ridge regression?
The two halves of week 10 each answer one. (1) Overfitting is the regime where $E_{\text{in}}$ keeps falling while $E_{\text{out}}$ rises — the model captures noise rather than structure. It is driven by some combination of data scarcity, model complexity, stochastic noise ($\sigma^2$ in the labels), and deterministic noise (target complexity that the hypothesis class can’t represent). The “two-learners” experiment shows the surprising result: even when both learners know the target is degree-10, a degree-2 learner beats a degree-10 learner on test error if $N$ is small — because the right hypothesis class isn’t enough; you need data to constrain it. (2) Regularisation is the structural cure. Start from the constrained problem $\min_w E_{\text{in}}(w)$ s.t. $w^T w \le C$ — geometrically, a ball-shaped hypothesis set of radius $\sqrt{C}$. The Lagrangian transforms this into the unconstrained augmented error $E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N} w^T w$, with $\lambda$ inversely related to $C$. Solving this for linear regression gives $w_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T y$ — exactly ridge regression. The L1 analogue, lasso, replaces the round constraint ball with a corner-bearing diamond and produces sparse solutions.
Part 1: What Overfitting Actually Looks Like
Two Pictures, One Disease
Recall from week 9 that the VC bound gives $E_{\text{out}}$ a U-shape against model complexity: too simple → underfitting (high bias, $E_{\text{in}}$ already large), too complex → overfitting (low $E_{\text{in}}$ but $E_{\text{out}}$ blows up). The minimum is somewhere in the middle.
A canonical example, lifted from the lecture: noisy points sampled from a smooth target.
| Model | $E_{\text{in}}$ | $E_{\text{out}}$ |
|---|---|---|
| Degree-2 polynomial | | |
| Degree-10 polynomial | | |
The degree-10 model fits the training points to numerical accuracy and produces test error six orders of magnitude worse. The training fit is a function with extreme oscillations, hitting every training point but exploding everywhere else.
The diagnostic signature: low $E_{\text{in}}$, high $E_{\text{out}}$ — that’s overfitting. Both errors high together is underfitting. Both errors low is the goal.
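A minimal numpy sketch of this kind of experiment (the sine target, the noise level, and $N = 15$ are assumptions rather than the lecture's exact setup, so the numbers will differ from the table above):

```python
# Sketch: fit degree-2 and degree-10 polynomials to a few noisy samples of a
# smooth target, then compare in-sample vs out-of-sample squared error.
import numpy as np

rng = np.random.default_rng(0)
N, sigma = 15, 0.1
target = lambda x: np.sin(np.pi * x)          # assumed smooth target

x_train = rng.uniform(-1, 1, N)
y_train = target(x_train) + sigma * rng.standard_normal(N)
x_test = rng.uniform(-1, 1, 1000)
y_test = target(x_test) + sigma * rng.standard_normal(1000)

for degree in (2, 10):
    w = np.polyfit(x_train, y_train, degree)              # least-squares fit
    e_in = np.mean((np.polyval(w, x_train) - y_train) ** 2)
    e_out = np.mean((np.polyval(w, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}:  E_in = {e_in:.5f}   E_out = {e_out:.5f}")
```

The degree-10 fit drives $E_{\text{in}}$ toward zero while its $E_{\text{out}}$ blows up; the degree-2 fit keeps the two close.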
The Six Causes
The lecture lists six common drivers of overfitting. Each compounds the others, and real overfitting usually involves several.
- Model too complex. A million-parameter network on 1,000 examples memorises rather than generalises.
- Too little training data. A sentiment model on 50 customer reviews fits unique phrases instead of patterns.
- Too many training epochs. Validation error bottoms out, then climbs; training error keeps falling. (Early stopping is the cure.)
- Lack of regularisation. Without constraints, weights take whatever values minimise $E_{\text{in}}$ — including extreme values that fit noise.
- High-variance features or noisy labels. Random ID-number features become decision-tree splits; mislabelled examples drag the boundary.
- Poor data processing. Unstandardised inputs, leakage from validation into training, imbalanced classes.
The Two-Learners Experiment
The lecture’s most striking demonstration — and the one most worth internalising — is the comparison of two learners on the same target:
- Learner O (Overfit): picks $\mathcal{H}_{10}$ — a degree-10 polynomial.
- Learner R (Restrict): picks $\mathcal{H}_2$ — a degree-2 polynomial.
Both are told the truth: the target is degree-10. Yet R deliberately uses an under-expressive model class. Run the experiment with $N = 15$ training points:
| Target | $E_{\text{out}}$ (Learner R, degree 2) | $E_{\text{out}}$ (Learner O, degree 10) |
|---|---|---|
| Noisy degree-10 target | | |
| Noiseless degree-50 target | | |
Even though $\mathcal{H}_2$ structurally cannot fit the target, Learner R wins by orders of magnitude on test error.
ASIDE — A counterintuitive lesson worth pausing on
Knowing the right hypothesis class is not enough. With insufficient data, the high-capacity learner has so much variance that any bias saving evaporates. This is the bias–variance trade-off shouting at you: deliberate underfitting beats accurate complexity-matching when $N$ is small. The implication for practice: start simple, scale capacity only as data grows. Modern deep learning works precisely because it pairs huge capacity with huge $N$ — not because capacity alone is virtuous.
The two-learners experiment shows degree-2 wins even when the target is degree-10. Where, conceptually, does the win come from?
Variance. With $N = 15$, the degree-10 learner has 11 parameters fitting 15 points — barely constrained, so the trained polynomial swings wildly between training points and produces huge test error. The degree-2 learner has 3 parameters fitting 15 points — heavily constrained, so the fit is stable across training sets. The bias the degree-2 learner pays (it can’t represent the degree-10 truth) is dwarfed by the variance the degree-10 learner pays (it fits noise enthusiastically). At larger $N$, the picture flips — variance shrinks and bias starts to matter — but at $N = 15$, low capacity wins.
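A sketch of the resampling view (the random degree-10 target and the noise level are assumptions; the qualitative gap is the point): refit both learners on many fresh training sets of size 15 and compare squared bias against variance.

```python
# Sketch: estimate bias^2 and variance for the two learners by refitting on
# many independently drawn training sets of size N = 15.
import numpy as np

rng = np.random.default_rng(1)
N, trials, sigma = 15, 500, 0.1
true_w = rng.standard_normal(11)              # fixed degree-10 target (assumed)
x_grid = np.linspace(-1, 1, 200)              # fixed evaluation points
f_grid = np.polyval(true_w, x_grid)

preds = {2: [], 10: []}
for _ in range(trials):
    x = rng.uniform(-1, 1, N)
    y = np.polyval(true_w, x) + sigma * rng.standard_normal(N)
    for d in (2, 10):
        preds[d].append(np.polyval(np.polyfit(x, y, d), x_grid))

for d in (2, 10):
    P = np.array(preds[d])                            # shape (trials, grid)
    bias2 = np.mean((P.mean(axis=0) - f_grid) ** 2)   # squared bias, avg over x
    var = np.mean(P.var(axis=0))                      # variance, avg over x
    print(f"degree {d:2d}:  bias^2 = {bias2:.3f}   variance = {var:.3f}")
```

The degree-2 learner carries the larger bias; the degree-10 learner carries a far larger variance, which is exactly where its test error comes from.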
Stochastic and Deterministic Noise
The bias–variance decomposition with noise has the form
$$\mathbb{E}_{\mathcal{D}}[E_{\text{out}}] = \sigma^2 + \text{bias}^2 + \text{variance}.$$
The $\sigma^2$ floor — irreducible stochastic noise — comes from randomness in the labels. But there’s a second kind:
Deterministic noise is the part of the target $f$ that the chosen $\mathcal{H}$ cannot represent. Try to fit a degree-50 target with $\mathcal{H}_2$: the residual $f - h^*$ (where $h^*$ is the best degree-2 approximation in $\mathcal{H}_2$) is a fixed function of $x$ — perfectly predictable in principle, but invisible to the learner because nothing in $\mathcal{H}_2$ matches it.
Both look identical to the trained model: residuals it can’t reduce. Both encourage overfitting. The four sources of “serious overfitting” (from the lecture’s heat-map experiment):
| Factor | Direction |
|---|---|
| Data size ↓ | Overfitting ↑ |
| Stochastic noise ↑ | Overfitting ↑ |
| Target complexity ↑ (deterministic noise) | Overfitting ↑ |
| Excessive model power | Overfitting ↑ |
TIP — Why high-capacity models overfit even on noiseless data
Fitting a noiseless degree-50 target, the degree-10 learner still overfits relative to the degree-2 learner when $N$ is small. The “noise” it is fitting isn’t randomness — it’s the part of the target that exceeds the model’s representable structure, and it pulls the fit in spurious directions just as label noise does. Even noiseless complex targets need regularisation when capacity is mismatched.
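A sketch that makes deterministic noise concrete (the degree-50 Legendre target and the uniform evaluation grid are assumptions): project a noiseless complex target onto $\mathcal{H}_2$ and look at the residual, a fixed function of $x$ that no degree-2 hypothesis can absorb.

```python
# Sketch: deterministic noise as the residual of the best H_2 fit to a
# complex *noiseless* target.
import numpy as np

x = np.linspace(-1, 1, 2001)
f = np.polynomial.legendre.Legendre.basis(50)(x)   # noiseless degree-50 target

h_star = np.polyval(np.polyfit(x, f, 2), x)        # best degree-2 approximation
det_noise = f - h_star                             # fixed function of x

print("mean squared deterministic noise:", np.mean(det_noise ** 2))
```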
Why MLE Cannot Help
MLE minimises in-sample negative log-likelihood. There’s nothing in MLE that prefers simple hypotheses to complex ones — the likelihood always rewards in-sample fit. So MLE on a high-capacity model interpolates the training data and overfits.
The Bayesian fix: combine the likelihood with a prior over $w$ that favours simpler hypotheses. The MAP estimate balances likelihood (data fit) against prior (simplicity) — and for a Gaussian prior, MAP is exactly ridge regression. Regularisation isn’t a hack — it’s MAP estimation. The probabilistic mechanics of why we should regularise are already in the framework from week 8.
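A short worked version of that claim, assuming (as in week 8) Gaussian label noise with variance $\sigma^2$ and a Gaussian prior $w \sim \mathcal{N}(0, \tau^2 I)$:

$$-\log p(w \mid \mathcal{D}) = \underbrace{\tfrac{1}{2\sigma^2}\lVert Zw - y\rVert^2}_{-\log \text{likelihood}} \; + \; \underbrace{\tfrac{1}{2\tau^2}\lVert w\rVert^2}_{-\log \text{prior}} \; + \; \text{const}.$$

Minimising this over $w$ is the same as minimising $\lVert Zw - y\rVert^2 + \tfrac{\sigma^2}{\tau^2}\lVert w\rVert^2$, i.e. ridge regression with $\lambda = \sigma^2/\tau^2$.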
Part 2: How to Actually Put On the Brakes
From Constrained to Augmented
The VC bound’s slogan: $E_{\text{out}} \le E_{\text{in}} + \Omega(\mathcal{H})$. Two levers (the fit term and the complexity term), both worth pulling.
The pragmatic implementation, slowly built up:
Hard constraint. Force certain weights to zero — say $w_q = 0$ for all $q > 2$. This is just “use $\mathcal{H}_2$ instead of $\mathcal{H}_{10}$” — a discrete capacity reduction. Crude but effective.
Looser constraint. Force at least 8 of the 10 weights to be zero, but let the algorithm pick which 8: $\sum_q \mathbb{1}[w_q \ne 0] \le 2$. More expressive than $\mathcal{H}_2$, less risky than $\mathcal{H}_{10}$. This is sparsity.
Soft constraint. Combinatorial sparsity is hard to optimise. Replace it with a continuous proxy: $\sum_q w_q^2 \le C$, a closed ball of radius $\sqrt{C}$. As $C$ ranges from $0$ to $\infty$, the constrained class smoothly interpolates between the all-zero model and the full $\mathcal{H}_{10}$.
So the regularised problem is
$$\min_w \; E_{\text{in}}(w) \quad \text{s.t.} \quad w^T w \le C.$$
The Lagrangian Move
Constrained optimisation is harder than unconstrained. The Lagrangian turns one into the other.
Geometrically: at the optimum $w_{\text{reg}}$ (assuming the constraint is active, $w_{\text{reg}}^T w_{\text{reg}} = C$), the gradient $\nabla E_{\text{in}}(w_{\text{reg}})$ must be parallel to the surface’s outward normal — otherwise we could slide along the surface and decrease $E_{\text{in}}$. The normal at $w_{\text{reg}}$ on the sphere is just $w_{\text{reg}}$ itself, so
$$\nabla E_{\text{in}}(w_{\text{reg}}) = -\frac{2\lambda}{N}\, w_{\text{reg}}$$
for some $\lambda > 0$. This is exactly the stationarity condition of the augmented error
$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\, w^T w.$$
So solving $\min_w E_{\text{in}}(w)$ s.t. $w^T w \le C$ is equivalent to unconstrained minimisation of $E_{\text{aug}}(w)$ — for some $\lambda \ge 0$ that depends monotonically on $C$ (larger budget $C$ ↔ smaller penalty $\lambda$).
This is the two-form duality of regularisation: pick whichever form is convenient.
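A minimal sketch of the unconstrained form in action (the data, step size, and $\lambda$ are assumptions): gradient descent on $E_{\text{aug}}$ is plain gradient descent plus a weight-decay term, and it lands on the closed-form ridge solution derived in the next subsection.

```python
# Sketch: minimising the augmented error by gradient descent amounts to
# ordinary gradient descent with "weight decay".
import numpy as np

rng = np.random.default_rng(2)
N, d, lam, eta = 50, 5, 1.0, 0.05
Z = rng.standard_normal((N, d))
y = Z @ rng.standard_normal(d) + 0.1 * rng.standard_normal(N)

w = np.zeros(d)
for _ in range(5000):
    grad_ein = (2 / N) * Z.T @ (Z @ w - y)     # gradient of E_in
    grad_pen = (2 * lam / N) * w               # gradient of (lambda/N) w^T w
    w -= eta * (grad_ein + grad_pen)           # weight-decay update

w_closed = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
print(np.allclose(w, w_closed, atol=1e-8))     # True: same minimiser
```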
The Solution
For linear regression, plug $E_{\text{in}}(w) = \frac{1}{N}\lVert Zw - y\rVert^2$ into $E_{\text{aug}}$ and set the gradient to zero:
$$\frac{2}{N}\, Z^T (Zw - y) + \frac{2\lambda}{N}\, w = 0.$$
Rearranging:
$$w_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T y.$$
The unconstrained OLS solution is $w_{\text{lin}} = (Z^T Z)^{-1} Z^T y$. The regularised solution differs by a single $\lambda I$ inside the inverse — the ridge regression solution. Adding $\lambda I$ also fixes ill-conditioning: every eigenvalue of $Z^T Z$ shifts up by $\lambda$, ensuring invertibility even when $Z^T Z$ alone is singular.
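A quick numerical check of the invertibility point (the duplicated feature column is a contrived assumption to force singularity):

```python
# Sketch: Z^T Z is singular, but Z^T Z + lambda*I is not, so the ridge
# solution stays well-defined.
import numpy as np

rng = np.random.default_rng(3)
N, lam = 30, 0.1
Z = rng.standard_normal((N, 4))
Z = np.hstack([Z, Z[:, :1]])                   # copy a column: Z^T Z is singular
y = rng.standard_normal(N)

A = Z.T @ Z
print(np.linalg.matrix_rank(A))                        # 4 < 5: normal equations fail
print(np.linalg.eigvalsh(A + lam * np.eye(5)).min())   # ~lambda > 0: invertible
w_reg = np.linalg.solve(A + lam * np.eye(5), Z.T @ y)  # ridge solution
print(w_reg)
```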
The same closed form was derived in week 8 from the Bayesian MAP perspective (with $\lambda = \sigma^2/\tau^2$, the ratio of noise variance to prior variance). Here it appears from constrained optimisation. Two derivations, same answer: regularisation is structural, with multiple equivalent justifications.
Effective VC Dimension
Why does this help generalisation? Recall the VC bound: $E_{\text{out}} \le E_{\text{in}} + \Omega(d_{\text{VC}})$, with $\Omega$ growing in $d_{\text{VC}}$.
The nominal $\mathcal{H}$ has $d_{\text{VC}}(\mathcal{H})$. But the regularised algorithm $\mathcal{A}$ only navigates within $\{w : w^T w \le C\}$ — the ball of radius $\sqrt{C}$ — so the effective complexity is $d_{\text{eff}}(\mathcal{H}, \mathcal{A})$, smaller than the nominal $d_{\text{VC}}(\mathcal{H})$.
The slogan: $d_{\text{VC}}(\mathcal{H})$ large, while $d_{\text{eff}}(\mathcal{H}, \mathcal{A})$ small if $\mathcal{A}$ is regularised. The hypothesis class is structurally rich, but the algorithm only uses a constrained subset — the SVM’s fat-hyperplane story from week 9, generalised.
Choosing $\lambda$ — The U-Curve
Plot expected $E_{\text{out}}$ against $\lambda$: the curve is U-shaped. Sharp drop at small $\lambda$ (overfitting cured), minimum at some $\lambda^*$, slow rise (underfitting setting in). Practical observation: the U is steep on the left and shallow on the right — better to err on the side of slightly more regularisation than too little.
The optimal $\lambda^*$ depends on three things you don’t directly observe:
- Stochastic noise $\sigma^2$: more noise → more regularisation needed (more bumps, more brakes).
- Deterministic noise (target complexity): more → more regularisation.
- Data size $N$: more data → less need to regularise.
Since none of these are observable, $\lambda$ is chosen by validation — the second of the lecture’s “two cures” and the topic of next week. Try a grid of $\lambda$’s, hold out a portion of the training data, and pick the $\lambda$ minimising held-out error.
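A minimal sketch of that procedure (the synthetic data, the 40/20 split, and the $\lambda$ grid are assumptions; the procedure, not the specific numbers, is the point):

```python
# Sketch: sweep a grid of lambdas, fit ridge on the training portion, and
# pick the lambda with the lowest held-out error.
import numpy as np

rng = np.random.default_rng(4)
N, d = 60, 20
Z = rng.standard_normal((N, d))
y = Z @ rng.standard_normal(d) + 0.5 * rng.standard_normal(N)

Zt, yt = Z[:40], y[:40]                        # train on 40 points
Zv, yv = Z[40:], y[40:]                        # hold out 20 for validation

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]
val_err = {lam: np.mean((Zv @ ridge(Zt, yt, lam) - yv) ** 2) for lam in grid}
for lam in grid:
    print(f"lambda = {lam:6.2f}   held-out error = {val_err[lam]:.3f}")
print("chosen lambda:", min(val_err, key=val_err.get))
```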
Why is the U-curve of $E_{\text{out}}$ vs $\lambda$ steep on the left and shallow on the right?
On the left ($\lambda$ near zero), the model is barely regularised and overfitting is severe; small increases in $\lambda$ rapidly shrink the generalisation gap. On the right (large $\lambda$), the model is already heavily constrained, so the marginal cost of extra regularisation is small — the error rises only slowly into the underfitting regime. The asymmetry suggests that when you don’t know $\lambda^*$, biasing your choice slightly higher than your guess is safer than biasing lower — the cost of mild underfitting is much smaller than the cost of mild overfitting.
L1 vs L2 — Different Geometries
The L2 regulariser is the natural choice for differentiable, closed-form optimisation. But it’s not the only one. The general family of constraints is
$$\sum_q |w_q|^p \le C, \qquad p \ge 1.$$
The $p = 1$ case — lasso — is qualitatively different. The constraint set $\sum_q |w_q| \le C$ is a diamond (cross-polytope) with corners on the coordinate axes. The optimum (where the elliptical contours of $E_{\text{in}}$ first touch the diamond) tends to land on a corner — a sparse vector with some coordinates exactly zero.
The L2 ball is round, no corners; the optimum is generically a smooth point with all coordinates non-zero. The L1 diamond is corner-rich, with corners precisely at sparse vectors; the optimum lands there generically.
Practical implication. Use lasso when you believe the true model is sparse — most features irrelevant — and you want automatic feature selection. Use ridge when you expect a dense model where most features matter. Elastic net (L1 + L2) hybrids combine sparsity with stability under correlated features.
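A sketch of the practical difference using scikit-learn’s Ridge and Lasso (the synthetic sparse target and the alpha values are assumptions):

```python
# Sketch: ridge keeps essentially every coefficient non-zero, while lasso
# zeroes most of them when the underlying target is sparse.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(5)
N, d = 100, 30
X = rng.standard_normal((N, d))
w_true = np.zeros(d)
w_true[:3] = [4.0, -2.0, 3.0]                 # only 3 features actually matter
y = X @ w_true + 0.5 * rng.standard_normal(N)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge non-zeros:", int(np.sum(np.abs(ridge.coef_) > 1e-6)))   # dense
print("lasso non-zeros:", int(np.sum(np.abs(lasso.coef_) > 1e-6)))   # sparse
```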
ASIDE — All of these have Bayesian counterparts
L2 ↔ Gaussian prior. L1 ↔ Laplace prior (sharp peak at zero, drives sparsity). Mixtures ↔ elastic net. The Bayesian view explains why L1 is sparse — the Laplace’s sharp peak at zero pins the MAP estimate there in the absence of strong data evidence — and why L2 is smooth (the Gaussian’s rounded peak doesn’t pin any coordinate at exactly zero). Each regulariser is a probabilistic statement about which weight vectors are a priori plausible.
Recovering the Bayesian View
The MAP estimate from week 8’s bayesian-linear-regression derivation,
$$w_{\text{MAP}} = \Big(Z^T Z + \tfrac{\sigma^2}{\tau^2} I\Big)^{-1} Z^T y,$$
is identical to ridge regression with $\lambda = \sigma^2/\tau^2$. So:
- The constrained-optimisation derivation (this week, geometric/Lagrangian) and
- The Bayesian MAP derivation (week 8, probabilistic/prior)
both arrive at the same closed form $w_{\text{reg}} = (Z^T Z + \lambda I)^{-1} Z^T y$. Each derivation illuminates a different facet:
- Constrained view: regularisation is “use a smaller hypothesis set” written as Lagrangian penalty.
- Bayesian view: regularisation is MAP estimation under a prior on $w$.
TIP — The lecture's practical advice in one slogan
“Whenever you train a model, try including regularisation.” The benefit (variance reduction, more graceful behaviour with noise) almost always outweighs the cost (slight bias). Given a sensible $\lambda$ from validation, regularisation is one of the lowest-effort, highest-impact moves in practical ML.
Concepts Introduced This Week
- overfitting — low $E_{\text{in}}$, high $E_{\text{out}}$. Caused by model complexity, data scarcity, stochastic noise, deterministic noise. The two-learners example shows that degree-2 beats degree-10 even when the target is degree-10, provided $N$ is small. Underfitting is the symmetric failure (both errors high).
- regularization — augmenting $E_{\text{in}}$ with a complexity penalty to bias the optimiser away from overfitting. Constrained ($\min_w E_{\text{in}}$ s.t. $w^T w \le C$) ↔ unconstrained ($E_{\text{aug}} = E_{\text{in}} + \frac{\lambda}{N} w^T w$) via Lagrangian. The augmented error is a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone. Effective VC dimension is smaller than nominal when the algorithm is regularised.
- lasso-regression — L1 regularisation. The diamond-shaped constraint set has corners on coordinate axes, so the optimum is sparse. Useful when the true model is sparse; corresponds to a Laplace prior in the Bayesian view.
Connections
- Builds on week-08: the Bayesian MAP derivation and ridge-regression are the same closed form we re-derive this week from constrained optimisation. Two paths, one destination.
- Builds on week-09: the VC bound’s $\Omega$ complexity term motivates regularisation; the bias–variance decomposition explains why trading a little bias for a large variance reduction is sensible. The “two-learners” experiment is bias–variance in action.
- Builds on non-linear-transformation: high-degree polynomial bases blow up the VC dimension, making regularisation essential. Without it, even moderate-degree polynomials overfit catastrophically on small $N$.
- Sets up validation and cross-validation (week 11): we keep saying “pick $\lambda$ by validation” — next week makes that procedure precise. Cross-validation, train/validation/test splits, model selection more broadly.
Open Questions
- How is $\lambda$ actually chosen? “Validation” is the high-level answer; the practical procedure (k-fold CV, bias of the validation estimator, the 1-SE rule) is next week’s content.
- What other regularisation tricks does deep learning use? Dropout, batch norm, data augmentation, weight tying, early stopping — each has a “what’s the implicit $\Omega(w)$?” interpretation, often in terms of equivalent Bayesian priors. Active research.
- How to choose between L1 and L2 in practice? Heuristically: L1 if you suspect sparsity (most features irrelevant), L2 otherwise. Elastic net hedges. Cross-validate on both and let the data decide.
- Why does deep learning generalise despite minimal explicit regularisation? Implicit regularisation of SGD (toward flat minima), architectural inductive biases, and dataset-level structure — beyond classical VC + regularisation theory.