Augmenting the training objective with a complexity penalty $\Omega(h)$, scaled by $\frac{\lambda}{N}$, to bias the optimiser towards simpler hypotheses. The constrained form $\min_w E_{\text{in}}(w)$ s.t. $w^\top w \le C$ and the unconstrained form $\min_w E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$ are Lagrangian-equivalent. The augmented objective $E_{\text{aug}}$ is a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone, with $\Omega(h)$ standing in for the model-complexity penalty $\Omega(\mathcal{H})$ in the VC bound.
The Motivation
The VC bound says
$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H})$$
where $\Omega(\mathcal{H})$ is the complexity of the hypothesis set. Two ways to keep $E_{\text{out}}$ small:
- Make $E_{\text{in}}$ small — fit the training data.
- Make $\Omega(\mathcal{H})$ small — use a simpler hypothesis set.
Pure ERM (empirical risk minimisation) only attacks the first. Regularisation attacks both: it minimises $E_{\text{in}}$ subject to a constraint on hypothesis complexity, or equivalently, minimises $E_{\text{in}}(h) + \frac{\lambda}{N}\Omega(h)$, where $\Omega(h)$ measures the complexity of an individual hypothesis (a proxy for the set that the algorithm actually selects from).
The Constrained View
The regularisation idea, slowly introduced:
Hard constraint. Force certain weights to zero: minimise $E_{\text{in}}(w)$ subject to $w_q = 0$ for $q \ge 3$. This is equivalent to using $\mathcal{H}_2$ instead of $\mathcal{H}_{10}$ — a discrete jump in capacity.
Looser constraint. Force at least 8 of the weights $w_0, \dots, w_{10}$ to zero — but let the algorithm decide which ones. Now $\mathcal{H}_2' = \{ w : \sum_q [\![ w_q \neq 0 ]\!] \le 3 \}$. More expressive than $\mathcal{H}_2$ (you can choose which dimensions matter), less risky than $\mathcal{H}_{10}$ (still constrained). Mathematically a sparsity constraint.
Soft constraint. The combinatorial sparsity constraint is hard to optimise. Replace it with a continuous proxy:
$$\mathcal{H}(C) = \{ w : w^\top w \le C \}$$
The hypothesis set $\mathcal{H}(C)$ is a closed ball of radius $\sqrt{C}$. As $C$ ranges from $0$ to $\infty$, $\mathcal{H}(C)$ smoothly interpolates between the zero hypothesis and the full $\mathcal{H}_{10}$. The training problem becomes
$$\min_w E_{\text{in}}(w) \quad \text{s.t.} \quad w^\top w \le C.$$
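One way to attack this constrained form directly is projected gradient descent: take an ordinary gradient step on $E_{\text{in}}$, then rescale $w$ back onto the ball whenever it leaves. A minimal numpy sketch, with synthetic `Z`, `y` and an illustrative budget `C` (all placeholder values, not from the lecture):

```python
import numpy as np

# Projected gradient descent for:  min_w E_in(w)  s.t.  w^T w <= C
# Z, y, C, eta are illustrative placeholders.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))            # feature matrix (N=50, d=10)
y = rng.normal(size=50)                  # targets
C, eta = 1.0, 0.01                       # weight budget and step size

def grad_Ein(w):
    # gradient of E_in(w) = (1/N) * ||Zw - y||^2
    return (2.0 / len(y)) * Z.T @ (Z @ w - y)

w = np.zeros(Z.shape[1])
for _ in range(5000):
    w -= eta * grad_Ein(w)               # unconstrained gradient step
    if w @ w > C:                        # left the ball? project back:
        w *= np.sqrt(C / (w @ w))        # rescale onto the surface w^T w = C

print(w @ w)                             # <= C: constraint respected
```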
From Constraint to Lagrangian
Constrained optimisation is harder than unconstrained. The Lagrangian turns it into the latter.
Geometrically: the optimal $w_{\text{reg}}$ either sits inside the ball ($w_{\text{reg}}^\top w_{\text{reg}} < C$, the constraint is inactive) or on the surface ($w_{\text{reg}}^\top w_{\text{reg}} = C$, active). When active, two facts:
- $\nabla E_{\text{in}}(w_{\text{reg}})$ is perpendicular to the surface (otherwise we could slide along the surface and decrease $E_{\text{in}}$ without violating the constraint).
- The normal to the surface at $w_{\text{reg}}$ is the vector $w_{\text{reg}}$ itself.
Combining: $\nabla E_{\text{in}}(w_{\text{reg}})$ is parallel to $w_{\text{reg}}$, i.e.,
$$\nabla E_{\text{in}}(w_{\text{reg}}) = -\frac{2\lambda}{N}\, w_{\text{reg}}$$
for some $\lambda \ge 0$. This is exactly the vanishing-gradient condition for the augmented error
$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\, w^\top w.$$
So solving the constrained problem with budget $C$ is equivalent to unconstrained minimisation of $E_{\text{aug}}$ with some specific $\lambda(C)$. The correspondence is inverse: $C \uparrow \;\leftrightarrow\; \lambda \downarrow$ — bigger budget means lighter penalty.
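A quick numerical check of this correspondence, as a sketch with synthetic data: solve the unconstrained ridge problem for increasing $\lambda$ and read off the budget $C = w_{\text{reg}}^\top w_{\text{reg}}$ that each $\lambda$ implicitly enforces; it shrinks monotonically.

```python
import numpy as np

# As lambda grows, the norm of the penalised minimiser shrinks:
# each lambda implicitly selects a budget C = ||w_reg||^2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
y = rng.normal(size=50)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # unconstrained minimiser of ||Zw - y||^2 + lam * ||w||^2
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(10), Z.T @ y)
    print(f"lambda = {lam:6.1f}   implicit C = {w @ w:.4f}")
```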
TIP — Two-form duality is convenient
The constrained form is interpretable (“keep $w^\top w$ below $C$”); the augmented form is solvable (just gradient-step on $E_{\text{aug}}$). Use the form that fits your purpose. They give the same $w_{\text{reg}}$ for paired $(C, \lambda)$ values.
The Augmented Error
The augmented error is
$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h).$$
The two components:
- $\Omega(h)$ — the regulariser: a function of the hypothesis itself (e.g., $\Omega(w) = w^\top w$).
- $\lambda$ — the regularisation parameter: how aggressively to apply the brakes.
Heuristic interpretation. $\Omega(h)$ stands in for the hypothesis-set complexity $\Omega(\mathcal{H})$ in the VC bound. We can’t bound $\Omega(\mathcal{H})$ directly during training (it’s a property of the whole set), but $\Omega(h)$ for the chosen $h$ is computable and correlates with what hypothesis-set complexity we’d need to “actually contain” $h$.
Practical interpretation. Minimising $E_{\text{aug}}$ uses a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone. We can’t measure $E_{\text{out}}$, but we can mimic the bound’s two terms (data fit + complexity) and minimise their sum.
Effective VC Dimension
The nominal hypothesis set $\mathcal{H}$ has its full VC dimension $d_{\text{VC}}(\mathcal{H})$ — all weight vectors are considered candidates. But after regularisation, the algorithm $\mathcal{A}$ only navigates within $\mathcal{H}(C)$ — the ball of radius $\sqrt{C}$ — so the effective VC dimension
$$d_{\text{eff}}(\mathcal{H}, \mathcal{A}) = d_{\text{VC}}(\mathcal{H}(C))$$
is smaller. This is the formal sense in which regularised algorithms generalise: they implicitly select from a smaller hypothesis set than the nominal $\mathcal{H}$, even though the nominal $\mathcal{H}$ remains expressive enough to capture complex targets when regularisation is loose.
The slogan: $d_{\text{VC}}(\mathcal{H})$ large, while $d_{\text{eff}}(\mathcal{H}, \mathcal{A})$ small if $\mathcal{A}$ is regularised.
L2: Weight Decay
The most common regulariser is
$$\Omega(w) = w^\top w = \sum_q w_q^2.$$
For linear regression with this penalty, the augmented error is
$$E_{\text{aug}}(w) = \frac{1}{N}\left( \lVert Zw - y \rVert^2 + \lambda\, w^\top w \right).$$
Setting $\nabla E_{\text{aug}}(w) = 0$:
$$w_{\text{reg}} = (Z^\top Z + \lambda I)^{-1} Z^\top y.$$
Compare to the OLS solution $w_{\text{lin}} = (Z^\top Z)^{-1} Z^\top y$: the only change is $\lambda I$ added inside the inverse. This is ridge regression.
The L2 penalty is called weight decay because, in iterative optimisers, each step subtracts a multiple of $w$ itself — the update $w \leftarrow \left(1 - \tfrac{2\eta\lambda}{N}\right) w - \eta\, \nabla E_{\text{in}}(w)$ makes every weight “decay” toward zero. Larger $\lambda$ means stronger decay, shorter $\lVert w \rVert$, smaller effective $C$.
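To see the decay factor in action, a sketch (synthetic data, illustrative $\eta$ and $\lambda$): run gradient descent on $E_{\text{aug}}$ with the shrink-then-step update and confirm it converges to the closed-form ridge solution.

```python
import numpy as np

# Gradient descent on E_aug: shrink w by (1 - 2*eta*lam/N), then take the
# ordinary E_in step. Converges to the closed-form ridge solution.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
y = rng.normal(size=50)
N, lam, eta = len(y), 1.0, 0.01

w = np.zeros(10)
for _ in range(20_000):
    grad_Ein = (2.0 / N) * Z.T @ (Z @ w - y)
    w = (1 - 2 * eta * lam / N) * w - eta * grad_Ein   # weight-decay step

w_closed = np.linalg.solve(Z.T @ Z + lam * np.eye(10), Z.T @ y)
print(np.allclose(w, w_closed, atol=1e-8))             # True: same minimiser
```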
Choosing
The trade-off is sensitive: too little regularisation lets overfitting through, too much produces underfitting.
The curve of expected $E_{\text{out}}$ vs $\lambda$ is U-shaped: a sharp drop at small $\lambda$ (fixing overfitting), a minimum at some optimal $\lambda^*$, then a slow rise (underfitting). The optimal $\lambda$ depends on:
- Stochastic noise. More noise → larger $\lambda^*$ (more brakes for a bumpier road).
- Deterministic noise (target complexity beyond what $\mathcal{H}$ can capture). More deterministic noise → larger $\lambda^*$.
- Data size $N$. More data → smaller $\lambda^*$ (less need to regularise heavily).
In practice $\lambda^*$ is unknown — neither $\sigma^2$ nor target complexity is observable. The standard procedure is validation: try a grid of $\lambda$ values, hold out part of the training data, and pick the $\lambda$ minimising held-out error. This is Lec 1’s cross-validation half of the “two cures” — covered in detail next week.
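A minimal validation sweep, as a sketch (synthetic data; the split sizes and $\lambda$ grid are arbitrary placeholders):

```python
import numpy as np

# Choose lambda by validation: fit ridge on a train split for each lambda
# on a grid, keep the one with the lowest held-out squared error.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                      # simple synthetic target
y = Z @ w_true + 0.5 * rng.normal(size=100)        # noisy labels

Z_tr, y_tr = Z[:70], y[:70]                        # training split
Z_val, y_val = Z[70:], y[70:]                      # held-out split

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

errors = {lam: np.mean((Z_val @ ridge(Z_tr, y_tr, lam) - y_val) ** 2)
          for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]}
best = min(errors, key=errors.get)
print(f"best lambda = {best}, held-out error = {errors[best]:.4f}")
```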
General Regularisers
The L2 norm isn’t the only choice. The general family is
$$\Omega(w) = \sum_q \lvert w_q \rvert^p.$$
| $\Omega(w)$ | Name | Effect |
|---|---|---|
| $\sum_q w_q^2$ | weight decay / ridge | Shrinks weights smoothly toward 0 |
| $\sum_q \lvert w_q \rvert$ | lasso (sparsity) | Drives some weights to exactly 0 |
| $\sum_q \lvert w_q \rvert^p$, $p < 1$ | $L_p$ (non-convex) | Aggressive sparsity, harder to optimise |
L2 has a unique closed-form solution and is differentiable everywhere; L1 is non-differentiable at zero but produces sparse solutions, making it the right choice when you suspect the true model is sparse (most coefficients exactly zero). The $L_p$ regularisers with $p < 1$ are non-convex and rarely used outside specialised contexts.
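The sparsity contrast is easy to see with scikit-learn, assuming it is available (`alpha` plays the role of $\lambda$; the values here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# On a sparse target, lasso (L1) zeroes out irrelevant coefficients;
# ridge (L2) only shrinks them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]               # only 2 of 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge exact zeros:", np.sum(ridge.coef_ == 0))   # typically 0
print("lasso exact zeros:", np.sum(lasso.coef_ == 0))   # typically ~8
```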
The Bayesian view: each regulariser corresponds to a prior on $w$. L2 ↔ Gaussian, L1 ↔ Laplace, $L_p$ with $p < 1$ ↔ super-Gaussian (heavy tails and sharp peaks).
Practical Tips
The lecture’s “tricks and tips”:
- Try regularisation by default. It rarely hurts if $\lambda$ is reasonable. Sweep a grid and let validation pick.
- Higher noise → more regularisation. Heuristic argument: noise is “high frequency”, complex targets are also high frequency, so a low-frequency-favouring regulariser helps.
- Modern deep learning is full of regularisation. L2 weight decay, dropout, batch norm, data augmentation, early stopping — all variants of the same idea. Different $\Omega$’s, different $\lambda$’s. (A sketch follows below.)
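For instance (a PyTorch sketch; the architecture and hyperparameters are arbitrary), the same $\lambda$ knob shows up as the `weight_decay` argument of the optimiser, and dropout is just another regulariser layered into the model:

```python
import torch

# L2 weight decay and dropout in a standard PyTorch setup.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),        # dropout regulariser
    torch.nn.Linear(32, 1),
)
# weight_decay is the L2 penalty coefficient applied inside each update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```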
Related
- ridge-regression — L2 regularisation specifically applied to linear regression.
- lasso-regression — L1 regularisation; sparse solutions.
- bayesian-linear-regression — regularisation as MAP estimation under a prior.
- generalization-bound — the VC bound that motivates regularisation.
- overfitting — the disease that regularisation cures.
- lagrangian — the constrained-to-unconstrained translation underlying $E_{\text{aug}}$.
Active Recall
Show that the constrained optimisation $\min_w E_{\text{in}}(w)$ s.t. $w^\top w \le C$ is equivalent to the unconstrained $\min_w E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$ for some $\lambda \ge 0$. Explain the geometric intuition.
At the constrained optimum $w_{\text{reg}}$ (assuming the constraint is active, i.e. $w_{\text{reg}}$ lies on the surface $w^\top w = C$), $\nabla E_{\text{in}}(w_{\text{reg}})$ must be parallel to the outward normal of the surface — otherwise we could move along the surface to decrease $E_{\text{in}}$. The outward normal at $w_{\text{reg}}$ is $w_{\text{reg}}$ itself, so $\nabla E_{\text{in}}(w_{\text{reg}}) \parallel w_{\text{reg}}$, i.e. $\nabla E_{\text{in}}(w_{\text{reg}}) = -\frac{2\lambda}{N} w_{\text{reg}}$ for some $\lambda \ge 0$. This is exactly the stationarity condition for $E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$. The correspondence is monotone: larger $C$ (looser constraint) ↔ smaller $\lambda$ (lighter penalty).
Why is the augmented error a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone?
The VC bound has two terms: $E_{\text{in}}$ (fit) and $\Omega(\mathcal{H})$ (set complexity). Pure ERM minimises only the first; $E_{\text{out}}$ depends on both. $E_{\text{aug}}$ adds a term proxying the complexity penalty, so minimising it minimises a closer approximation to the upper bound on $E_{\text{out}}$. The proxy is heuristic — $\Omega(h)$ is computed for the chosen weights, not for the whole set — but for “well-chosen” regularisers (e.g., $\Omega(w) = w^\top w$) it correlates with the hypothesis-set complexity that the algorithm effectively uses.
Compute the closed-form regularised solution for L2 linear regression. Why does adding $\lambda I$ to $Z^\top Z$ also fix numerical conditioning issues?
Set $\nabla E_{\text{aug}}(w) = \frac{2}{N}\left( Z^\top (Zw - y) + \lambda w \right) = 0$. Multiplying through and rearranging: $(Z^\top Z + \lambda I) w = Z^\top y$, so $w_{\text{reg}} = (Z^\top Z + \lambda I)^{-1} Z^\top y$. Adding $\lambda I$ shifts every eigenvalue of $Z^\top Z$ up by $\lambda$, so a near-zero eigenvalue $\mu \approx 0$ (the source of ill-conditioning) becomes $\mu + \lambda$, well away from zero. The condition number — ratio of largest to smallest eigenvalue — improves, and the inverse is numerically stable even when $Z^\top Z$ alone would be singular or near-singular.
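A quick numpy illustration of the conditioning claim (synthetic data with deliberately near-collinear columns):

```python
import numpy as np

# Adding lambda*I lifts every eigenvalue of Z^T Z by lambda,
# collapsing the condition number of a near-singular matrix.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
Z[:, 9] = Z[:, 8] + 1e-6 * rng.normal(size=50)   # two nearly identical columns

A = Z.T @ Z
for lam in [0.0, 1e-3, 1.0]:
    cond = np.linalg.cond(A + lam * np.eye(10))
    print(f"lambda = {lam:6g}   cond = {cond:.3e}")
```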
What is the effective VC dimension of an L2-regularised hypothesis set, and why is it usually smaller than the nominal $d_{\text{VC}}$?
The nominal $d_{\text{VC}}(\mathcal{H})$ counts all weight vectors as candidates. With a regulariser, the algorithm only converges to weights inside the ball $w^\top w \le C$ — a strictly smaller set, with smaller VC dimension. The effective VC dimension is $d_{\text{eff}}(\mathcal{H}, \mathcal{A}) = d_{\text{VC}}(\mathcal{H}(C))$, where $C$ is the implicit budget set by $\lambda$. So even though the nominal model class is highly expressive, the regularised algorithm effectively selects from a smaller class — explaining why the regularised generalisation gap is smaller than a naive analysis would predict.
What's the practical procedure for choosing $\lambda$, and why can't you read it off the data directly?
$\lambda^*$ depends on $\sigma^2$ (stochastic noise), target complexity (deterministic noise), and $N$ — none of which is directly observable. The standard procedure is validation: try a grid of $\lambda$ values, train on a portion of the training data, evaluate held-out error on the remainder, and pick the $\lambda$ minimising mean held-out error (often via $k$-fold cross-validation). The held-out error is an unbiased estimate of $E_{\text{out}}$, so the chosen $\lambda$ is the one that empirically generalises best — without requiring knowledge of the noise levels.