Augmenting the training objective with a complexity penalty $\Omega(h)$, scaled by $\frac{\lambda}{N}$, to bias the optimiser towards simpler hypotheses. The constrained form $\min_w E_{\text{in}}(w)$ s.t. $w^\top w \le C$ and the unconstrained form $\min_w E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$ are Lagrangian-equivalent. The augmented objective $E_{\text{aug}}$ is a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone, with $\Omega(h)$ standing in for the model-complexity penalty $\Omega(\mathcal{H})$ in the VC bound.
The Motivation
The VC bound says
$$E_{\text{out}}(h) \le E_{\text{in}}(h) + \Omega(\mathcal{H})$$
where $\Omega(\mathcal{H})$ is the complexity of the hypothesis set. Two ways to keep $E_{\text{out}}$ small:
- Make $E_{\text{in}}$ small — fit the training data.
- Make $\Omega(\mathcal{H})$ small — use a simpler hypothesis set.
Pure ERM (empirical risk minimisation) only attacks the first. Regularisation attacks both: it minimises $E_{\text{in}}$ subject to a constraint on hypothesis complexity, or equivalently, minimises $E_{\text{in}}(h) + \frac{\lambda}{N}\Omega(h)$, where $\Omega(h)$ measures the complexity of an individual hypothesis (a proxy for the set that the algorithm actually selects from).
The Constrained View
The regularisation idea, slowly introduced:
Hard constraint. Force certain weights to zero: minimise $E_{\text{in}}(w)$ subject to $w_q = 0$ for $q \ge 3$. This is equivalent to using $\mathcal{H}_2$ instead of $\mathcal{H}_{10}$ — a discrete jump in capacity.
Looser constraint. Force at least 8 of the weights $w_0, \dots, w_{10}$ to zero — but let the algorithm decide which ones. Now $\mathcal{H}_2' = \{ w : \sum_q [\![ w_q \neq 0 ]\!] \le 3 \}$. More expressive than $\mathcal{H}_2$ (you can choose which dimensions matter), less risky than $\mathcal{H}_{10}$ (still constrained). Mathematically a sparsity constraint.
Soft constraint. The combinatorial sparsity constraint is hard to optimise. Replace it with a continuous proxy:
$$\mathcal{H}(C) = \{ w : w^\top w \le C \}$$
The hypothesis set $\mathcal{H}(C)$ is a closed ball of radius $\sqrt{C}$. As $C$ ranges from $0$ to $\infty$, $\mathcal{H}(C)$ smoothly interpolates between the zero hypothesis and the full $\mathcal{H}_{10}$. The training problem becomes
$$\min_w E_{\text{in}}(w) \quad \text{s.t.} \quad w^\top w \le C.$$
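One way to attack this constrained form directly is projected gradient descent: take an ordinary gradient step on $E_{\text{in}}$, then rescale $w$ back onto the ball whenever it leaves. A minimal numpy sketch, with synthetic `Z`, `y` and an illustrative budget `C` (all placeholder values, not from the lecture):

```python
import numpy as np

# Projected gradient descent for:  min_w E_in(w)  s.t.  w^T w <= C
# Z, y, C, eta are illustrative placeholders.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))            # feature matrix (N=50, d=10)
y = rng.normal(size=50)                  # targets
C, eta = 1.0, 0.01                       # weight budget and step size

def grad_Ein(w):
    # gradient of E_in(w) = (1/N) * ||Zw - y||^2
    return (2.0 / len(y)) * Z.T @ (Z @ w - y)

w = np.zeros(Z.shape[1])
for _ in range(5000):
    w -= eta * grad_Ein(w)               # unconstrained gradient step
    if w @ w > C:                        # left the ball? project back:
        w *= np.sqrt(C / (w @ w))        # rescale onto the surface w^T w = C

print(w @ w)                             # <= C: constraint respected
```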
From Constraint to Lagrangian
Constrained optimisation is harder than unconstrained. The Lagrangian turns it into the latter.
Geometrically: the optimal $w_{\text{reg}}$ either sits inside the ball ($w_{\text{reg}}^\top w_{\text{reg}} < C$, the constraint is inactive) or on the surface ($w_{\text{reg}}^\top w_{\text{reg}} = C$, active). When active, two facts:
- $\nabla E_{\text{in}}(w_{\text{reg}})$ is perpendicular to the surface (otherwise we could slide along the surface and decrease $E_{\text{in}}$ without violating the constraint).
- The normal to the surface at $w_{\text{reg}}$ is the vector $w_{\text{reg}}$ itself.
Combining: $\nabla E_{\text{in}}(w_{\text{reg}})$ is parallel to $w_{\text{reg}}$, i.e.,
$$\nabla E_{\text{in}}(w_{\text{reg}}) = -\frac{2\lambda}{N}\, w_{\text{reg}}$$
for some $\lambda \ge 0$. This is exactly the vanishing-gradient condition for the augmented error
$$E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N}\, w^\top w.$$
So solving the constrained problem with budget $C$ is equivalent to unconstrained minimisation of $E_{\text{aug}}$ with some specific $\lambda(C)$. The correspondence is inverse: $C \uparrow \;\leftrightarrow\; \lambda \downarrow$ — bigger budget means lighter penalty.
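A quick numerical check of this correspondence, as a sketch with synthetic data: solve the unconstrained ridge problem for increasing $\lambda$ and read off the budget $C = w_{\text{reg}}^\top w_{\text{reg}}$ that each $\lambda$ implicitly enforces; it shrinks monotonically.

```python
import numpy as np

# As lambda grows, the norm of the penalised minimiser shrinks:
# each lambda implicitly selects a budget C = ||w_reg||^2.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
y = rng.normal(size=50)

for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
    # unconstrained minimiser of ||Zw - y||^2 + lam * ||w||^2
    w = np.linalg.solve(Z.T @ Z + lam * np.eye(10), Z.T @ y)
    print(f"lambda = {lam:6.1f}   implicit C = {w @ w:.4f}")
```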
TIP — Two-form duality is convenient
The constrained form is interpretable (“keep $w^\top w$ below $C$”); the augmented form is solvable (just gradient-step on $E_{\text{aug}}$). Use the form that fits your purpose. They give the same $w_{\text{reg}}$ for paired $(C, \lambda)$ values.
The Augmented Error
The augmented error is
$$E_{\text{aug}}(h) = E_{\text{in}}(h) + \frac{\lambda}{N}\,\Omega(h).$$
The two components:
- $\Omega(h)$ — the regulariser: a function of the hypothesis itself (e.g., $\Omega(w) = w^\top w$).
- $\lambda$ — the regularisation parameter: how aggressively to apply the brakes.
Heuristic interpretation. $\Omega(h)$ stands in for the hypothesis-set complexity $\Omega(\mathcal{H})$ in the VC bound. We can’t bound $\Omega(\mathcal{H})$ directly during training (it’s a property of the whole set), but $\Omega(h)$ for the chosen $h$ is computable and correlates with what hypothesis-set complexity we’d need to “actually contain” $h$.
Practical interpretation. Minimising $E_{\text{aug}}$ uses a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone. We can’t measure $E_{\text{out}}$, but we can mimic the bound’s two terms (data fit + complexity) and minimise their sum.
Effective VC Dimension
The nominal hypothesis set $\mathcal{H}$ has its full VC dimension $d_{\text{VC}}(\mathcal{H})$ — all weight vectors are considered candidates. But after regularisation, the algorithm $\mathcal{A}$ only navigates within $\mathcal{H}(C)$ — the ball of radius $\sqrt{C}$ — so the effective VC dimension
$$d_{\text{eff}}(\mathcal{H}, \mathcal{A}) = d_{\text{VC}}(\mathcal{H}(C))$$
is smaller. This is the formal sense in which regularised algorithms generalise: they implicitly select from a smaller hypothesis set than the nominal $\mathcal{H}$, even though the nominal $\mathcal{H}$ remains expressive enough to capture complex targets when regularisation is loose.
The slogan: $d_{\text{VC}}(\mathcal{H})$ large, while $d_{\text{eff}}(\mathcal{H}, \mathcal{A})$ small if $\mathcal{A}$ is regularised.
L2: Weight Decay
The most common regulariser is
$$\Omega(w) = w^\top w = \sum_q w_q^2.$$
For linear regression with this penalty, the augmented error is
$$E_{\text{aug}}(w) = \frac{1}{N}\left( \lVert Zw - y \rVert^2 + \lambda\, w^\top w \right).$$
Setting $\nabla E_{\text{aug}}(w) = 0$:
$$w_{\text{reg}} = (Z^\top Z + \lambda I)^{-1} Z^\top y.$$
Compare to the OLS solution $w_{\text{lin}} = (Z^\top Z)^{-1} Z^\top y$: the only change is $\lambda I$ added inside the inverse. This is ridge regression.
The L2 penalty is called weight decay because, in iterative optimisers, each step subtracts a multiple of $w$ itself — the update $w \leftarrow \left(1 - \tfrac{2\eta\lambda}{N}\right) w - \eta\, \nabla E_{\text{in}}(w)$ makes every weight “decay” toward zero. Larger $\lambda$ means stronger decay, shorter $\lVert w \rVert$, smaller effective $C$.
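To see the decay factor in action, a sketch (synthetic data, illustrative $\eta$ and $\lambda$): run gradient descent on $E_{\text{aug}}$ with the shrink-then-step update and confirm it converges to the closed-form ridge solution.

```python
import numpy as np

# Gradient descent on E_aug: shrink w by (1 - 2*eta*lam/N), then take the
# ordinary E_in step. Converges to the closed-form ridge solution.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
y = rng.normal(size=50)
N, lam, eta = len(y), 1.0, 0.01

w = np.zeros(10)
for _ in range(20_000):
    grad_Ein = (2.0 / N) * Z.T @ (Z @ w - y)
    w = (1 - 2 * eta * lam / N) * w - eta * grad_Ein   # weight-decay step

w_closed = np.linalg.solve(Z.T @ Z + lam * np.eye(10), Z.T @ y)
print(np.allclose(w, w_closed, atol=1e-8))             # True: same minimiser
```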
Choosing
The trade-off is sensitive: too little regularisation lets overfitting through, too much produces underfitting.
The curve of expected $E_{\text{out}}$ vs $\lambda$ is U-shaped: a sharp drop at small $\lambda$ (fixing overfitting), a minimum at some optimal $\lambda^*$, then a slow rise (underfitting). The optimal $\lambda$ depends on:
- Stochastic noise. More noise → larger $\lambda^*$ (more brakes for a bumpier road).
- Deterministic noise (target complexity beyond what $\mathcal{H}$ can capture). More deterministic noise → larger $\lambda^*$.
- Data size $N$. More data → smaller $\lambda^*$ (less need to regularise heavily).
In practice $\lambda^*$ is unknown — neither $\sigma^2$ nor target complexity is observable. The standard procedure is validation: try a grid of $\lambda$ values, hold out part of the training data, and pick the $\lambda$ minimising held-out error. This is Lec 1’s cross-validation half of the “two cures” — covered in detail next week.
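A minimal validation sweep, as a sketch (synthetic data; the split sizes and $\lambda$ grid are arbitrary placeholders):

```python
import numpy as np

# Choose lambda by validation: fit ridge on a train split for each lambda
# on a grid, keep the one with the lowest held-out squared error.
rng = np.random.default_rng(0)
Z = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.0, 0.5]                      # simple synthetic target
y = Z @ w_true + 0.5 * rng.normal(size=100)        # noisy labels

Z_tr, y_tr = Z[:70], y[:70]                        # training split
Z_val, y_val = Z[70:], y[70:]                      # held-out split

def ridge(Z, y, lam):
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

errors = {lam: np.mean((Z_val @ ridge(Z_tr, y_tr, lam) - y_val) ** 2)
          for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]}
best = min(errors, key=errors.get)
print(f"best lambda = {best}, held-out error = {errors[best]:.4f}")
```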
General Regularisers
The L2 norm isn’t the only choice. The general family is
$$\Omega(w) = \sum_q \lvert w_q \rvert^p.$$
| $\Omega(w)$ | Name | Effect |
|---|---|---|
| $\sum_q w_q^2$ | weight decay / ridge | Shrinks weights smoothly toward 0 |
| $\sum_q \lvert w_q \rvert$ | lasso (sparsity) | Drives some weights to exactly 0 |
| $\sum_q \lvert w_q \rvert^p$, $p < 1$ | $L_p$ (non-convex) | Aggressive sparsity, harder to optimise |
L2 has a unique closed-form solution and is differentiable everywhere; L1 is non-differentiable at zero but produces sparse solutions, making it the right choice when you suspect the true model is sparse (most coefficients exactly zero). The $L_p$ regularisers with $p < 1$ are non-convex and rarely used outside specialised contexts.
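The sparsity contrast is easy to see with scikit-learn, assuming it is available (`alpha` plays the role of $\lambda$; the values here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# On a sparse target, lasso (L1) zeroes out irrelevant coefficients;
# ridge (L2) only shrinks them.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:2] = [3.0, -2.0]               # only 2 of 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("ridge exact zeros:", np.sum(ridge.coef_ == 0))   # typically 0
print("lasso exact zeros:", np.sum(lasso.coef_ == 0))   # typically ~8
```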
The Bayesian view: each regulariser corresponds to a prior on $w$. L2 ↔ Gaussian, L1 ↔ Laplace, $L_p$ with $p < 1$ ↔ super-Gaussian (heavy tails and sharp peaks).
Practical Tips
The lecture’s “tricks and tips”:
- Try regularisation by default. It rarely hurts if $\lambda$ is reasonable. Sweep a grid and let validation pick.
- Higher noise → more regularisation. Heuristic argument: noise is “high frequency”, complex targets are also high frequency, so a low-frequency-favouring regulariser helps.
- Modern deep learning is full of regularisation. L2 weight decay, dropout, batch norm, data augmentation, early stopping — all variants of the same idea. Different $\Omega$’s, different $\lambda$’s. (A sketch follows below.)
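For instance (a PyTorch sketch; the architecture and hyperparameters are arbitrary), the same $\lambda$ knob shows up as the `weight_decay` argument of the optimiser, and dropout is just another regulariser layered into the model:

```python
import torch

# L2 weight decay and dropout in a standard PyTorch setup.
model = torch.nn.Sequential(
    torch.nn.Linear(10, 32),
    torch.nn.ReLU(),
    torch.nn.Dropout(p=0.5),        # dropout regulariser
    torch.nn.Linear(32, 1),
)
# weight_decay is the L2 penalty coefficient applied inside each update
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```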
Related
- ridge-regression — L2 regularisation specifically applied to linear regression.
- lasso-regression — L1 regularisation; sparse solutions.
- bayesian-linear-regression — regularisation as MAP estimation under a prior.
- generalization-bound — the VC bound that motivates regularisation.
- overfitting — the disease that regularisation cures.
- lagrangian — the constrained-to-unconstrained translation underlying $E_{\text{aug}}$.
Active Recall
Show that the constrained optimisation $\min_w E_{\text{in}}(w)$ s.t. $w^\top w \le C$ is equivalent to the unconstrained $\min_w E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$ for some $\lambda \ge 0$. Explain the geometric intuition.
At the constrained optimum $w_{\text{reg}}$ (assuming the constraint is active, i.e. $w_{\text{reg}}$ lies on the surface $w^\top w = C$), $\nabla E_{\text{in}}(w_{\text{reg}})$ must be parallel to the outward normal of the surface — otherwise we could move along the surface to decrease $E_{\text{in}}$. The outward normal at $w_{\text{reg}}$ is $w_{\text{reg}}$ itself, so $\nabla E_{\text{in}}(w_{\text{reg}}) \parallel w_{\text{reg}}$, i.e. $\nabla E_{\text{in}}(w_{\text{reg}}) = -\frac{2\lambda}{N} w_{\text{reg}}$ for some $\lambda \ge 0$. This is exactly the stationarity condition for $E_{\text{aug}}(w) = E_{\text{in}}(w) + \frac{\lambda}{N} w^\top w$. The correspondence is monotone: larger $C$ (looser constraint) ↔ smaller $\lambda$ (lighter penalty).
Why is the augmented error a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone?
The VC bound has two terms: $E_{\text{in}}$ (fit) and $\Omega(\mathcal{H})$ (set complexity). Pure ERM minimises only the first; $E_{\text{out}}$ depends on both. $E_{\text{aug}}$ adds a term proxying the complexity penalty, so minimising it minimises a closer approximation to the upper bound on $E_{\text{out}}$. The proxy is heuristic — $\Omega(h)$ is computed for the chosen weights, not for the whole set — but for “well-chosen” regularisers (e.g., $\Omega(w) = w^\top w$) it correlates with the hypothesis-set complexity that the algorithm effectively uses.
Compute the closed-form regularised solution for L2 linear regression. Why does adding $\lambda I$ to $Z^\top Z$ also fix numerical conditioning issues?
Set $\nabla E_{\text{aug}}(w) = \frac{2}{N}\left( Z^\top (Zw - y) + \lambda w \right) = 0$. Multiplying through and rearranging: $(Z^\top Z + \lambda I) w = Z^\top y$, so $w_{\text{reg}} = (Z^\top Z + \lambda I)^{-1} Z^\top y$. Adding $\lambda I$ shifts every eigenvalue of $Z^\top Z$ up by $\lambda$, so a near-zero eigenvalue $\mu \approx 0$ (the source of ill-conditioning) becomes $\mu + \lambda$, well away from zero. The condition number — ratio of largest to smallest eigenvalue — improves, and the inverse is numerically stable even when $Z^\top Z$ alone would be singular or near-singular.
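A quick numpy illustration of the conditioning claim (synthetic data with deliberately near-collinear columns):

```python
import numpy as np

# Adding lambda*I lifts every eigenvalue of Z^T Z by lambda,
# collapsing the condition number of a near-singular matrix.
rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 10))
Z[:, 9] = Z[:, 8] + 1e-6 * rng.normal(size=50)   # two nearly identical columns

A = Z.T @ Z
for lam in [0.0, 1e-3, 1.0]:
    cond = np.linalg.cond(A + lam * np.eye(10))
    print(f"lambda = {lam:6g}   cond = {cond:.3e}")
```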
What is the effective VC dimension of an L2-regularised hypothesis set, and why is it usually smaller than the nominal $d_{\text{VC}}$?
The nominal $d_{\text{VC}}(\mathcal{H})$ counts all weight vectors as candidates. With a regulariser, the algorithm only converges to weights inside the ball $w^\top w \le C$ — a strictly smaller set, with smaller VC dimension. The effective VC dimension is $d_{\text{eff}}(\mathcal{H}, \mathcal{A}) = d_{\text{VC}}(\mathcal{H}(C))$, where $C$ is the implicit budget set by $\lambda$. So even though the nominal model class is highly expressive, the regularised algorithm effectively selects from a smaller class — explaining why the regularised generalisation gap is smaller than a naive analysis would predict.
What's the practical procedure for choosing $\lambda$, and why can't you read it off the data directly?
$\lambda^*$ depends on $\sigma^2$ (stochastic noise), target complexity (deterministic noise), and $N$ — none of which is directly observable. The standard procedure is validation: try a grid of $\lambda$ values, train on a portion of the training data, evaluate held-out error on the remainder, and pick the $\lambda$ minimising mean held-out error (often via $k$-fold cross-validation). The held-out error is an unbiased estimate of $E_{\text{out}}$, so the chosen $\lambda$ is the one that empirically generalises best — without requiring knowledge of the noise levels.