Augmenting the training objective with a complexity penalty $\Omega(h)$, scaled by $\lambda$, to bias the optimiser towards simpler hypotheses. The constrained form (minimise $E_{\text{in}}(h)$ s.t. $\Omega(h) \le C$) and the unconstrained form (minimise $E_{\text{in}}(h) + \frac{\lambda}{N}\Omega(h)$) are Lagrangian-equivalent. The augmented objective $E_{\text{aug}}$ is a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone, with $\Omega(h)$ standing in for the model-complexity penalty in the VC bound.

The Motivation

The VC bound says

$$E_{\text{out}}(h) \;\le\; E_{\text{in}}(h) + \Omega(N, \mathcal{H}, \delta),$$

where $\Omega(N, \mathcal{H}, \delta)$ is the complexity of the hypothesis set. Two ways to keep $E_{\text{out}}$ small:

  1. Make $E_{\text{in}}$ small — fit the training data.
  2. Make $\Omega$ small — use a simpler hypothesis set.

Pure ERM (empirical risk minimisation) only attacks the first. Regularisation attacks both: it minimises $E_{\text{in}}$ subject to a constraint on hypothesis complexity, or equivalently, minimises $E_{\text{in}}(h) + \frac{\lambda}{N}\Omega(h)$, where $\Omega(h)$ measures the complexity of an individual hypothesis (a proxy for the set that the algorithm actually selects from).

The Constrained View

The regularisation idea, slowly introduced:

Hard constraint. Force certain weights to zero: minimise $E_{\text{in}}$ subject to $w_q = 0$ for $q > 2$. This is equivalent to using $\mathcal{H}_2$ instead of $\mathcal{H}_{10}$ — a discrete jump in capacity.

Looser constraint. Force at least 8 of $w_0, \dots, w_{10}$ to zero — but let the algorithm decide which ones. Now $\mathcal{H}(C) = \{h : \sum_q [\![w_q \neq 0]\!] \le 3\}$. More expressive than $\mathcal{H}_2$ (you can choose which dimensions matter), less risky than $\mathcal{H}_{10}$ (still constrained). Mathematically a sparsity constraint.

Soft constraint. The combinatorial sparsity constraint is hard to optimise. Replace it with a continuous proxy:

$$\mathcal{H}(C) = \Big\{ h : \sum_{q=0}^{10} w_q^2 \le C \Big\}.$$

The hypothesis set is a closed ball of radius $\sqrt{C}$. As $C$ ranges from $0$ to $\infty$, $\mathcal{H}(C)$ smoothly interpolates between $\mathcal{H}_0$ and $\mathcal{H}_{10}$. The training problem becomes

$$\min_{\mathbf{w}} \; E_{\text{in}}(\mathbf{w}) \quad \text{s.t.} \quad \mathbf{w}^\mathsf{T}\mathbf{w} \le C.$$
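As a concrete sketch of the constrained problem, projected gradient descent alternates a plain gradient step on $E_{\text{in}}$ with a projection back onto the ball $\mathbf{w}^\mathsf{T}\mathbf{w} \le C$. The data, step size, and all names below are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(50, 5))                      # synthetic feature matrix
y = Z @ np.array([3.0, -2.0, 1.0, 0.5, -1.5]) + 0.1 * rng.normal(size=50)

C, eta = 1.0, 0.01                                # budget and step size (illustrative)

w = np.zeros(5)
for _ in range(5000):
    grad = 2 * Z.T @ (Z @ w - y) / len(y)         # gradient of E_in (mean squared error)
    w = w - eta * grad                            # plain gradient step
    if w @ w > C:                                 # project back onto the ball ||w||^2 <= C
        w *= np.sqrt(C / (w @ w))

assert w @ w <= C + 1e-9                          # constraint holds at the solution
```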

From Constraint to Lagrangian

Constrained optimisation is harder than unconstrained. The Lagrangian turns it into the latter.

Geometrically: the optimal $\mathbf{w}$ either sits inside the ball ($\mathbf{w}^\mathsf{T}\mathbf{w} < C$, the constraint is inactive) or on the surface ($\mathbf{w}^\mathsf{T}\mathbf{w} = C$, active). When active, two facts:

  1. $\nabla E_{\text{in}}$ is perpendicular to the surface (otherwise we could slide along the surface and decrease $E_{\text{in}}$ without violating the constraint).
  2. The normal to the surface at $\mathbf{w}$ is the vector $\mathbf{w}$ itself.

Combining: $\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}})$ is parallel to $-\mathbf{w}_{\text{reg}}$, i.e.,

$$\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) = -\frac{2\lambda}{N}\,\mathbf{w}_{\text{reg}}$$

for some $\lambda > 0$. This is exactly the stationarity condition $\nabla E_{\text{aug}} = 0$ for the augmented error

$$E_{\text{aug}}(\mathbf{w}) = E_{\text{in}}(\mathbf{w}) + \frac{\lambda}{N}\,\mathbf{w}^\mathsf{T}\mathbf{w}.$$

So solving the constrained problem with budget $C$ is equivalent to unconstrained minimisation of $E_{\text{aug}}$ with some specific $\lambda$. The correspondence: $C \uparrow \;\Leftrightarrow\; \lambda \downarrow$ — bigger budget means lighter penalty.

TIP — Two-form duality is convenient

The constrained form is interpretable (“keep $\mathbf{w}^\mathsf{T}\mathbf{w}$ below $C$”); the augmented form is solvable (just gradient-step on $E_{\text{aug}}$). Use the form that fits your purpose. They give the same $\mathbf{w}_{\text{reg}}$ for paired $(C, \lambda)$ values.
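The correspondence can be checked numerically: solve the augmented problem for a given $\lambda$, verify the boundary condition $\nabla E_{\text{in}}(\mathbf{w}_{\text{reg}}) = -\frac{2\lambda}{N}\mathbf{w}_{\text{reg}}$, and read off the paired budget as $C = \mathbf{w}_{\text{reg}}^\mathsf{T}\mathbf{w}_{\text{reg}}$. A minimal sketch with synthetic data (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
Z = rng.normal(size=(N, 4))                        # synthetic data
y = Z @ np.array([2.0, -1.0, 0.5, 3.0]) + 0.1 * rng.normal(size=N)

lam = 5.0
# Ridge closed form for E_aug = ||Zw - y||^2 / N + (lam/N) w.w:
w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(4), Z.T @ y)

# Stationarity: grad E_in(w_reg) is parallel to -w_reg with factor 2*lam/N.
grad_Ein = 2 * Z.T @ (Z @ w_reg - y) / N
assert np.allclose(grad_Ein, -2 * (lam / N) * w_reg)

C = w_reg @ w_reg   # the paired budget: the same w_reg solves the constrained problem
```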

The Augmented Error

The two components:

  • $\Omega(h)$ — the regulariser: a function of the hypothesis itself (e.g., $\mathbf{w}^\mathsf{T}\mathbf{w}$).
  • $\lambda$ — the regularisation parameter: how aggressively to apply the brakes.

Heuristic interpretation. $\Omega(h)$ stands in for the hypothesis-set complexity $\Omega(\mathcal{H})$ in the VC bound. We can’t bound $\Omega(\mathcal{H})$ directly during training (it’s a property of the whole set), but $\Omega(h)$ for the chosen $h$ is computable and correlates with the hypothesis-set complexity we’d need to “actually contain” $h$.

Practical interpretation. Minimising $E_{\text{aug}}$ uses a better proxy for $E_{\text{out}}$ than $E_{\text{in}}$ alone. We can’t measure $E_{\text{out}}$, but we can simulate the bound’s two terms (data fit + complexity) and minimise their sum.

Effective VC Dimension

In the nominal hypothesis set $\mathcal{H}$, all weight vectors are candidates. But after regularisation, the algorithm only navigates within $\mathcal{H}(C)$ — the ball of radius $\sqrt{C}$ — so the effective VC dimension

$$d_{\text{eff}} = d_{\text{VC}}(\mathcal{H}(C)) \;\le\; d_{\text{VC}}(\mathcal{H})$$

is smaller. This is the formal sense in which regularised algorithms generalise: they implicitly select from a smaller hypothesis set than the nominal $\mathcal{H}$, even though the nominal $\mathcal{H}$ remains expressive enough to capture complex targets when regularisation is loose.

The slogan: $d_{\text{VC}}(\mathcal{H})$ large, while $d_{\text{eff}}$ small if the learning algorithm is regularised.

L2: Weight Decay

The most common regulariser is

$$\Omega(\mathbf{w}) = \mathbf{w}^\mathsf{T}\mathbf{w} = \sum_q w_q^2.$$

For linear regression with this penalty, the augmented error is

$$E_{\text{aug}}(\mathbf{w}) = \frac{1}{N}\Big(\lVert Z\mathbf{w} - \mathbf{y}\rVert^2 + \lambda\,\mathbf{w}^\mathsf{T}\mathbf{w}\Big).$$

Setting $\nabla E_{\text{aug}}(\mathbf{w}) = 0$:

$$\mathbf{w}_{\text{reg}} = (Z^\mathsf{T} Z + \lambda I)^{-1} Z^\mathsf{T}\mathbf{y}.$$

Compare to the OLS solution $\mathbf{w}_{\text{lin}} = (Z^\mathsf{T} Z)^{-1} Z^\mathsf{T}\mathbf{y}$: the only change is $\lambda I$ added inside the inverse. This is ridge regression.
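A quick sanity check of the closed form on synthetic data (all names illustrative): ridge with $\lambda = 0$ recovers OLS, and the weight norm shrinks monotonically as $\lambda$ grows.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(40, 3))                       # synthetic data
y = Z @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.normal(size=40)

def ridge(lam):
    # w_reg = (Z'Z + lam*I)^{-1} Z'y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ y)

w_lin = np.linalg.solve(Z.T @ Z, Z.T @ y)          # OLS
assert np.allclose(ridge(0.0), w_lin)              # lam = 0 recovers OLS

norms = [np.linalg.norm(ridge(lam)) for lam in (0.0, 1.0, 10.0, 100.0)]
assert norms == sorted(norms, reverse=True)        # heavier penalty, shorter w
```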

The L2 penalty is called weight decay because, in iterative optimisers, each step subtracts a multiple of $\mathbf{w}$ — every weight “decays” toward zero. Larger $\lambda$ means stronger decay, a shorter $\mathbf{w}$, a smaller effective $C$.
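The decay view can be verified directly: run gradient descent with the shrink-then-step update $\mathbf{w} \leftarrow (1 - \tfrac{2\eta\lambda}{N})\mathbf{w} - \eta\nabla E_{\text{in}}(\mathbf{w})$ and compare against the closed-form ridge solution. Synthetic data; step size and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 60
Z = rng.normal(size=(N, 3))                        # synthetic data
y = Z @ np.array([1.5, -0.5, 2.0]) + 0.1 * rng.normal(size=N)

lam, eta = 2.0, 0.05                               # illustrative values
w = np.zeros(3)
for _ in range(20000):
    grad_Ein = 2 * Z.T @ (Z @ w - y) / N
    w = (1 - 2 * eta * lam / N) * w - eta * grad_Ein   # decay, then gradient step

w_closed = np.linalg.solve(Z.T @ Z + lam * np.eye(3), Z.T @ y)
assert np.allclose(w, w_closed, atol=1e-6)         # converges to the ridge solution
```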

Choosing $\lambda$

The trade-off is sensitive: too little regularisation lets overfitting through, too much produces underfitting.

The shape of expected $E_{\text{out}}$ vs $\lambda$ is U-shaped: a sharp drop at small $\lambda$ (fixing overfitting), a minimum at some optimal $\lambda^*$, then a slow rise (underfitting). The optimal $\lambda^*$ depends on:

  • Stochastic noise. More noise → larger $\lambda^*$ (more brakes for a bumpier road).
  • Deterministic noise (target complexity beyond $\mathcal{H}$). More → larger $\lambda^*$.
  • Data size $N$. More data → smaller $\lambda^*$ (less need to regularise heavily).

In practice $\lambda^*$ is unknown — neither the noise level nor the target complexity is observable. The standard procedure is validation: try a grid of $\lambda$ values, hold out part of the training data, and pick the $\lambda$ minimising held-out error. This is Lec 1’s cross-validation half of the “two cures” — covered in detail next week.
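A minimal sketch of the validation recipe, with a synthetic train/held-out split and an illustrative grid (not the lecture's code):

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 80, 10
Z = rng.normal(size=(N, d))                        # synthetic data
y = Z @ rng.normal(size=d) + 0.5 * rng.normal(size=N)

Z_tr, y_tr = Z[:60], y[:60]                        # training split
Z_val, y_val = Z[60:], y[60:]                      # held-out split

def ridge(lam):
    return np.linalg.solve(Z_tr.T @ Z_tr + lam * np.eye(d), Z_tr.T @ y_tr)

grid = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]          # illustrative grid of lambdas
val_err = {lam: np.mean((Z_val @ ridge(lam) - y_val) ** 2) for lam in grid}
best_lam = min(val_err, key=val_err.get)           # pick the held-out minimiser
```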

General Regularisers

The L2 norm isn’t the only choice. The general family is

$$\Omega(\mathbf{w}) = \sum_q \lvert w_q \rvert^p.$$
  p      Name                   Effect
  2      weight decay / ridge   Shrinks weights smoothly toward 0
  1      lasso (sparsity)       Drives some weights to exactly 0
  < 1    (non-convex)           Aggressive sparsity, harder to optimise

L2 has a unique closed-form solution and is differentiable everywhere; L1 is non-differentiable at zero but produces sparse solutions, making it the right choice when you suspect the true model is sparse (most coefficients exactly zero). The $p < 1$ regularisers are non-convex and rarely used outside specialised contexts.
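To see the sparsity difference concretely, one can minimise the L1-penalised objective with proximal gradient descent (ISTA), whose soft-threshold step sets small coordinates to exactly zero; ridge on the same data shrinks every coordinate but zeroes none. A sketch on synthetic sparse data (ISTA is my choice of solver here, not the lecture's):

```python
import numpy as np

rng = np.random.default_rng(5)
N, d = 100, 8
Z = rng.normal(size=(N, d))                        # synthetic data
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0, 1.5, 0.0])   # sparse target
y = Z @ w_true + 0.1 * rng.normal(size=N)

lam, eta = 0.1, 0.01                               # illustrative values
w = np.zeros(d)
for _ in range(5000):
    grad = 2 * Z.T @ (Z @ w - y) / N               # gradient of the smooth part
    v = w - eta * grad
    w = np.sign(v) * np.maximum(np.abs(v) - eta * lam, 0.0)   # soft threshold

assert np.sum(w == 0.0) > 0                        # L1 produces exact zeros
w_ridge = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ y)
assert np.sum(w_ridge == 0.0) == 0                 # L2 only shrinks
```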

The Bayesian view: each regulariser corresponds to a prior on $\mathbf{w}$. L2 ↔ Gaussian, L1 ↔ Laplace, $p < 1$ ↔ super-Gaussian (heavy tails and sharp peaks).

Practical Tips

The lecture’s “tricks and tips”:

  • Try regularisation by default. It rarely hurts if $\lambda$ is reasonable. Sweep a grid and let validation pick.
  • Higher noise → more regularisation. Heuristic argument: noise is “high frequency”, complex targets are also high frequency, so a low-frequency-favouring regulariser helps.
  • Modern deep learning is full of regularisation. L2 weight decay, dropout, batch norm, data augmentation, early stopping — all variants of the same idea. Different $\Omega$’s, different $\lambda$’s.

Active Recall