The Least Absolute Shrinkage and Selection Operator — linear regression with an L1 penalty on the weights. Unlike ridge, lasso drives some coefficients to exactly zero, producing sparse solutions. Useful when the underlying model is sparse (most features irrelevant) and as an automatic feature-selection tool.

The Objective

Written in matrix form:

$$\hat{w}_{\text{lasso}} = \arg\min_{w}\; \lVert y - Xw \rVert_2^2 + \lambda \lVert w \rVert_1$$

Compared to ridge, the only difference is the regulariser: $\lambda \lVert w \rVert_1$ instead of $\lambda \lVert w \rVert_2^2$. Equivalent constrained form: minimise $\lVert y - Xw \rVert_2^2$ subject to $\lVert w \rVert_1 \le t$.
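
As a minimal numerical sketch of the objective (the variable names and the exact scaling of the squared-error term are assumptions; some libraries divide that term by $2n$):

```python
import numpy as np

def lasso_objective(w, X, y, lam):
    """Squared error plus an L1 penalty on the weights."""
    residual = y - X @ w
    return residual @ residual + lam * np.abs(w).sum()
```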

Why L1 Produces Sparsity

The constraint sets have qualitatively different shapes:

  • L2 ball $\{w : \lVert w \rVert_2 \le t\}$: a sphere — smooth, no corners.
  • L1 ball $\{w : \lVert w \rVert_1 \le t\}$: a “diamond” (cross-polytope) — corners lie exactly on the coordinate axes.

The optimisation finds the point in the constraint set closest (in contour terms) to the unregularised optimum $\hat{w}_{\text{OLS}}$. Geometrically, the contours of $\lVert y - Xw \rVert_2^2$ — ellipses around $\hat{w}_{\text{OLS}}$ — first touch the constraint set at:

  • For L2: typically a generic point on the sphere — every coordinate non-zero.
  • For L1: typically a corner of the diamond — some coordinates exactly zero.

The corners of the L1 ball are sparse vectors (axis-aligned points), and the geometry makes the optimum land on them generically. This is the source of lasso’s sparsity: the corners of the L1 constraint are the sparse solutions, and they’re geometrically attractive to the optimum.
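
The same effect can be seen algebraically. In the special case of an orthonormal design (an assumption for illustration, not the general case), each lasso coefficient is the least-squares coefficient passed through the soft-thresholding operator, which sets anything below the threshold to exactly zero:

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink z toward zero by lam; set it to exactly zero if |z| <= lam."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

# Small coefficients become exactly zero; larger ones are shrunk by lam.
z = np.array([-3.0, -0.4, 0.1, 0.9, 2.5])
print(soft_threshold(z, lam=0.5))   # [-2.5 -0.   0.   0.4  2. ]
```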

Properties

| Property | L2 (ridge) | L1 (lasso) |
| --- | --- | --- |
| Convex | ✓ | ✓ |
| Differentiable everywhere | ✓ | ✗ (kink at $w_j = 0$) |
| Closed-form solution | ✓ | ✗ (no closed form) |
| Solver | Linear algebra | Quadratic programming, coordinate descent, ISTA, FISTA |
| Sparse solution | ✗ (weights small but non-zero) | ✓ (some weights exactly zero) |
| Stable under correlated features | ✓ | ✗ (picks one of a correlated group) |

The non-differentiability of $\lvert w_j \rvert$ at zero is the source of both lasso’s sparsity (the gradient has a “discontinuity at zero” that pins weights there) and its computational difficulty (no nice closed form).
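
A minimal ISTA sketch, assuming the $\tfrac{1}{2}\lVert y - Xw \rVert_2^2 + \lambda\lVert w \rVert_1$ scaling (the function and variable names are hypothetical; in practice a library’s coordinate-descent solver would be used instead):

```python
import numpy as np

def ista(X, y, lam, n_iters=500):
    """Iterative soft-thresholding: a gradient step on the smooth squared-error
    term, then a soft-threshold step that handles the non-smooth L1 penalty."""
    step = 1.0 / np.linalg.norm(X, ord=2) ** 2   # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = X.T @ (X @ w - y)                 # gradient of 0.5 * ||y - Xw||^2
        z = w - step * grad
        w = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # prox of step*lam*|w|
    return w
```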

When Lasso Wins

Lasso is the right choice when the true model is sparse: most features are irrelevant, and you want the algorithm to select which ones matter.

Example. You fit a 50th-order polynomial when the true target is 3rd-order. Lasso recovers the three relevant coefficients and sets the other 47 to zero. Ridge keeps all 50 small but non-zero, hiding the structural sparsity in the noise.

When the true model is dense (most features matter, but with small effects), ridge typically wins: lasso’s pickier selection discards information that ridge retains.

The lecture’s concrete demonstration: data generated from a 20-dimensional ground-truth model with 5 non-zero coefficients, fit with a 20-degree polynomial on a small training set. Vanilla regression gives wildly oscillating coefficients (overfitting). Ridge dampens them all but keeps every coefficient non-zero. Lasso correctly identifies the 5 non-zero coefficients and zeroes the rest.
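
A sketch in the same spirit as that demonstration (the data-generating details, sample size, and regularisation strengths are illustrative assumptions, not the lecture’s, and random features stand in for the polynomial basis):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
n, d = 40, 20
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]      # sparse ground truth: 5 of 20 non-zero

X = rng.normal(size=(n, d))
y = X @ w_true + 0.1 * rng.normal(size=n)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.1))]:
    coef = model.fit(X, y).coef_
    print(f"{name:5s} non-zero coefficients: {np.sum(np.abs(coef) > 1e-6)}")
```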

Bayesian Interpretation

Where ridge corresponds to a Gaussian prior on $w$, lasso corresponds to a Laplace (double-exponential) prior:

$$p(w_j) = \frac{1}{2b}\exp\!\left(-\frac{\lvert w_j \rvert}{b}\right)$$

The Laplace distribution has heavier tails and a sharper peak at zero than a Gaussian of comparable variance. The sharper peak is what manifests as sparsity in MAP estimation — the prior prefers exactly-zero weights over slightly-non-zero ones, whereas the Gaussian prefers slightly-non-zero weights over exactly zero.
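
Concretely, taking the negative log-posterior with a Gaussian likelihood and a Laplace prior (the scale parameters below are generic, not from the lecture) recovers the lasso objective:

```latex
% Gaussian likelihood (variance sigma^2) + Laplace prior (scale b)
% => L1-penalised least squares
\hat{w}_{\text{MAP}}
  = \arg\max_{w}\; \log p(y \mid X, w) + \log p(w)
  = \arg\min_{w}\; \frac{1}{2\sigma^{2}} \lVert y - Xw \rVert_2^{2}
      + \frac{1}{b} \lVert w \rVert_1,
\qquad \text{i.e. } \lambda = \frac{2\sigma^{2}}{b} \text{ under the scaling used above.}
```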

| Regulariser | Prior |
| --- | --- |
| L2 (ridge) | Gaussian — smooth shrinkage |
| L1 (lasso) | Laplace — sparsity |
| L1 + L2 (elastic net) | Mixture — sparsity with stability under correlation |

Choosing $\lambda$

Same procedure as ridge: cross-validate on a grid of $\lambda$ values and pick the one minimising held-out error. Commonly, the lasso path (the trajectory of $\hat{w}(\lambda)$ as $\lambda$ varies) is computed once and the cross-validation pick is read off — algorithms like LARS produce the entire path efficiently.

A practical rule: pick the largest $\lambda$ within one standard error of the minimum cross-validation error. The “1-SE rule” prefers slightly more regularisation than the strict optimum — paying a tiny statistical cost for a more parsimonious, more interpretable model.
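
A sketch of that procedure with scikit-learn (the synthetic data is an assumption; the 1-SE selection is not built into LassoCV, so it is computed by hand from mse_path_):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = X[:, :5] @ np.array([3.0, -2.0, 1.5, -1.0, 0.5]) + 0.1 * rng.normal(size=100)

# Cross-validate over a grid of penalties (scikit-learn calls lambda "alpha");
# standardise first because the L1 penalty is scale-sensitive.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(X, y)
lasso = pipe.named_steps["lassocv"]

mse = lasso.mse_path_.mean(axis=1)                              # mean CV error per alpha
se = lasso.mse_path_.std(axis=1) / np.sqrt(lasso.mse_path_.shape[1])
i_min = mse.argmin()
alpha_1se = lasso.alphas_[mse <= mse[i_min] + se[i_min]].max()  # 1-SE rule
print("alpha_min:", lasso.alpha_, "alpha_1se:", alpha_1se)
```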

What Could Go Wrong

  • Correlated features. Lasso picks roughly one feature per correlated group and zeros the rest, even if all are equally relevant. The choice between correlated features is unstable across training sets. Elastic net (L1 + L2) addresses this by reintroducing some L2 stability; see the sketch after this list.
  • Dense ground truth. When most coefficients are non-zero, lasso’s aggressive zeroing throws away signal. Ridge is the better default.
  • Standardisation. Same as ridge: L1 penalises raw weight magnitudes, so input scaling matters. Standardise features first.
  • Selection-induced bias. Selected features have biased coefficient estimates (they were chosen because the data made them look strong). Post-selection inference requires care.
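
A toy illustration of the correlated-features point (the data and penalty settings are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
# Two nearly identical features, both genuinely relevant.
X = np.column_stack([x, x + 0.01 * rng.normal(size=n)])
y = X.sum(axis=1) + 0.1 * rng.normal(size=n)

print("lasso:      ", Lasso(alpha=0.1).fit(X, y).coef_)                     # tends to load on one feature
print("elastic net:", ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_)  # spreads weight more evenly
```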

Active Recall