The Least Absolute Shrinkage and Selection Operator — linear regression with an L1 penalty on the weights. Unlike ridge, lasso drives some coefficients to exactly zero, producing sparse solutions. Useful when the underlying model is sparse (most features irrelevant) and as an automatic feature-selection tool.
The Objective
Written in matrix form:

$$\hat{w}_{\text{lasso}} = \arg\min_{w} \; \|y - Xw\|_2^2 + \lambda \|w\|_1$$
Compared to ridge, the only difference is the regulariser: $\lambda \|w\|_1$ instead of $\lambda \|w\|_2^2$. Equivalent constrained form: $\min_w \|y - Xw\|_2^2$ subject to $\|w\|_1 \le t$.
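A minimal sketch of this objective in code, using scikit-learn's `Lasso` on synthetic, illustrative data (note that sklearn scales the squared-error term by $1/(2n)$, so its `alpha` corresponds to $\lambda$ only up to that factor):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 0.5]             # sparse ground truth
y = X @ w_true + 0.1 * rng.normal(size=n)

# sklearn minimises (1/(2n))*||y - Xw||^2 + alpha*||w||_1,
# i.e. the objective above with lambda = 2n*alpha.
model = Lasso(alpha=0.1).fit(X, y)
print(model.coef_)                        # several entries are exactly 0.0
```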
Why L1 Produces Sparsity
The constraint sets have qualitatively different shapes:
- L2 ball $\{w : \|w\|_2 \le t\}$: a sphere — smooth, no corners.
- L1 ball $\{w : \|w\|_1 \le t\}$: a “diamond” (cross-polytope) — corners exactly on the coordinate axes, edges in the coordinate hyperplanes.
The optimisation finds the point in the constraint set closest (in contour terms) to the unregularised optimum $\hat{w}_{\text{OLS}}$. Geometrically, the contours of $\|y - Xw\|_2^2$ — ellipses around $\hat{w}_{\text{OLS}}$ — first touch the constraint set at:
- For L2: typically a generic point on the sphere — every coordinate non-zero.
- For L1: typically a corner of the diamond — some coordinates exactly zero.
The corners of the L1 ball are precisely the sparse vectors (axis-aligned points), and the geometry makes the optimum land on them generically rather than as a special case. This is the source of lasso’s sparsity.
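A small numerical illustration of the corner effect (synthetic data, illustrative penalty values): as the penalty grows, lasso snaps the weaker coefficient to exactly zero while ridge merely shrinks it.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 0.2]) + 0.1 * rng.normal(size=200)

# Note: the alpha scales of Lasso and Ridge are not directly comparable
# (sklearn normalises the lasso loss by 1/(2n) but not the ridge loss).
for alpha in [0.01, 0.1, 0.3]:
    w_l1 = Lasso(alpha=alpha).fit(X, y).coef_
    w_l2 = Ridge(alpha=alpha * len(y)).fit(X, y).coef_
    print(f"alpha={alpha}: lasso={np.round(w_l1, 3)}, ridge={np.round(w_l2, 3)}")
# Typically the second lasso coefficient hits exactly 0.0 once alpha grows
# large enough; the ridge coefficients shrink but never reach an exact zero.
```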
Properties
| Property | L2 (ridge) | L1 (lasso) |
|---|---|---|
| Convex | ✓ | ✓ |
| Differentiable everywhere | ✓ | ✗ (kink at $0$) |
| Closed-form solution | ✓ | ✗ (no closed form) |
| Solver | Linear algebra | Quadratic programming, coordinate descent, ISTA, FISTA |
| Sparse solution | ✗ (weights small but non-zero) | ✓ (some weights exactly zero) |
| Stable under correlated features | ✓ | ✗ (picks one of a correlated group) |
The non-differentiability of $\|w\|_1$ at zero is the source of both lasso’s sparsity (the gradient has a “discontinuity at zero” that pins weights there) and its computational difficulty (no nice closed form).
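A minimal sketch of ISTA (proximal gradient descent) for the lasso objective above; the soft-thresholding step is exactly where the kink at zero enters:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrink magnitudes by t, clip at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(X, y, lam, n_iters=1000):
    """Minimise ||y - Xw||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    L = 2 * np.linalg.norm(X, ord=2) ** 2   # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ w - y)        # gradient of the smooth squared-error part
        w = soft_threshold(w - grad / L, lam / L)
    return w
```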
When Lasso Wins
Lasso is the right choice when the true model is sparse: most features are irrelevant, and you want the algorithm to select which ones matter.
Example. You fit a 50th-order polynomial when the true target is 3rd-order. Lasso recovers the three relevant coefficients and sets the other 47 to zero. Ridge keeps all 50 small but non-zero, hiding the structural sparsity in the noise.
When the true model is dense (most features matter, but with small effects), ridge typically wins: lasso’s pickier selection discards information that ridge retains.
The lecture’s concrete demonstration: a 20-dimensional ground-truth model with 5 non-zero coefficients, fit with a 20-degree polynomial on a small training set. Vanilla regression gives wildly oscillating coefficients (overfitting). Ridge dampens them all but keeps every coefficient non-zero. Lasso correctly identifies the 5 non-zero coefficients and zeroes the rest.
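A sketch of this kind of experiment (assumptions: a plain 20-feature linear design rather than the lecture’s polynomial basis, and illustrative values for the sample size, noise level, and penalties):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(42)
n, d = 25, 20                               # few samples relative to dimension
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 1.0]    # 5 non-zero coefficients
y = X @ w_true + 0.5 * rng.normal(size=n)

for name, model in [("ols", LinearRegression()),
                    ("ridge", Ridge(alpha=1.0)),
                    ("lasso", Lasso(alpha=0.2))]:
    w = model.fit(X, y).coef_
    print(f"{name:>5}: {np.sum(np.abs(w) > 1e-8):2d} non-zero of {d}")
# Typically: OLS and ridge keep all 20 non-zero; lasso keeps roughly the 5 true ones.
```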
Bayesian Interpretation
Where ridge corresponds to a Gaussian prior on $w$, lasso corresponds to a Laplace (double-exponential) prior:

$$p(w) \propto e^{-\lambda \|w\|_1}$$
The Laplace distribution has heavier tails and a sharper peak at zero than a Gaussian of comparable variance. The sharper peak is what manifests as sparsity in MAP estimation: the Laplace log-density has non-vanishing slope at zero, so the prior keeps pulling weights exactly to zero, whereas the Gaussian log-density flattens out at zero and merely shrinks weights without pinning them there.
| Regulariser | Prior |
|---|---|
| L2 (ridge) | Gaussian — smooth shrinkage |
| L1 (lasso) | Laplace — sparsity |
| L1 + L2 (elastic net) | Mixture — sparsity with stability under correlation |
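The correspondence can be read directly off the negative log-priors; a minimal numeric sketch (zero-mean, unit-scale priors assumed):

```python
import numpy as np

w = np.linspace(-3.0, 3.0, 7)

# Negative log-densities, dropping w-independent constants.
# MAP estimation adds exactly these terms to the squared-error loss.
neg_log_laplace  = np.abs(w)       # Laplace(0, 1): |w|      -> L1 penalty (lasso)
neg_log_gaussian = 0.5 * w**2      # Normal(0, 1):  w^2 / 2  -> L2 penalty (ridge)

print(np.column_stack([w, neg_log_laplace, neg_log_gaussian]))
# The Laplace term keeps slope 1 all the way into w = 0 (the sharp peak);
# the Gaussian term's slope vanishes at 0, so it stops pulling near zero.
```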
Choosing $\lambda$
Same procedure as ridge: cross-validate on a grid of $\lambda$ values and pick the one minimising held-out error. Commonly, the lasso path (the trajectory of $\hat{w}(\lambda)$ as $\lambda$ varies) is computed once and the cross-validation pick is read off — algorithms like LARS produce the entire path efficiently.
A practical rule: pick the largest $\lambda$ within one standard error of the minimum cross-validation error. The “1-SE rule” prefers slightly more regularisation than strictly optimal — paying a tiny statistical cost for a more parsimonious, more interpretable model.
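A sketch of the 1-SE rule on top of scikit-learn's `LassoCV`, which exposes the per-fold CV errors but does not implement the rule itself (synthetic data):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:5] = [3.0, -2.0, 1.5, -1.0, 0.5]
y = X @ w_true + 0.5 * rng.normal(size=100)

cv = LassoCV(cv=5).fit(X, y)
mean_mse = cv.mse_path_.mean(axis=1)            # CV error per alpha on the path
se = cv.mse_path_.std(axis=1) / np.sqrt(cv.mse_path_.shape[1])
best = np.argmin(mean_mse)
# cv.alphas_ is in decreasing order, so the first index within one SE of
# the minimum is the largest (most regularised) qualifying alpha.
one_se = np.argmax(mean_mse <= mean_mse[best] + se[best])
print("alpha at CV minimum:", cv.alphas_[best])
print("alpha by 1-SE rule :", cv.alphas_[one_se])
```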
What Could Go Wrong
- Correlated features. Lasso picks roughly one feature per correlated group and zeros the rest, even if all are equally relevant. The choice between correlated features is unstable across training sets. Elastic net (L1 + L2) addresses this by reintroducing some L2 stability; see the sketch after this list.
- Dense ground truth. When most coefficients are non-zero, lasso’s aggressive zeroing throws away signal. Ridge is the better default.
- Standardisation. Same as ridge: L1 penalises raw weight magnitudes, so input scaling matters. Standardise features first.
- Selection-induced bias. Selected features have biased coefficient estimates (they were chosen because the data made them look strong). Post-selection inference requires care.
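A sketch of the correlated-features pitfall and the elastic-net fix (two near-duplicate features, illustrative penalty values; the features here are already on a common scale, so standardisation is omitted):

```python
import numpy as np
from sklearn.linear_model import Lasso, ElasticNet

rng = np.random.default_rng(7)
n = 200
z = rng.normal(size=n)
# Two near-duplicate copies of the same underlying signal.
X = np.column_stack([z + 0.01 * rng.normal(size=n),
                     z + 0.01 * rng.normal(size=n)])
y = z + 0.1 * rng.normal(size=n)

print(Lasso(alpha=0.05).fit(X, y).coef_)
# e.g. [~0.95, 0.0]: lasso gives one feature essentially all the weight.
print(ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y).coef_)
# e.g. [~0.46, ~0.46]: the L2 component spreads weight across the group.
```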
Related
- ridge-regression — L2 analogue; smooth shrinkage instead of sparsity.
- regularization — the general framework; lasso is its L1 instance.
- bayesian-linear-regression — Bayesian view with prior $p(w) \propto e^{-\lambda \|w\|_1}$.
- overfitting — the disease lasso treats.
Active Recall
Geometrically, why does L1 regularisation drive some coefficients to exactly zero while L2 regularisation does not?
The L1 constraint set is a diamond (cross-polytope) with corners on the coordinate axes — at sparse vectors. The L2 constraint set is a smooth sphere with no corners. The optimum is the point where the contours of the unregularised loss $\|y - Xw\|_2^2$ first touch the constraint set; for the L1 diamond, this is generically a corner (sparse), and for the L2 sphere, generically a smooth point (dense). The corners of the L1 ball are precisely the sparse vectors, and they’re geometrically attractive — the optimum lands on them by default rather than as a special case.
A 20-dimensional regression problem has a true model with only 5 non-zero coefficients. You fit (i) ridge, (ii) lasso, both with cross-validated $\lambda$. Predict qualitatively how the coefficient profiles will differ.
Ridge: all 20 coefficients shrunk towards zero but none exactly zero. The 5 true coefficients will be larger than the 15 spurious ones, but the spurious ones won’t vanish — they’ll be small non-zero values reflecting the L2 penalty’s smooth shrinkage. Lasso: the 5 true coefficients will be retained (perhaps slightly biased downward), and the 15 spurious ones will be set to exactly zero. Lasso recovers the true sparsity pattern; ridge does not. This is why lasso wins on sparse ground truth — it can express the structural fact “this feature doesn’t matter” exactly, where ridge can only approximate it.
Why does lasso fail to have a closed-form solution despite being convex?
Because the L1 norm is not differentiable at $0$. The gradient of the objective involves $\operatorname{sign}(w_j)$, which is discontinuous at zero. Setting the gradient to zero doesn’t give a linear system — instead, the optimality conditions involve subgradients (case analysis depending on whether each $w_j$ is positive, negative, or zero at the optimum). Solvers like coordinate descent, ISTA, and FISTA handle this with iterative updates that include a “soft-thresholding” step explicitly accounting for the kink at zero. Convexity ensures convergence to the global optimum; non-differentiability just rules out the tidy closed form.
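A numeric check of these subgradient conditions on a fitted scikit-learn `Lasso`, which minimises $(1/2n)\|y - Xw\|_2^2 + \alpha\|w\|_1$ (intercept disabled here to keep the condition clean):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))
w_true = np.array([2.0, -1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

alpha = 0.1
w = Lasso(alpha=alpha, fit_intercept=False).fit(X, y).coef_
g = X.T @ (y - X @ w) / len(y)   # (1/n)*X^T(residual), the smooth part's gradient
# Subgradient optimality: g_j = alpha*sign(w_j) where w_j != 0,
#                         |g_j| <= alpha        where w_j == 0.
print(np.round(w, 3))
print(np.round(g / alpha, 3))    # ~ +-1.0 at non-zero coords, inside (-1, 1) at zeros
```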
Lasso corresponds to a Laplace prior in the Bayesian view. What does the shape of the Laplace distribution explain about lasso's behaviour, compared to ridge's Gaussian prior?
The Laplace distribution has a sharp peak at zero — its density at exactly zero is much larger relative to nearby values than the Gaussian’s. This sharpness expresses a strong prior preference for exactly-zero weights. In MAP estimation, the optimum sits at the prior’s peak when data is weak, and the Laplace’s peak is exactly at zero — so lasso pins coefficients to zero unless data evidence overcomes the peak. Ridge’s Gaussian prior is smooth at zero, so its peak is “spread out” — MAP shrinks weights but rarely lands exactly on zero. The geometric corner-on-axis story for the L1 ball and the probabilistic sharp-peak-at-zero story for the Laplace prior are two views of the same fact.