THE CRUX: Logistic regression draws a straight line. What do we do when the truth isn't a straight line — and among all straight lines that do work, how do we pick the best one?
We extend linear models to nonlinear problems via basis expansion: send the inputs through a nonlinear map and let the linear model live in the new space. Then we tackle a separate question — when many hyperplanes separate the data, the best one is the one that sits as far as possible from the closest training points. That insight gives us Support Vector Machines.
We finished week 2 with logistic regression in good working order: a hypothesis set, a cross-entropy loss, and two ways to optimise it (gradient descent and IRLS). But the underlying classifier still draws a straight line — a hyperplane in the original feature space. Two problems with that:
- Many problems aren’t linearly separable. Concentric rings, points arranged like an XOR, polynomially-curved boundaries — a single hyperplane can’t capture them.
- Even when the data is linearly separable, there are infinitely many separating hyperplanes, and our existing methods give no principled way to choose between them.
This week tackles each problem in turn. They turn out to be loosely related: both fixes preserve the linear machinery but reinterpret what “linear” applies to.
Part 1: Basis Expansion — Linear Models in Disguise
Suppose your data has a quadratic boundary in 1-D: blue points on the outside, orange points in the middle of the axis. No linear function can separate them: a line on $\mathbb{R}$ has at most one zero crossing.
But consider the map $\phi(x) = (1,\ x,\ x^2)^\top$. Each point is sent into 3-D, and a linear function in the new space, $w_0 + w_1 x + w_2 x^2$, is exactly the quadratic we wanted. The trick: by climbing into a higher-dimensional space, we make a non-linearly separable problem linearly separable. The original logistic regression model, run on $\phi(x)$ instead of $x$, now produces a curved decision boundary in the original input space.
This is the non-linear transformation (or basis expansion) idea. Substitute $\phi(\mathbf{x})$ everywhere $\mathbf{x}$ appears:

$$h_{\mathbf{w}}(\mathbf{x}) = \sigma\!\left(\mathbf{w}^\top \phi(\mathbf{x})\right)$$

The cross-entropy loss, gradient, Hessian, and update rules are identical, just with $\phi(\mathbf{x}_n)$ in place of $\mathbf{x}_n$. From the optimisation algorithm’s point of view, the data has been pre-processed; everything else carries on unchanged.
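To make the substitution concrete, here is a minimal NumPy sketch (the degree-2 map, toy data, and learning rate are illustrative choices, not the module’s reference code): the week-2 gradient-descent update is applied unchanged, only to $\phi(x)$ instead of $x$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    # Degree-2 basis expansion for 1-D inputs: x -> (1, x, x^2).
    return np.column_stack([np.ones_like(x), x, x**2])

# Toy 1-D data: orange (y=1) in the middle, blue (y=0) on the outside.
x = np.array([-3.0, -2.5, -2.0, -0.5, 0.0, 0.5, 2.0, 2.5, 3.0])
y = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0])

Phi = phi(x)                       # the "pre-processed" data
w = np.zeros(Phi.shape[1])
lr = 0.1

# Plain gradient descent on the cross-entropy loss: the week-2 update,
# applied verbatim to Phi instead of x.
for _ in range(5000):
    p = sigmoid(Phi @ w)           # predicted probabilities
    grad = Phi.T @ (p - y)         # gradient of the cross-entropy w.r.t. w
    w -= lr * grad / len(y)

# The boundary sigma(w0 + w1*x + w2*x^2) = 0.5 is quadratic in x.
print("weights:", w)               # expect w2 < 0: an interval around 0 is classed orange
```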
ASIDE — "Linear" in what?
When we say logistic regression is a “linear” model, we mean linear in its parameters $\mathbf{w}$, not linear in the inputs. After basis expansion, the model is still linear in $\mathbf{w}$, just nonlinear in $\mathbf{x}$. This is what makes it tractable: the optimisation problem doesn’t change shape (the cross-entropy loss is still strictly convex in $\mathbf{w}$), only the input vectors do.
Building the Right $\phi$
For polynomial decision boundaries of degree $d$, include all monomials of degree up to $d$. With one input variable and degree 2:

$$\phi(x) = (1,\ x,\ x^2)^\top$$
With two inputs and degree 2 you also need the cross term $x_1 x_2$:

$$\phi(\mathbf{x}) = (1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)^\top$$
For degree 3 the list grows further. The number of monomials of degree up to $d$ in $m$ variables is $\binom{m+d}{d}$, which grows roughly like $m^d$ for fixed $d$ and quickly becomes painful for high-dimensional inputs.
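A quick way to see the growth is to enumerate the monomials directly. The sketch below uses only the standard library (the helper name `monomials` is made up for illustration): it lists the monomials of degree at most $d$ as tuples of variable indices and prints the count $\binom{m+d}{d}$ for a few sizes.

```python
from itertools import combinations_with_replacement
from math import comb

def monomials(num_vars, degree):
    """All monomials of degree <= degree in num_vars variables,
    each represented as a tuple of variable indices (empty tuple = the constant 1)."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(range(num_vars), d))
    return terms

# Two inputs, degree 2: 1, x1, x2, x1^2, x1*x2, x2^2 -> 6 terms.
print(monomials(2, 2))

# Growth of the feature count: C(m + d, d).
for m, d in [(2, 2), (10, 3), (100, 3)]:
    print(f"{m} inputs, degree {d}: {comb(m + d, d)} features")
```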
You’re not limited to polynomials. Any non-linear map is fair game: exponential features such as $e^{x}$ if you suspect exponential structure, sinusoidal bases for periodic data, indicator functions for discrete cuts, and so on. The choice of $\phi$ encodes a guess about the shape of the truth.
What Could Go Wrong
Two caveats sit on the other side of the trick:
- High dimension. Polynomial expansion explodes the number of features. With 100 inputs and a degree-3 expansion, $\phi(\mathbf{x})$ has over 170,000 components. Storage, computation, and (especially) sample complexity all suffer.
- Overfitting. A high-degree polynomial can fit any training set perfectly while wiggling wildly between points. Fitting the training data is no guarantee of generalising — and a 4th-order polynomial may generalise worse than a linear fit despite explaining the training data more accurately. This problem is the central concern of weeks 8–10.
A 4th-order polynomial perfectly classifies all training data. A linear classifier misclassifies a few. Which is the better choice, and why isn't training accuracy the deciding factor?
Often the linear classifier. Training accuracy is not what we ultimately care about — we care about generalisation to unseen examples. A 4th-order polynomial has enough capacity to memorise the training noise, drawing wildly contorted boundaries to capture every point. Those contortions are unlikely to reflect the true underlying structure, so the model performs worse on new data than a simpler classifier that “lets” a few outliers be misclassified. Capacity must be matched to evidence — a principle formalised by VC dimension and bias-variance analysis later in the module.
Part 2: Among Many Separators, Which is Best?
Now switch to the second problem. Suppose the data is linearly separable. Logistic regression and several other algorithms can find some separating hyperplane, but they don’t all find the same one. Two hyperplanes might both achieve zero training error while sitting in very different positions. Which is preferable?
Geometric intuition: a separator that just barely splits the classes — passing close to a training point — is fragile. A small perturbation in that point (or any new test point in its neighbourhood) could land on the wrong side. A separator that runs through the middle of the no-man’s-land between classes, leaving wide buffer zones on both sides, is more robust.
Formalise the buffer as the margin: the perpendicular distance from the decision boundary to the closest training example.
The Maximum Margin Idea
Among all hyperplanes that correctly separate the data, choose the one that maximises the margin. The closest training points — the ones that touch the maximum-margin boundary’s “envelope” — are the support vectors. They alone determine the boundary; every other point could be removed without changing the answer. This is what defines a Support Vector Machine (SVM).
TIP — Why support vectors are special
If you took an SVM, deleted every training point except the support vectors, and re-trained, you’d get exactly the same hyperplane. The non-support-vectors are slack — they sit safely on their side of the boundary and contribute nothing to the optimum. Compare with logistic regression, where every training point contributes to the gradient at every iteration. SVMs achieve a kind of “data efficiency” by structurally identifying which examples actually matter for the decision boundary.
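This claim is easy to check empirically. The sketch below assumes scikit-learn is available and uses `SVC` with a very large `C` as a stand-in for the hard-margin SVM: fit once, refit on the support vectors alone, and compare the two hyperplanes.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_blobs

# Two well-separated blobs -> linearly separable; a large C approximates the hard margin.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=0.8, random_state=0)

svm_full = SVC(kernel="linear", C=1e6).fit(X, y)

# Re-train using only the support vectors of the first fit.
sv = svm_full.support_                     # indices of the support vectors
svm_sv_only = SVC(kernel="linear", C=1e6).fit(X[sv], y[sv])

print("support vectors used:", len(sv), "of", len(X))
print("same hyperplane (up to numerical tolerance):",
      np.allclose(svm_full.coef_, svm_sv_only.coef_, atol=1e-4),
      np.allclose(svm_full.intercept_, svm_sv_only.intercept_, atol=1e-4))
```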
Setting Up the Optimisation
For a hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$, the perpendicular distance from a point $\mathbf{x}_n$ to the hyperplane is:

$$\frac{\left|\mathbf{w}^\top \mathbf{x}_n + b\right|}{\|\mathbf{w}\|}$$
The denominator is the Euclidean norm $\|\mathbf{w}\|$; this normalises away the fact that scaling $\mathbf{w}$ and $b$ by any constant doesn’t change the geometric hyperplane.
The optimisation problem is then:

$$\max_{\mathbf{w},\, b}\ \min_{n}\ \frac{\left|\mathbf{w}^\top \mathbf{x}_n + b\right|}{\|\mathbf{w}\|} \qquad \text{subject to all training points being correctly classified}$$

Reading the formula: of all $(\mathbf{w}, b)$ that classify everything correctly, find the pair where the closest training point is as far from the boundary as possible.

The constraint uses labels $y_n \in \{-1, +1\}$, a different convention from logistic regression’s $\{0, 1\}$. The product $y_n(\mathbf{w}^\top \mathbf{x}_n + b)$ is positive when the prediction agrees with the label and negative when it disagrees. Note that once the constraint is satisfied, $\left|\mathbf{w}^\top \mathbf{x}_n + b\right| = y_n(\mathbf{w}^\top \mathbf{x}_n + b)$, so the absolute value can be dropped:

$$\max_{\mathbf{w},\, b}\ \min_{n}\ \frac{y_n\left(\mathbf{w}^\top \mathbf{x}_n + b\right)}{\|\mathbf{w}\|}$$
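The objective is cheap to evaluate for any candidate hyperplane, which makes the intuition easy to test. In the sketch below (the helper `geometric_margin` and the toy data are made up for illustration), both candidates separate the data, but one leaves a much wider buffer.

```python
import numpy as np

def geometric_margin(w, b, X, y):
    """Margin of the hyperplane w.x + b = 0 on data (X, y) with labels in {-1, +1}.
    A negative return value means at least one point is misclassified."""
    distances = y * (X @ w + b) / np.linalg.norm(w)
    return distances.min()

# Two candidate separators for the same toy data: both classify every point
# correctly, but one sits much closer to the nearest examples.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

print(geometric_margin(np.array([1.0, 1.0]), -8.0, X, y))   # boundary near the middle: wide margin
print(geometric_margin(np.array([1.0, 0.2]), -4.0, X, y))   # boundary close to one class: small margin
```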
The Canonical Form
The argmax-min nested optimisation looks intimidating. There’s a clever rescaling that flattens it.
Notice that scaling $(\mathbf{w}, b) \to (\kappa \mathbf{w}, \kappa b)$ for any $\kappa > 0$ leaves the hyperplane geometry unchanged: same boundary, same distances. So we can fix the scale by demanding that for the closest training point, $y_n(\mathbf{w}^\top \mathbf{x}_n + b) = 1$. This is the canonical representation of the hyperplane.
Under this normalisation the constraint becomes $y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1$ for all $n$ (with equality at the closest point), and the inner min in the objective becomes $1/\|\mathbf{w}\|$. The whole problem collapses to:

$$\max_{\mathbf{w},\, b}\ \frac{1}{\|\mathbf{w}\|} \qquad \text{subject to } y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1 \text{ for all } n$$
Maximising $1/\|\mathbf{w}\|$ is equivalent to minimising $\|\mathbf{w}\|$, which is equivalent to minimising $\tfrac{1}{2}\|\mathbf{w}\|^2$ (the half is decorative: its derivative cancels the 2 from differentiating the square; the squared norm is preferred because it’s smooth and convex). The final clean form:

$$\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to } y_n(\mathbf{w}^\top \mathbf{x}_n + b) \ge 1, \quad n = 1, \dots, N$$
This is a quadratic program: a convex quadratic objective with linear inequality constraints. Off-the-shelf QP solvers handle it efficiently, and convexity guarantees a unique global optimum.
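As a sketch of what “hand it to a solver” looks like, the snippet below feeds the canonical problem to SciPy’s general-purpose SLSQP optimiser rather than a dedicated QP package; for a problem this small that is enough, though real SVM implementations use specialised solvers. The toy data is the same illustrative set used in the margin sketch above.

```python
import numpy as np
from scipy.optimize import minimize

# Small separable toy set, labels in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [3.0, 1.5],
              [6.0, 5.0], [7.0, 6.5], [8.0, 6.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

def objective(theta):
    w = theta[:-1]
    return 0.5 * w @ w                    # (1/2) ||w||^2 ; b is unpenalised

def constraint(theta):
    w, b = theta[:-1], theta[-1]
    return y * (X @ w + b) - 1.0          # must be >= 0 elementwise

result = minimize(objective,
                  x0=np.zeros(X.shape[1] + 1),
                  constraints=[{"type": "ineq", "fun": constraint}],
                  method="SLSQP")

w, b = result.x[:-1], result.x[-1]
print("w =", w, " b =", b, " margin =", 1.0 / np.linalg.norm(w))
# The support vectors are the points with y_n (w.x_n + b) approximately equal to 1.
print("support vectors:", X[np.isclose(y * (X @ w + b), 1.0, atol=1e-3)])
```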
Why does maximising margin reduce to minimising $\|\mathbf{w}\|$? It seems backwards: bigger weights should give a bigger margin.
The intuition reverses once you account for normalisation. After canonical rescaling, the closest training point has $y_n(\mathbf{w}^\top \mathbf{x}_n + b) = 1$, which means its perpendicular distance to the hyperplane is $1/\|\mathbf{w}\|$. To make that distance large, $\|\mathbf{w}\|$ must be small. The weights determine the steepness of the linear function, not the geometric position of the hyperplane: a steeper function reaches the value $\pm 1$ at a smaller distance from the boundary, so a large-norm $\mathbf{w}$ corresponds to a small margin.
Combining the Two Ideas
Nothing in the SVM derivation requires that the data be linearly separable in the original space. Substitute $\mathbf{x} \to \phi(\mathbf{x})$ via a basis expansion and you get an SVM whose decision boundary is non-linear in the original space:

$$\min_{\mathbf{w},\, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \qquad \text{subject to } y_n\left(\mathbf{w}^\top \phi(\mathbf{x}_n) + b\right) \ge 1, \quad n = 1, \dots, N$$
The classical SVM example, concentric rings of two classes that are impossible to separate linearly, becomes trivially separable in $\phi$-space. The maximum-margin separator in the lifted space corresponds to a circular (or polynomial) decision boundary in the original space. The support vectors are the points of each class closest to the boundary, usually a handful of points even when the dataset is large.
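An end-to-end sketch of the rings example, assuming scikit-learn is available: lift the 2-D points with a degree-2 polynomial expansion, then fit a plain linear SVM in the lifted space (again with a large `C` approximating the hard margin).

```python
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=300, factor=0.4, noise=0.05, random_state=0)

# Degree-2 basis expansion: (x1, x2) -> (1, x1, x2, x1^2, x1*x2, x2^2).
Phi = PolynomialFeatures(degree=2).fit_transform(X)

# A *linear* SVM in the expanded space produces a curved boundary in the original space.
clf = SVC(kernel="linear", C=1e6).fit(Phi, y)

print("training accuracy:", clf.score(Phi, y))           # the rings are now separable
print("support vectors:", len(clf.support_), "of", len(X))
```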
Combining SVMs with a high-dimensional $\phi$ raises the same dimensionality concern as before, plus a more practical worry: does the QP scale? In week 4 we’ll see that there’s a dual formulation in which the optimisation depends on $\phi(\mathbf{x})$ only through inner products $\phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$, which we can compute via kernel functions without ever explicitly forming $\phi(\mathbf{x})$. That’s the kernel trick, and it’s what makes SVMs practical in very high dimensions.
Concepts Introduced This Week
- non-linear-transformation — basis expansion $\mathbf{x} \mapsto \phi(\mathbf{x})$; lifts data into a higher-dimensional space where linear methods can produce non-linear decision boundaries in the original space.
- support-vector-machine — maximum-margin linear classifier; the hyperplane is determined entirely by the closest training points (the support vectors).
- margin — perpendicular distance from the decision boundary to the closest training example; the quantity SVMs maximise.
- quadratic-programming — a convex optimisation problem class with quadratic objective and linear constraints; the canonical SVM problem reduces to one.
Connections
- Builds on week-02: basis expansion plugs straight into the same gradient-descent and IRLS machinery; the loss landscape stays convex in $\mathbf{w}$.
- Builds on week-01: directly answers the open question “what if the data isn’t linearly separable?”
- Sets up later weeks: the dual SVM formulation and kernels (week 4); soft-margin SVMs for non-separable data; the bias-variance / VC-dimension framing of overfitting risk for high-dimensional (weeks 8–10).
Open Questions
- High-dimensional $\phi(\mathbf{x})$ is expensive — can we avoid computing it explicitly? (Answered next week: the kernel trick uses inner products only.)
- What if the data isn’t linearly separable even after basis expansion? (Soft-margin SVMs, with slack variables.)
- How do we choose the right $\phi$ — degree, basis family, dimension? (Cross-validation, regularisation, model selection — later in the module.)