A pre-processing map $\phi$ (often with $\phi : \mathbb{R}^d \to \mathbb{R}^D$ for $D \gg d$) that turns the inputs $x$ into a higher-dimensional feature space. A linear model run on $\phi(x)$ can produce a non-linear decision boundary in the original space, while keeping its optimisation machinery and convexity properties unchanged.

The Trick

Many classification problems aren’t linearly separable. A single hyperplane in the original input space cannot capture circles, XOR-like patterns, or polynomial boundaries.

The fix is to send each input $x$ through a non-linear map $\phi$ (also called a basis expansion or feature transform) and apply a linear model in the resulting space:

$x \;\longmapsto\; \phi(x), \qquad \phi : \mathbb{R}^d \to \mathbb{R}^D$

In the new space, the model is still a linear function:

$f(x) = w^\top \phi(x)$

But $f$, viewed as a function of the original input $x$, is non-linear. A line in $\phi$-space can correspond to a curve, a circle, a polynomial, anything determined by $\phi$.

The key reframing: what looks like a linear model in $\phi$-space is a highly non-linear model when viewed back in the original input space. The non-linearity is hidden inside the pre-processing step $\phi$, leaving the model itself perfectly linear, with all its convexity-based optimisation guarantees intact.

Toy Example

Data on $\mathbb{R}$: blue points at $x = \pm 2$, orange at $x = 0$. No linear function of $x$ can separate them: a line on $\mathbb{R}$ has at most one zero crossing.

Apply $\phi(x) = (x, x^2)$ and the points become $(-2, 4)$, $(2, 4)$, and $(0, 0)$ in $\mathbb{R}^2$.

A linear function can now separate them: for example, $w = (0, 1)$ with bias $b = -1$ gives $f(x) = x^2 - 1$, which is positive outside $[-1, 1]$ and negative inside. The decision boundary in the original 1-D space is the pair of points $x = -1$ and $x = +1$, a non-linear “boundary” produced by a linear model in lifted space.
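
A minimal numpy sketch of this example (the point locations and the weights follow the reconstruction above and are purely illustrative):

```python
import numpy as np

# 1-D inputs: blue points away from the origin, orange point at it
x = np.array([-2.0, 2.0, 0.0])
y = np.array([1, 1, 0])                 # 1 = blue, 0 = orange

# Lift to 2-D: phi(x) = (x, x^2)
Phi = np.stack([x, x**2], axis=1)       # shape (3, 2)

# A fixed linear model in phi-space: w = (0, 1), b = -1, i.e. f(x) = x^2 - 1
w, b = np.array([0.0, 1.0]), -1.0
f = Phi @ w + b

print(f)                                # [ 3.  3. -1.]
print((f > 0).astype(int) == y)         # all True: separable after the lift
```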

Polynomial Basis Expansion

For polynomial decision boundaries of degree $k$, include all monomials of total degree up to $k$:

| Inputs | Degree | $\phi(x)$ |
| --- | --- | --- |
| $x \in \mathbb{R}$ | 2 | $(1,\ x,\ x^2)$ |
| $(x_1, x_2)$ | 2 | $(1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2)$ |
| $(x_1, x_2)$ | 3 | $(1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$ |

Cross terms like $x_1 x_2$ are essential for capturing interactions; without them you can only express decision boundaries that are sums of single-variable functions.

The number of monomials of total degree at most $k$ in $d$ variables is $\binom{d+k}{k}$, growing roughly as $d^k$ for fixed $k$. With 100 features and $k = 3$, $\phi(x)$ has more than 170,000 components.
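
A quick way to check these counts is Python's math.comb (a sketch; the helper name is just for illustration):

```python
from math import comb

def num_monomials(d, k):
    """Monomials of total degree <= k in d variables: C(d + k, k)."""
    return comb(d + k, k)

print(num_monomials(2, 2))    # 6   -> 1, x1, x2, x1^2, x1*x2, x2^2
print(num_monomials(2, 3))    # 10
print(num_monomials(100, 3))  # 176851, i.e. more than 170,000 components
```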

Other Bases

You’re not limited to polynomials. Useful non-polynomial expansions include:

  • Exponential / logarithmic: components such as $\log x$ or $e^{x}$, for data with multiplicative structure.
  • Sinusoidal: $\sin(\omega x)$ and $\cos(\omega x)$ components for periodic data; the basis of Fourier methods.
  • Indicator functions: $\phi_j(x) = \mathbb{1}[x \in R_j]$, one component per region $R_j$ of input space; “one-hot” features for discrete cuts.
  • Radial basis functions: $\phi_j(x) = \exp\!\big(-\lVert x - \mu_j \rVert^2 / (2\sigma^2)\big)$, centred on each training point or a chosen set of centres $\mu_j$.

The choice of $\phi$ encodes a guess about the shape of the truth. Polynomial expansions are popular because they’re simple and connect cleanly with Taylor-approximation intuition: any sufficiently smooth function looks polynomial locally.
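
As a sketch, two of these expansions written as plain functions; the helper names, frequencies, and centres here are arbitrary illustrative choices, not prescribed by the notes:

```python
import numpy as np

def sinusoidal_features(x, n_freq=3):
    """Fourier-style basis for 1-D inputs: sin(kx) and cos(kx), k = 1..n_freq."""
    ks = np.arange(1, n_freq + 1)
    return np.concatenate([np.sin(np.outer(x, ks)), np.cos(np.outer(x, ks))], axis=1)

def rbf_features(x, centres, sigma=1.0):
    """Gaussian radial basis functions, one component per centre."""
    sq_dist = (x[:, None] - centres[None, :]) ** 2
    return np.exp(-sq_dist / (2 * sigma**2))

x = np.linspace(-1.0, 1.0, 5)
print(sinusoidal_features(x).shape)                               # (5, 6)
print(rbf_features(x, centres=np.array([-0.5, 0.0, 0.5])).shape)  # (5, 3)
```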

Linearity Is Preserved Where It Matters

Think of $\phi$ as data pre-processing: from the optimisation algorithm’s point of view, the inputs are $\phi(x_i)$ rather than $x_i$, and that is the only change. Concretely, for logistic regression with the cross-entropy loss:

$L(w) = -\sum_{i=1}^{n} \Big[\, y_i \log \sigma\big(w^\top \phi(x_i)\big) + (1 - y_i) \log\big(1 - \sigma(w^\top \phi(x_i))\big) \Big], \qquad \sigma(z) = \frac{1}{1 + e^{-z}}$

The loss is still convex in $w$; gradient descent and IRLS still find the global optimum. Nothing in the optimiser changes.
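
A sketch of that point: plain gradient descent on the cross-entropy loss, where the only change from ordinary logistic regression is that the design matrix holds $\phi(x_i)$ instead of $x_i$. The 1-D data and the quadratic feature map below are illustrative, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi(x):
    """Degree-2 basis expansion of 1-D inputs: (1, x, x^2)."""
    return np.column_stack([np.ones_like(x), x, x**2])

# Toy labels: orange (0) near the origin, blue (1) further out
x = np.array([-2.0, -1.5, -0.2, 0.0, 0.3, 1.7, 2.1])
y = np.array([1, 1, 0, 0, 0, 1, 1])

Phi = phi(x)                     # the pre-processing step: the only change
w = np.zeros(Phi.shape[1])

for _ in range(5000):            # ordinary gradient descent on the convex loss
    p = sigmoid(Phi @ w)
    w -= 0.05 * Phi.T @ (p - y)  # gradient of the cross-entropy loss

print(np.round(sigmoid(Phi @ w)).astype(int))  # recovers y: non-linear boundary, linear in w
```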

ASIDE — “Linear in $w$” vs “linear in $x$”

When we call logistic regression a “linear model,” we mean linear in the parameters $w$, not linear in the inputs $x$. After basis expansion, the model is still linear in $w$ and only non-linear in $x$. This is the property that preserves convex optimisation. If we had instead made the model non-linear in $w$ (as in a neural network with multiple weight layers), we would lose convexity entirely.

Caveats

Two costs to weigh against the gained expressiveness:

  • Dimensionality: $\phi(x)$ may have far more components than $x$, increasing memory, computation, and (especially) sample complexity. The number of training examples needed to fit a model reliably scales with the feature dimension.
  • Overfitting: more flexible models can fit training noise. A degree-4 polynomial classifier may classify the training set perfectly yet generalise worse than a linear classifier that makes a few training mistakes. Capacity must be justified by evidence, which is the central concern of generalisation theory (see the sketch below).
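
A sketch of the overfitting caveat using scikit-learn’s PolynomialFeatures and LogisticRegression; the synthetic data, the split, and the degrees are illustrative, and the exact accuracies will vary, but the gap between train and test scores typically widens with degree:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
# Noisy linear truth: a plain linear classifier is the right capacity here
y = (X[:, 0] + X[:, 1] + rng.normal(scale=1.0, size=200) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4, 12):
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(C=1e6, max_iter=10_000))
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```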

The Kernel Trick (Preview)

For algorithms whose optimisation depends on the inputs only through inner products (notably SVMs in their dual form), there is a way to use $\phi$ without ever computing it explicitly. A kernel function $k(x, x') = \phi(x)^\top \phi(x')$ can sometimes be evaluated in $O(d)$ time even when $\phi(x)$ has thousands of components. This makes very high-dimensional (even infinite-dimensional) basis expansions practical. Covered in week 4.
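
A tiny numerical preview of that identity, using the standard degree-2 polynomial kernel (this particular kernel and feature map are the textbook example, not necessarily the one covered in week 4):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for a 2-D input."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Degree-2 polynomial kernel: an O(d) evaluation, no explicit features."""
    return float(np.dot(x, z)) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(np.dot(phi(x), phi(z)))   # 1.0
print(k(x, z))                  # 1.0  -- same inner product, without building phi
```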

Active Recall