A pre-processing map $\phi: \mathbb{R}^d \to \mathbb{R}^D$ (often with $D \gg d$) that turns the inputs $x$ into a higher-dimensional feature space. A linear model run on $\phi(x)$ can produce a non-linear decision boundary in the original space, while keeping its optimisation machinery and convexity properties unchanged.
The Trick
Many classification problems aren’t linearly separable. A single hyperplane in the original input space cannot capture circles, XOR-like patterns, or polynomial boundaries.
The fix is to send each input $x$ through a non-linear map $\phi$ — also called a basis expansion or feature transform — and apply a linear model in the resulting space:

$$x \;\mapsto\; \phi(x) \in \mathbb{R}^D$$

In the new space, the model is still a linear function:

$$f(x) = w^\top \phi(x)$$

But $f$, viewed as a function of the original $x$, is non-linear. A line in $\phi$-space can correspond to a curve, a circle, a polynomial, anything determined by $\phi$.
The key reframing: what looks like a linear model in $\phi$-space is a highly non-linear model when viewed back in the original input space. The non-linearity is hidden inside the pre-processing step $\phi$, leaving the model itself — and all its convexity-based optimisation guarantees — perfectly linear.
Toy Example
Data on $\mathbb{R}$: blue points at $x = \pm 2$, orange at $x = 0$. No linear function can separate them — a line on $\mathbb{R}$ has at most one zero crossing.

Apply $\phi(x) = (x, x^2)$ and the points become $(-2, 4)$, $(0, 0)$, $(2, 4)$ in $\mathbb{R}^2$.
A linear function can now separate them — for example, $f(z) = z_2 - 1$ gives $f(\phi(x)) = x^2 - 1$, which is positive outside $[-1, 1]$ and negative inside. The decision boundary in the original 1-D space is the pair of points $x = -1$ and $x = +1$ — a non-linear “boundary” produced by a linear model in lifted space.
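A minimal numpy sketch of this lift (the particular weights $w = (0, 1)$, $b = -1$ are just one choice of separating hyperplane):

```python
import numpy as np

# 1-D points: no threshold on x alone separates the two classes.
x = np.array([-2.0, 0.0, 2.0])
y = np.array([1, 0, 1])              # 1 = blue (outer points), 0 = orange (middle)

def phi(x):
    """Lift each scalar to the 2-D feature vector (x, x^2)."""
    return np.stack([x, x**2], axis=-1)

# A linear model in the lifted space: f(z) = w.z + b, here equal to x^2 - 1.
w = np.array([0.0, 1.0])
b = -1.0

scores = phi(x) @ w + b              # [ 3., -1.,  3.]
print((scores > 0).astype(int))      # [1 0 1] -- matches y after the lift
```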
Polynomial Basis Expansion
For polynomial decision boundaries of degree $k$, include all monomials of total degree up to $k$:
| Inputs | Degree | $\phi(x)$ |
|---|---|---|
| $x \in \mathbb{R}$ | 2 | $(1,\ x,\ x^2)$ |
| $x = (x_1, x_2)$ | 2 | $(1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$ |
| $x = (x_1, x_2)$ | 3 | $(1,\ x_1,\ x_2,\ x_1^2,\ x_1 x_2,\ x_2^2,\ x_1^3,\ x_1^2 x_2,\ x_1 x_2^2,\ x_2^3)$ |
Cross terms like $x_1 x_2$ are essential for capturing interactions — without them you can only express decision boundaries that are sums of single-variable functions.
The number of monomials of total degree at most $k$ in $d$ variables is $\binom{d+k}{k}$, growing roughly as $d^k$ for fixed $k$. With 100 features and $k = 3$, $\phi(x)$ has more than 170,000 components.
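A quick check of that count (assuming scikit-learn is available; `PolynomialFeatures` builds exactly this all-monomials expansion):

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

d, k = 100, 3

# Closed form: monomials of total degree <= k in d variables, including the constant.
print(comb(d + k, k))                                    # 176851

# Same count, obtained by materialising the expansion for a single sample.
phi = PolynomialFeatures(degree=k)
print(phi.fit_transform(np.zeros((1, d))).shape[1])      # 176851
```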
Other Bases
You’re not limited to polynomials. Useful non-polynomial expansions include:
- Exponential / logarithmic: features such as $\log x_j$ or $e^{x_j}$ for data with multiplicative structure.
- Sinusoidal: $\sin(\omega x)$, $\cos(\omega x)$ at chosen frequencies, for periodic data; the basis of Fourier methods.
- Indicator functions: $\phi_j(x) = \mathbb{1}[x \in R_j]$ has one component per region $R_j$ of input space; “one-hot” features for discrete cuts.
- Radial basis functions: $\phi_j(x) = \exp\!\big(-\|x - \mu_j\|^2 / (2\sigma^2)\big)$ centred on each training point or a chosen set of centres $\mu_j$.
The choice of $\phi$ encodes a guess about the shape of the truth. Polynomial expansions are popular because they’re simple and connect cleanly with Taylor approximation intuition: any sufficiently smooth function looks polynomial locally.
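A sketch of two of these expansions in numpy (the function names, frequencies, and centres are illustrative choices, not fixed conventions):

```python
import numpy as np

def fourier_features(x, freqs):
    """Sinusoidal expansion: one sin/cos pair per chosen frequency."""
    x = np.asarray(x)[:, None]                              # shape (n, 1)
    return np.hstack([np.sin(freqs * x), np.cos(freqs * x)])

def rbf_features(X, centres, gamma=1.0):
    """Gaussian bumps centred at chosen points (often a subset of the training data)."""
    sq_dists = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

x = np.linspace(0.0, 1.0, 5)
print(fourier_features(x, freqs=np.array([1.0, 2.0, 3.0])).shape)    # (5, 6)

X = np.random.default_rng(0).normal(size=(10, 2))
print(rbf_features(X, centres=X[:3]).shape)                          # (10, 3)
```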
Linearity Is Preserved Where It Matters
Think of $\phi$ as data pre-processing: from the optimisation algorithm’s point of view, the inputs are $\phi(x_i)$ rather than $x_i$, and that is the only change. Concretely, for logistic-regression with cross-entropy-loss:

$$L(w) = -\sum_{i=1}^{n} \Big[ y_i \log \sigma\big(w^\top \phi(x_i)\big) + (1 - y_i) \log\big(1 - \sigma(w^\top \phi(x_i))\big) \Big]$$

The loss is still convex in $w$. Gradient descent and IRLS still find the global optimum. Nothing in the optimiser changes.
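A scikit-learn sketch of this pre-processing view on synthetic data with a circular true boundary (the data and pipeline are illustrative; the point is that the optimiser is the same object in both fits):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 < 1.0).astype(int)    # circular boundary

# Same optimiser both times; only its inputs differ.
plain = LogisticRegression().fit(X, y)
expanded = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print(plain.score(X, y))      # mediocre: no line in (x1, x2) separates a circle
print(expanded.score(X, y))   # near 1.0: the circle is a hyperplane in phi(x)
```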
ASIDE — “Linear in $w$” vs “linear in $x$”
When we call logistic regression a “linear model,” we mean linear in the parameters $w$, not linear in the inputs $x$. After basis expansion, the model is still linear in $w$ — only non-linear in $x$. This is the property that preserves the convexity of the optimisation. If we had instead made the model non-linear in $w$ (as in a neural network with multiple weight layers), we would lose convexity entirely.
Caveats
Two costs to weigh against the gained expressiveness:
- Dimensionality: $\phi(x)$ may have far more components than $x$, increasing memory, computation, and (especially) sample complexity. The number of training examples needed to fit a model reliably scales with the feature dimension.
- Overfitting: more flexible models can fit training noise. A degree-4 polynomial classifier can often classify a small training set perfectly while generalising worse than a linear classifier that makes a few mistakes. Capacity must be justified by evidence — the central concern of generalisation theory. The sketch below illustrates the effect.
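A hedged illustration of the trade-off on synthetic data with a linear truth plus label noise (exact numbers depend on the seed and split; the typical pattern is higher training accuracy and similar-or-worse held-out accuracy for the expanded model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y_clean = (X[:, 0] + X[:, 1] > 0).astype(int)               # the truth is linear...
y = np.where(rng.random(60) < 0.15, 1 - y_clean, y_clean)   # ...plus 15% label noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 4):
    model = make_pipeline(PolynomialFeatures(degree),
                          LogisticRegression(C=1e4, max_iter=5000))
    model.fit(X_tr, y_tr)
    print(degree, model.score(X_tr, y_tr), model.score(X_te, y_te))
```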
The Kernel Trick (Preview)
For algorithms whose optimisation depends on inputs only through inner products — notably SVMs in their dual form — there is a way to use $\phi$ without ever computing it explicitly. A kernel function $k(x, x') = \phi(x)^\top \phi(x')$ can sometimes be evaluated in $O(d)$ time even when $\phi(x)$ has thousands of components. This makes very high (even infinite) dimensional basis expansions practical. Covered in week 4.
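A numerical sanity check of the idea for the degree-2 polynomial kernel $k(x, x') = (1 + x^\top x')^2$ in two dimensions (the $\sqrt{2}$ scaling on the linear and cross terms is what makes the two sides match exactly):

```python
import numpy as np

def phi(x):
    """Scaled degree-2 expansion whose inner products reproduce (1 + x.x')^2."""
    x1, x2 = x
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)

explicit = phi(x) @ phi(xp)          # inner product in the 6-D feature space
implicit = (1.0 + x @ xp) ** 2       # evaluated directly in the 2-D input space

print(np.isclose(explicit, implicit))   # True
```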
Related
- support-vector-machine — combines basis expansion with maximum-margin classification
- logistic-regression — accepts basis expansion transparently
- generalization — high-dim raises overfitting risk
- decision boundary — what becomes non-linear in the original space
Active Recall
Why does applying a basis expansion to logistic regression not break the convexity of the loss?
The cross-entropy loss is convex as a function of $w$, regardless of what the inputs are. After basis expansion, the loss is $L(w) = -\sum_i \big[ y_i \log \sigma(w^\top \phi(x_i)) + (1 - y_i) \log(1 - \sigma(w^\top \phi(x_i))) \big]$, which is still convex in $w$ — the new inputs $\phi(x_i)$ are constants from the optimiser’s perspective. We’ve changed the inputs, not the function class of the loss in $w$.
For two input variables $x_1, x_2$, write the basis expansion for polynomial decision boundaries up to degree 2. Why must the cross term $x_1 x_2$ be included?
$\phi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$. The cross term $x_1 x_2$ captures interactions between features. Without it, the model can only express boundaries that are sums of separate single-variable functions of $x_1$ and $x_2$ — for example, an axis-aligned ellipse $a x_1^2 + b x_2^2 = c$, but not a rotated one such as $a(x_1 + x_2)^2 + b(x_1 - x_2)^2 = c$, which expands to include an $x_1 x_2$ term. In general, a quadratic boundary whose axes are not aligned with the coordinate axes cannot be expressed without the cross term.
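A two-line sympy check of that expansion (the coefficients 1 and 4 are arbitrary):

```python
import sympy as sp

x1, x2 = sp.symbols("x1 x2")

print(sp.expand(x1**2 + 4 * x2**2))                 # axis-aligned: no cross term
print(sp.expand((x1 + x2)**2 + 4 * (x1 - x2)**2))   # rotated: 5*x1**2 - 6*x1*x2 + 5*x2**2
```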
A degree-4 polynomial basis expansion lets a linear classifier perfectly classify all training data, while a linear classifier without expansion misclassifies a few points. Which is likely to generalise better, and why?
Often the un-expanded linear classifier. Training accuracy is not the goal — generalisation is. The high-degree expansion gives the classifier enough capacity to memorise the training noise, drawing wildly contorted boundaries that fit specific training points but don’t reflect the underlying structure. With limited data, the simpler model whose mistakes are evenly distributed (suggesting irreducible noise rather than systematic structure it’s missing) often generalises better. Capacity must be matched to the evidence in the data — a tension formalised by VC dimension and bias-variance analysis.