TARGET DECK MachineLearning::Week-03
Non-linear Transformation
What is a non-linear transformation (basis expansion), and what does it allow a linear model to do?
A pre-processing map $\phi : \mathbb{R}^d \to \mathbb{R}^D$ (typically with $D > d$) that lifts each input $x$ into a higher-dimensional feature space. A linear model in the lifted space corresponds to a non-linear decision boundary in the original space: circles, polynomials, anything determined by the choice of $\phi$. The model is still linear in the weights $w$, so optimisation machinery and convexity are preserved.
For input $x = (x_1, x_2)$, what is the polynomial basis expansion of degree 2?
All monomials of total degree $\le 2$: $\phi(x) = (1,\ x_1,\ x_2,\ x_1^2,\ x_2^2,\ x_1 x_2)$. The cross term $x_1 x_2$ is essential; without it, the model can only express boundaries that are sums of separate single-variable functions of $x_1$ and $x_2$ (no rotated ellipses, for example).
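A minimal numpy sketch of this expansion (the helper name `expand_deg2` is just for illustration); scikit-learn's `PolynomialFeatures(degree=2)` produces the same set of monomials for arbitrary input dimension:

```python
import numpy as np

def expand_deg2(x):
    """Map (x1, x2) to all monomials of total degree <= 2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x2**2, x1 * x2])

print(expand_deg2(np.array([2.0, 3.0])))  # [1. 2. 3. 4. 9. 6.]
```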
Why does basis expansion not break the convexity of logistic regression's loss?
Convexity of the cross-entropy loss is a property of how the loss depends on $w$, not of the inputs $x$. After basis expansion, the loss becomes $\ell(w) = -\sum_i \big[\, y_i \log \sigma(w^\top \phi(x_i)) + (1 - y_i) \log\big(1 - \sigma(w^\top \phi(x_i))\big) \,\big]$: still linear in $w$ inside the sigmoid, still the same convex composition. The new inputs $\phi(x_i)$ are constants from the optimiser's perspective. Linearity in $w$ is preserved; nothing in optimisation changes.
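A sketch of that claim in code, using a hand-rolled cross-entropy loss and a generic optimiser on toy data (the degree-2 lift is inlined so the block is self-contained). The lifted features enter only as a fixed matrix `Phi`; the loss is optimised over $w$ exactly as without the expansion:

```python
import numpy as np
from scipy.optimize import minimize

def cross_entropy(w, Phi, y):
    """Logistic-regression loss; the rows of Phi are fixed lifted inputs."""
    p = 1.0 / (1.0 + np.exp(-(Phi @ w)))   # sigmoid of a function linear in w
    eps = 1e-12
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

X = np.array([[0.5, 1.0], [2.0, 2.0], [-1.0, 0.5], [1.5, -2.0]])
y = np.array([0, 1, 0, 1])
Phi = np.array([[1, x1, x2, x1**2, x2**2, x1 * x2] for x1, x2 in X])  # constants to the optimiser
result = minimize(cross_entropy, np.zeros(Phi.shape[1]), args=(Phi, y))
print(result.fun)   # converges to a low loss; convexity in w means no spurious local minima
```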
What two costs do you pay for using a high-degree polynomial basis expansion?
- Dimensionality: the number of monomials of total degree $\le k$ in $d$ variables grows as $\binom{d + k}{k} = O(d^k)$. With 100 inputs and $k = 3$, $\phi(x)$ has more than 170,000 components: heavy on memory, computation, and especially sample complexity (see the count check after this list).
- Overfitting: more capacity means the model can fit training noise. A degree-4 polynomial often has enough capacity to classify a modest training set perfectly while generalising worse than a linear classifier.
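A quick check of the dimensionality count, using the closed form $\binom{d+k}{k}$ for the number of monomials of total degree at most $k$ in $d$ variables (constant term included):

```python
from math import comb

def n_monomials(d, k):
    """Monomials of total degree <= k in d variables, constant term included."""
    return comb(d + k, k)

print(n_monomials(2, 2))     # 6: matches the degree-2 expansion above
print(n_monomials(100, 3))   # 176851: the "more than 170,000 components" figure
```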
Margin and SVM
What is the margin of a separating hyperplane?
The perpendicular distance from the decision boundary to the closest training example: $\gamma = \min_i \dfrac{y_i (w^\top x_i + b)}{\lVert w \rVert}$. A wider margin means more buffer between the boundary and the data; small perturbations are less likely to flip a label. The quantity SVMs maximise.
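A small numpy sketch of that distance, with a hypothetical hyperplane and toy points:

```python
import numpy as np

def margin(w, b, X, y):
    """Smallest signed perpendicular distance from the boundary w.x + b = 0 to the data."""
    return np.min(y * (X @ w + b)) / np.linalg.norm(w)

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0
print(margin(w, b, X, y))   # ~1.414: the distance of (-1,-1) from the line x1 + x2 = 0
```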
What is a support vector, and why are non-support-vectors irrelevant to the SVM solution?
A support vector is a training point sitting exactly on the margin envelope, i.e. with $y_i (w^\top x_i + b) = 1$. These are the points closest to the decision boundary in each class. If you deleted every non-support-vector and re-trained, you'd get the same hyperplane; only the support vectors actively constrain the optimum, while every other point satisfies its constraint with slack and contributes nothing.
Write the canonical hard-margin SVM primal problem.
$$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1 \ \ \text{for all } i.$$
The objective minimises $\lVert w \rVert^2$ (which maximises the margin $1/\lVert w \rVert$); each constraint requires every training point to lie on the correct side of the margin envelope. A convex quadratic program.
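A sketch of solving this QP directly, assuming cvxpy is installed and the toy data is linearly separable:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])

w = cp.Variable(2)
b = cp.Variable()
objective = cp.Minimize(0.5 * cp.sum_squares(w))        # (1/2) ||w||^2
constraints = [cp.multiply(y, X @ w + b) >= 1]          # y_i (w^T x_i + b) >= 1 for all i
cp.Problem(objective, constraints).solve()
print(w.value, b.value, 1.0 / np.linalg.norm(w.value))  # weights, bias, margin
```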
Why does maximising margin reduce to minimising ?
Apparent paradox: bigger weights should give bigger margin, right? No. After canonical rescaling (where the closest training point has $y_i (w^\top x_i + b) = 1$), its perpendicular distance to the boundary is exactly $1 / \lVert w \rVert$. To make that distance large, $\lVert w \rVert$ must be small. Larger weights mean a steeper linear function, which can only achieve the value $1$ closer to the boundary, corresponding to a smaller margin.
What is quadratic programming (QP), and why is the SVM primal a QP?
A QP minimises a convex quadratic objective subject to linear inequality (and possibly equality) constraints. The SVM primal $\min_{w,b} \tfrac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top x_i + b) \ge 1$ matches: quadratic objective (in $w$), linear inequality constraints (in $w$ and $b$). Off-the-shelf QP solvers handle this in polynomial time, and convexity guarantees a unique global optimum.
What is the canonical representation of a separating hyperplane?
The scaling of $(w, b)$ where the closest training point satisfies $y_i (w^\top x_i + b) = 1$. Used because the geometric hyperplane is invariant to multiplying $(w, b)$ by any $c > 0$; the canonical scaling fixes that ambiguity by demanding the closest point hits the value $1$. Under this normalisation the margin is $1 / \lVert w \rVert$ and the constraint becomes $y_i (w^\top x_i + b) \ge 1$ for all $i$.
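A numpy sketch of the rescaling: divide $(w, b)$ by the value the closest point achieves, so that point then hits exactly 1:

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([3.0, 3.0]), 0.0            # some arbitrary positive scaling of the boundary

scale = np.min(y * (X @ w + b))             # value achieved by the closest point (here 6)
w_c, b_c = w / scale, b / scale             # canonical representation
print(np.min(y * (X @ w_c + b_c)))          # 1.0: closest point sits on the margin envelope
print(1.0 / np.linalg.norm(w_c))            # ~1.414: the margin under canonical scaling
```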
Comparison
A degree-4 polynomial classifier perfectly classifies all training data; a linear classifier misclassifies a few. Which is likely to generalise better, and why?
Often the linear classifier. Training accuracy is not the goal — generalisation is. The degree-4 polynomial has enough capacity to memorise training noise, drawing wildly contorted boundaries to capture every point; those contortions are unlikely to reflect the true underlying structure. Capacity must be matched to data — a principle formalised later by VC dimension and bias–variance analysis.
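A small experiment sketch, assuming scikit-learn; the data has a genuinely linear boundary plus label noise, and the exact scores depend on the seed:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # true boundary is linear
flip = rng.random(len(y)) < 0.15
y[flip] = 1 - y[flip]                       # 15% label noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

linear = LogisticRegression(C=1e6, max_iter=10000).fit(X_tr, y_tr)
poly4 = make_pipeline(PolynomialFeatures(degree=4),
                      LogisticRegression(C=1e6, max_iter=10000)).fit(X_tr, y_tr)

for name, model in [("linear", linear), ("degree-4", poly4)]:
    print(name, model.score(X_tr, y_tr), model.score(X_te, y_te))
# The higher-capacity model typically shows the larger gap between train and test accuracy.
```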
What's the structural difference between SVM and logistic regression in how they use training data?
- Logistic regression: every training point contributes to the gradient at every iteration; gradient touches all examples.
- SVM: only support vectors (typically a tiny fraction) determine the hyperplane; non-SVs could be deleted without changing the optimum.
SVMs achieve a kind of data efficiency by structurally identifying which examples actually matter. Logistic regression has no such notion of “irrelevant” training points.
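A sketch of that structural claim with scikit-learn's linear-kernel SVC (large C approximates a hard margin; exact numbers depend on the generated blobs):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, random_state=0)
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print(len(svm.support_), "support vectors out of", len(X))

# Keep only the support vectors and refit: same hyperplane up to numerical tolerance.
svm_sv = SVC(kernel="linear", C=1e6).fit(X[svm.support_], y[svm.support_])
print(svm.coef_, svm.intercept_)
print(svm_sv.coef_, svm_sv.intercept_)
```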
After basis expansion, what is the SVM primal in $\phi$-space?
Same form, $\min_{w,\,b} \tfrac{1}{2}\lVert w \rVert^2$ s.t. $y_i (w^\top \phi(x_i) + b) \ge 1$, but now $w$ lives in $\phi$-space. The decision boundary is non-linear in the original space (e.g., a circle from the embedding $\phi(x) = (x_1, x_2, x_1^2 + x_2^2)$). The QP machinery is unchanged; the practical concern is that $\dim \phi$ may be huge, motivating the dual formulation and kernels in week 4.
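A sketch of the circle example, assuming scikit-learn: an inner disc versus an outer ring is not linearly separable in the original two dimensions, but a linear SVM on the lift $(x_1, x_2, x_1^2 + x_2^2)$ separates it perfectly:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
r = np.concatenate([rng.uniform(0.0, 1.0, 100), rng.uniform(2.0, 3.0, 100)])  # radii
theta = rng.uniform(0.0, 2.0 * np.pi, 200)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
y = np.array([0] * 100 + [1] * 100)          # inner disc vs outer ring

phi = np.column_stack([X[:, 0], X[:, 1], X[:, 0]**2 + X[:, 1]**2])   # the lift
clf = SVC(kernel="linear", C=1e6).fit(phi, y)
print(clf.score(phi, y))   # 1.0: a hyperplane in phi-space is a circle in x-space
```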