A linear classifier that, among all hyperplanes correctly separating the training data, picks the one that sits as far as possible from the closest training point. Maximising this perpendicular distance — the margin — leads to a convex quadratic program with a unique global optimum.

The Idea in One Picture

Linearly separable data admits infinitely many separating hyperplanes. Some pass close to a training point on one side; some pass through the wide gap in the middle. The latter are intuitively safer: small perturbations to a point are less likely to push it across the boundary. SVMs formalise “safer” as the margin — the perpendicular distance from the boundary to the closest training point — and pick the hyperplane that maximises it.

The training points that touch the resulting margin envelope are called support vectors. They are the only points that affect the boundary; every other training point sits comfortably on its side and could be removed without changing the answer.

The Hypothesis Set

$$\mathcal{H} = \left\{\, h(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b \;:\; \mathbf{w} \in \mathbb{R}^D,\ b \in \mathbb{R} \,\right\}$$

with prediction:

$$\hat{y} = \operatorname{sign}(\mathbf{w}^\top \mathbf{x} + b) \in \{-1, +1\}$$

SVM uses labels $y \in \{-1, +1\}$ rather than $\{0, 1\}$ — a convention that makes the constraints cleaner. The product $y_i(\mathbf{w}^\top \mathbf{x}_i + b)$ is positive exactly when the prediction agrees with the label.

Unlike logistic regression, SVM is not a probabilistic classifier — it doesn’t try to model $p(y \mid \mathbf{x})$. It just predicts which side of the hyperplane a point lies on.
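
A minimal numpy sketch of the prediction rule; the helper name, `w`, `b`, and the two points are illustrative values, not learned parameters:

```python
import numpy as np

def svm_predict(X, w, b):
    """Hard-margin SVM hypothesis: label = sign(w.x + b), reported in {-1, +1}."""
    scores = X @ w + b
    return np.where(scores >= 0, 1, -1)

# Illustrative (not learned) parameters and two points on opposite sides.
w = np.array([1.0, -2.0])
b = 0.5
X = np.array([[3.0, 1.0],    # w.x + b = +1.5 -> predict +1
              [0.0, 1.0]])   # w.x + b = -1.5 -> predict -1
y = np.array([1, -1])

print(svm_predict(X, w, b))       # [ 1 -1]
print(y * (X @ w + b) > 0)        # [ True  True]: both products positive, both correct
```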

Deriving the Optimisation

Step 1 — Express the margin. The perpendicular distance from $\mathbf{x}_i$ to the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ is $\frac{|\mathbf{w}^\top \mathbf{x}_i + b|}{\|\mathbf{w}\|}$. The margin is the minimum over training points:

$$\gamma(\mathbf{w}, b) = \min_{i} \frac{|\mathbf{w}^\top \mathbf{x}_i + b|}{\|\mathbf{w}\|}$$
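
A quick numerical reading of Step 1, on an illustrative separable toy set (both the data and the candidate hyperplane are made up for the example):

```python
import numpy as np

# Illustrative separable toy data (labels in {-1, +1}) and one candidate hyperplane.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# Perpendicular distance of each point to the hyperplane w.x + b = 0,
# then the margin = distance to the closest training point.
distances = np.abs(X @ w + b) / np.linalg.norm(w)
print(distances.round(3))   # [2.828 4.243 1.414 3.536]
print(distances.min())      # margin of this particular hyperplane, ~1.414
```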

Step 2 — Maximise the margin, subject to correctness. All training examples must be correctly classified — i.e., $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0$ — and among those classifiers we want the one with the largest margin:

$$\max_{\mathbf{w}, b} \; \min_{i} \frac{|\mathbf{w}^\top \mathbf{x}_i + b|}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 0 \;\; \text{for all } i$$

The absolute value drops because, under the correctness constraint, $|\mathbf{w}^\top \mathbf{x}_i + b| = y_i(\mathbf{w}^\top \mathbf{x}_i + b)$.

Step 3 — Canonical rescaling. The hyperplane is unchanged if we multiply $(\mathbf{w}, b)$ by any positive scalar $\kappa$. Use that freedom to fix the scale so that the closest training point satisfies $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. Under this rescaling:

  • The constraint becomes $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1$ for all $i$, with equality for at least one (the closest) point.
  • The inner $\min_i y_i(\mathbf{w}^\top \mathbf{x}_i + b)$ in the objective becomes $1$.
  • The objective collapses to maximising $\frac{1}{\|\mathbf{w}\|}$.

Step 4 — Convert max to min. Maximising $\frac{1}{\|\mathbf{w}\|}$ is equivalent to minimising $\|\mathbf{w}\|$, which is equivalent to minimising $\frac{1}{2}\|\mathbf{w}\|^2$. The half is conventional — it makes the gradient $\mathbf{w}$ rather than $2\mathbf{w}$. The squared norm is preferred because it’s smooth, strictly convex, and quadratic (which lets us use QP solvers).
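
A small numerical check of Steps 3 and 4, reusing the toy data above: rescaling $(\mathbf{w}, b)$ so the closest point scores exactly 1 leaves the hyperplane alone and makes the margin equal $\frac{1}{\|\mathbf{w}\|}$.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
w, b = np.array([1.0, 1.0]), 0.0

# Canonical rescaling: divide (w, b) by the smallest score so min_i y_i(w.x_i + b) = 1.
scores = y * (X @ w + b)
kappa = scores.min()                   # > 0 because this hyperplane separates the data
w_c, b_c = w / kappa, b / kappa        # same hyperplane, new scale

print((y * (X @ w_c + b_c)).min())     # 1.0 -> closest point sits exactly on the margin
print(1.0 / np.linalg.norm(w_c))       # 1/||w|| under the canonical scale ...
print(np.min(np.abs(X @ w + b)) / np.linalg.norm(w))   # ... equals the margin computed directly
```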

The SVM Primal Problem

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1, \quad i = 1, \dots, N$$

This is a quadratic program: convex quadratic objective with linear inequality constraints. Properties:

  • Convex. Unique global optimum; any descent method reaches it.
  • Quadratic. Off-the-shelf QP solvers handle it efficiently.
  • Sparse solution. At the optimum only a few constraints are active (binding with equality) — these correspond to the support vectors.
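
The primal is small enough to hand to a generic solver. A hedged sketch using cvxpy (an assumption that it is installed with a QP-capable solver; the toy data is the same illustrative set as above), showing which constraints bind at the optimum:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Primal hard-margin SVM: minimise 0.5*||w||^2  s.t.  y_i (w.x_i + b) >= 1.
w = cp.Variable(2)
b = cp.Variable()
margins = cp.multiply(y, X @ w + b)
problem = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(w)), [margins >= 1])
problem.solve()

print(w.value, b.value)
print(margins.value.round(3))   # entries equal to 1 mark the active (support-vector) constraints
```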

Why the Two Constraint Forms Are Equivalent

The optimal SVM solution has $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ for some training points (the support vectors) and $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1$ for the rest. So the constraints "$y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1$ for all $i$" and "$\min_i y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$" describe the same optimum:

  • The $\ge 1$ form is looser — it permits $\min_i y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1$.
  • The $= 1$ form is stricter — it forces the closest point to be exactly on the margin.

Because we’re minimising $\frac{1}{2}\|\mathbf{w}\|^2$, we want $\|\mathbf{w}\|$ to be as small as possible — i.e., the margin $\frac{1}{\|\mathbf{w}\|}$ to be as large as possible. The smallest $\|\mathbf{w}\|$ that satisfies $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1$ everywhere is the one that achieves equality at the binding constraint: if every constraint held strictly, we could shrink $(\mathbf{w}, b)$ by a common factor, keep all constraints satisfied, and reduce the objective. The two formulations therefore have the same optimum.

Support Vectors

At the optimum, each training point falls into one of three cases:

| Condition | Meaning |
| --- | --- |
| $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$ | On the margin — support vector |
| $y_i(\mathbf{w}^\top \mathbf{x}_i + b) > 1$ | Strictly beyond margin — does not affect the boundary |
| $y_i(\mathbf{w}^\top \mathbf{x}_i + b) < 1$ | Inside the margin or misclassified — impossible in hard-margin SVM (constraint violated) |

The third case can’t occur because the constraints rule it out. (For non-separable data, soft-margin SVMs introduce slack variables that allow misclassifications at a cost — covered later.)

The boundary depends only on the support vectors. Delete any non-support-vector point and re-train: the same hyperplane comes out. This is the structural reason SVMs are described as “data efficient” — only a handful of training points actually matter for the decision.
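
A hedged sketch of this with scikit-learn (assumed available): `SVC` with a very large `C` approximates the hard-margin SVM, exposes the support vectors, and re-training without a non-support-vector point leaves the hyperplane essentially unchanged. The data is the illustrative toy set from earlier.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])

# Very large C approximates the hard margin (sklearn has no exact hard-margin mode).
svm = SVC(kernel="linear", C=1e10).fit(X, y)
print(svm.support_)                      # indices of the support vectors
print(svm.coef_, svm.intercept_)         # the learned hyperplane

# Drop one point that is *not* a support vector and re-train: same boundary.
non_sv = [i for i in range(len(X)) if i not in svm.support_][0]
mask = np.arange(len(X)) != non_sv
svm2 = SVC(kernel="linear", C=1e10).fit(X[mask], y[mask])
print(svm2.coef_, svm2.intercept_)       # matches the first fit up to solver tolerance
```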

Combining With Basis Expansion

Nothing in the derivation requires that the data be linearly separable in the original space. Plug in a basis expansion $\boldsymbol{\phi}(\mathbf{x})$ and you get a non-linear SVM:

$$\min_{\mathbf{w}, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i\big(\mathbf{w}^\top \boldsymbol{\phi}(\mathbf{x}_i) + b\big) \ge 1, \quad i = 1, \dots, N$$

The boundary is a hyperplane in $\boldsymbol{\phi}$-space, but a curved (polynomial, circular, etc.) surface in the original space. With a quadratic $\boldsymbol{\phi}$ (squared and cross terms of the inputs) the SVM can carve out elliptical decision boundaries — perfect for the canonical “concentric rings” example.
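
A hedged end-to-end sketch of the rings example with scikit-learn (assumed available): expand to degree-2 features, then fit a plain linear SVM in that space.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Concentric rings: not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.0, random_state=0)
y = 2 * y - 1                                   # relabel {0, 1} -> {-1, +1}

# Degree-2 expansion phi(x) = (x1, x2, x1^2, x1*x2, x2^2).
phi = PolynomialFeatures(degree=2, include_bias=False)
Z = phi.fit_transform(X)

# A *linear* SVM in phi-space gives a circular boundary in the original space.
svm = SVC(kernel="linear", C=1e10).fit(Z, y)
print((svm.predict(Z) == y).mean())             # 1.0: the rings are separable in phi-space
```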

The catch: $\boldsymbol{\phi}(\mathbf{x})$ may be very high-dimensional, making the QP large. The fix — using kernel functions to evaluate $\boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$ without ever forming $\boldsymbol{\phi}(\mathbf{x})$ — is the kernel trick, covered in week 4 via the dual formulation (see below).

The Dual Representation

The primal QP has $D + 1$ unknowns ($\mathbf{w} \in \mathbb{R}^D$ and $b$). Apply Lagrange duality: the $N$ constraints each get a multiplier $\alpha_i \ge 0$, with Lagrangian $L(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_i \alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \big]$, and the min–max is swapped to a max–min. Setting $\partial L / \partial \mathbf{w} = 0$ and $\partial L / \partial b = 0$ gives:

$$\mathbf{w} = \sum_{i=1}^{N} \alpha_i y_i \mathbf{x}_i, \qquad \sum_{i=1}^{N} \alpha_i y_i = 0$$

Substituting back eliminates $\mathbf{w}$ and $b$, leaving the dual:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j \, y_i y_j \, \mathbf{x}_i^\top \mathbf{x}_j$$

subject to $\alpha_i \ge 0$ for all $i$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$.
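
The same toy problem can be solved in dual form. A hedged cvxpy sketch (the quadratic term is rewritten as a sum of squares so the solver sees a standard concave objective); the $\alpha_i$ come out zero for every non-support vector:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
N = len(y)

# Dual: maximise sum_i alpha_i - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i.x_j
# subject to alpha_i >= 0 and sum_i alpha_i y_i = 0.
alpha = cp.Variable(N)
A = y[:, None] * X                               # rows are y_i * x_i
objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(A.T @ alpha))
problem = cp.Problem(objective, [alpha >= 0, y @ alpha == 0])
problem.solve()

print(alpha.value.round(4))                      # non-zero only at the support vectors
w = A.T @ alpha.value                            # w = sum_i alpha_i y_i x_i
print(w.round(3))
```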

Three structural consequences:

  1. The dual has $N$ unknowns (one $\alpha_i$ per training point), not $D + 1$. When $D \gg N$ — high-degree polynomial expansion or Gaussian kernel — the dual is dramatically smaller.
  2. Data appears only through inner products $\mathbf{x}_i^\top \mathbf{x}_j$. Replace each with a kernel function $k(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$ evaluated in the original space — that’s the kernel trick (see the sketch after this list).
  3. Sparsity from complementary slackness. The KKT condition $\alpha_i \big[ y_i(\mathbf{w}^\top \mathbf{x}_i + b) - 1 \big] = 0$ forces $\alpha_i = 0$ for every non-support vector. The dual sum collapses to a sum over support vectors only.
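
As an illustration of point 2, here is a Gaussian (RBF) kernel computed directly in the original 2-D space; the resulting Gram matrix plays the role of the matrix of inner products $\boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$ even though $\boldsymbol{\phi}$ is never formed (details in week 4). The helper name is ours.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2)), pairwise."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2 * sigma ** 2))

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])

# Gram matrix K[i, j] = k(x_i, x_j). Substituting K for the x_i.x_j terms in the
# dual gives a kernelised SVM without ever computing a feature expansion.
K = rbf_kernel(X, X)
print(K.round(3))
```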

Predictions in dual form:

$$\hat{y}(\mathbf{x}) = \operatorname{sign}\!\left( \sum_{i \in S} \alpha_i y_i \, \mathbf{x}_i^\top \mathbf{x} + b \right)$$

where $S$ is the support-vector index set, and $b$ is computed by averaging over support vectors:

$$b = \frac{1}{|S|} \sum_{j \in S} \left( y_j - \sum_{i \in S} \alpha_i y_i \, \mathbf{x}_i^\top \mathbf{x}_j \right)$$
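
Continuing the dual sketch above: recover $b$ by averaging over the support vectors and predict with the support-vector sum only. The $\alpha$ values below are the (rounded) solution the dual solve returns on that toy data; treat them, and the helper name, as illustrative.

```python
import numpy as np

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([1/9, 0.0, 1/9, 0.0])       # dual solution for this toy data (illustrative)

S = np.flatnonzero(alpha > 1e-6)             # support-vector index set
K = X @ X.T                                  # linear-kernel Gram matrix

# b = mean over j in S of ( y_j - sum_{i in S} alpha_i y_i x_i.x_j )
b = np.mean(y[S] - K[np.ix_(S, S)] @ (alpha[S] * y[S]))

def predict(X_new):
    scores = (alpha[S] * y[S]) @ (X[S] @ X_new.T) + b
    return np.sign(scores)

print(round(b, 3))                           # ~ -0.333 here
print(predict(X))                            # recovers the training labels
```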

SVM vs Logistic Regression

|  | Logistic regression | SVM |
| --- | --- | --- |
| Output | Calibrated probability $p(y \mid \mathbf{x})$ | Class label (no probability) |
| Loss | Cross-entropy (smooth, all examples contribute) | Hinge / margin-based (only support vectors contribute) |
| Decision boundary | Any separating hyperplane the optimiser converges to | Maximum-margin hyperplane (unique) |
| Convex? | Yes | Yes |
| Handles non-separable data? | Yes (still converges; weights blow up if perfectly separable) | Hard-margin SVM fails; soft-margin SVM works |
| Labels | $y \in \{0, 1\}$ | $y \in \{-1, +1\}$ |
| Support for $\boldsymbol{\phi}(\mathbf{x})$? | Yes (basis expansion) | Yes (and kernel trick in dual form) |

For probability calibration, choose logistic regression. For maximising margin and exploiting the kernel trick, choose SVM. Both are linear-in-$\boldsymbol{\phi}$ models, both are convex, and both extend non-linearly via $\boldsymbol{\phi}(\mathbf{x})$.
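
A hedged side-by-side in scikit-learn (assumed available), making the first row of the table concrete: logistic regression returns class probabilities, while the SVM returns only a score and a label. The toy data is the same illustrative set used throughout.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1, 1, -1, -1])
x_new = np.array([[0.5, 0.5]])

logreg = LogisticRegression().fit(X, y)
print(logreg.predict_proba(x_new))       # probability for each class
print(logreg.predict(x_new))

svm = SVC(kernel="linear", C=1e10).fit(X, y)
print(svm.decision_function(x_new))      # signed score w.x + b, not a probability
print(svm.predict(x_new))                # hard label in {-1, +1}
```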

Active Recall