TARGET DECK MachineLearning::Week-04
Lagrangian Duality
What is the Lagrangian of a constrained optimisation problem?
For minimising $f(w)$ subject to $g_i(w) \le 0$ for $i = 1, \dots, n$:
$$\mathcal{L}(w, \alpha) = f(w) + \sum_{i=1}^{n} \alpha_i g_i(w), \qquad \alpha_i \ge 0.$$
Each multiplier $\alpha_i$ folds an inequality constraint into the objective. When a constraint is violated ($g_i(w) > 0$), the term $\alpha_i g_i(w)$ inflates $\mathcal{L}$, pushing the optimiser away.
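A minimal worked example (illustrative, not from the card): minimise $f(w) = w^2$ subject to $g_1(w) = 1 - w \le 0$, i.e. $w \ge 1$. The Lagrangian is
$$\mathcal{L}(w, \alpha_1) = w^2 + \alpha_1 (1 - w), \qquad \alpha_1 \ge 0.$$
At an infeasible point such as $w = 0$ we have $g_1(0) = 1 > 0$, so increasing $\alpha_1$ inflates $\mathcal{L}$ without bound — exactly the penalty behaviour described above.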
What is the difference between the primal and dual problems?
- Primal: $\min_w \max_{\alpha \ge 0} \mathcal{L}(w, \alpha)$ — outer min, inner max.
- Dual: $\max_{\alpha \ge 0} \min_w \mathcal{L}(w, \alpha)$ — outer max, inner min.
Swapping the order makes the inner optimisation over $w$ unconstrained, so we can solve it in closed form, substitute back, and get a problem purely in $\alpha$.
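Continuing the illustrative toy problem above: the inner $\min_w$ of $w^2 + \alpha_1(1 - w)$ is unconstrained, so setting the derivative to zero gives $w = \alpha_1/2$. Substituting back yields the dual function
$$g(\alpha_1) = \frac{\alpha_1^2}{4} + \alpha_1\!\left(1 - \frac{\alpha_1}{2}\right) = \alpha_1 - \frac{\alpha_1^2}{4},$$
a problem purely in $\alpha_1$, maximised at $\alpha_1 = 2$ with value $1$ — matching the primal optimum $w = 1$, $f(w) = 1$.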
What are weak duality and strong duality?
- Weak duality (always holds): $\max_{\alpha \ge 0} \min_w \mathcal{L}(w, \alpha) \;\le\; \min_w \max_{\alpha \ge 0} \mathcal{L}(w, \alpha)$. The dual gives a lower bound on the primal.
- Strong duality (holds when the problem is convex and Slater’s condition is satisfied — both true for SVM): equality. Dual and primal have the same optimum.
For SVMs, strong duality means we can solve the dual instead of the primal without losing anything.
What are the four KKT conditions for a primal-dual optimum?
A pair $(w^*, \alpha^*)$ is jointly optimal iff:
- Stationarity: $\nabla_w \mathcal{L}(w^*, \alpha^*) = 0$
- Complementary slackness: $\alpha_i^* \, g_i(w^*) = 0$ for all $i$ — either the multiplier is zero or the constraint is active
- Primal feasibility: $g_i(w^*) \le 0$ for all $i$
- Dual feasibility: $\alpha_i^* \ge 0$ for all $i$
Complementary slackness is what produces SVM’s sparsity.
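As a sanity check on the illustrative toy problem above, the candidate $(w^*, \alpha_1^*) = (1, 2)$ satisfies all four conditions:
$$\nabla_w \mathcal{L} = 2w^* - \alpha_1^* = 0, \qquad \alpha_1^*\, g_1(w^*) = 2 \cdot 0 = 0, \qquad g_1(w^*) = 0 \le 0, \qquad \alpha_1^* = 2 \ge 0.$$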
SVM Dual
Write the SVM dual problem.
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
subject to $\alpha_i \ge 0$ for all $i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. Three things to notice: (1) the variables are now $\alpha_1, \dots, \alpha_n$ — one per training example, independent of the feature dimension $d$; (2) the data appears only through inner products $\langle x_i, x_j \rangle$; (3) the objective is concave in $\alpha$, so it is still a QP.
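A minimal numerical sketch (not from the card) of solving this dual directly; the toy data, the `scipy` SLSQP solver choice, and the rounding are assumptions made purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # Negative dual objective: minimising this maximises the dual.
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
print(np.round(res.x, 3))   # ~[0.25, 0, 0.25, 0] for this toy data
```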
How does the dual recover from the multipliers ?
Setting $\nabla_w \mathcal{L} = 0$ gives:
$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i.$$
The optimal weights are a linear combination of the training feature vectors, weighted by their multipliers and labels. By complementary slackness, only support vectors have $\alpha_i^* > 0$ — so $w^*$ depends only on the support vectors.
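In code the recovery is a single weighted sum; a sketch reusing the hypothetical toy data and multipliers from the dual sketch above:

```python
import numpy as np

# Toy data and solved multipliers from the dual sketch (illustration only).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

w = X.T @ (alpha * y)   # w* = sum_i alpha_i y_i x_i
print(w)                # [0.5 0.5]
```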
Why does KKT complementary slackness force SVM solutions to be sparse?
The condition $\alpha_i^* \, g_i(w^*) = 0$ requires, for each $i$, either $\alpha_i^* = 0$ or $g_i(w^*) = 0$, i.e. $y_i(\langle w^*, x_i \rangle + b^*) = 1$ (point on the margin). So:
- Points strictly outside the margin: $\alpha_i^* = 0$, contribute nothing.
- Points on the margin: $\alpha_i^* > 0$ — these are the support vectors.
Sparsity is a structural consequence of the KKT conditions, not a heuristic.
How do you compute SVM predictions in dual form?
$$\hat{y}(x) = \operatorname{sign}\!\left(\sum_{i \in S} \alpha_i^* y_i \langle x_i, x \rangle + b^*\right),$$
where $S$ is the index set of support vectors. The bias is recovered from any support vector ($b^* = y_i - \langle w^*, x_i \rangle$); average over all SVs for stability. Predictions depend only on inner products — the opening for the kernel trick.
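A self-contained sketch of dual-form prediction; the data and multipliers are the same hypothetical toy values as above, and only the formulas mirror the card:

```python
import numpy as np

# Hypothetical toy data and multipliers (illustration only).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

S = np.flatnonzero(alpha > 1e-8)        # support-vector indices (alpha_i > 0)
# b* = y_i - sum_{j in S} alpha_j y_j <x_j, x_i>, averaged over SVs for stability.
b = np.mean([y[i] - np.sum(alpha[S] * y[S] * (X[S] @ X[i])) for i in S])

def predict(x_new):
    # sign( sum_{i in S} alpha_i y_i <x_i, x_new> + b* )
    return np.sign(np.sum(alpha[S] * y[S] * (X[S] @ x_new)) + b)

print(predict(np.array([2.0, 1.5])), predict(np.array([-1.0, -2.0])))   # 1.0 -1.0
```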
Kernel Trick
What is the kernel trick?
Replace every inner product $\langle x_i, x_j \rangle$ in the SVM dual with a kernel function that returns the same value as an inner product of embedded points, computed entirely in the original input space:
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
Now $\phi$ is implicit — the algorithm needs only $k$, often computable in $O(d)$ even when $\dim \phi(x)$ is huge or infinite. This lets us use very high (even infinite) dimensional embeddings essentially for free.
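A sketch of how this changes the code: the kernelised Gram matrix plugs straight into the dual QP objective. The polynomial kernel and the toy arrays are illustrative assumptions:

```python
import numpy as np

def kernel(a, b, degree=2):
    # Polynomial kernel standing in for <phi(a), phi(b)> (assumed choice).
    return (a @ b) ** degree

X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Every <x_i, x_j> in the dual becomes k(x_i, x_j).
K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
G = (y[:, None] * y[None, :]) * K   # replaces y_i y_j <x_i, x_j> in the QP

print(K)
```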
What is Mercer's condition, and what does it guarantee?
A function $k(x, z)$ is a valid kernel — i.e., corresponds to an inner product in some Hilbert space — iff for every finite set of points $x_1, \dots, x_n$:
- Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
- Positive semidefiniteness: the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ satisfies $v^\top K v \ge 0$ for all $v \in \mathbb{R}^n$
Mercer’s theorem guarantees the existence of an embedding $\phi$ such that $k(x, z) = \langle \phi(x), \phi(z) \rangle$ without requiring you to construct $\phi$ explicitly. Algorithmically, PSD keeps the SVM dual a convex QP.
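A quick numerical sanity check of the two conditions on a random sample of points (a sketch; the sample size, RBF bandwidth, and tolerance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # 50 random 3-D points

# Gram matrix for a Gaussian kernel with assumed bandwidth sigma = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

symmetric = np.allclose(K, K.T)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)    # eigenvalues >= 0 up to tolerance
print(symmetric, psd)                            # expected: True True
```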
What is the formula for the polynomial kernel, and what does it implicitly compute?
$$k(x, z) = (\langle x, z \rangle + c)^p$$
Corresponds to an explicit embedding of all monomials of degree up to $p$ (with $c = 0$, monomials of degree exactly $p$). For $x, z \in \mathbb{R}^d$ with $p$ small, the implicit feature space has $\binom{d+p}{p} = O(d^p)$ dimensions — but the kernel is computed in $O(d)$ regardless. Massive speed-up.
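A small sketch contrasting the $O(d)$ kernel evaluation with the size of the implicit feature space (the vectors are placeholders for illustration):

```python
import numpy as np
from math import comb

d, p = 100, 3
x, z = np.ones(d), np.ones(d)

k_value = (x @ z + 1.0) ** p   # O(d) work: one inner product, one power
dim_phi = comb(d + p, p)       # number of monomials of degree <= p

print(k_value, dim_phi)        # dim_phi = 176851 for d=100, p=3
```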
What is the formula for the Gaussian (RBF) kernel, and why is its implicit feature space infinite-dimensional?
$$k(x, z) = \exp\!\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)$$
Expanding the exponential as a Taylor series produces all monomial degrees:
$$k(x, z) = e^{-\lVert x \rVert^2 / 2\sigma^2}\, e^{-\lVert z \rVert^2 / 2\sigma^2}\, \exp\!\left(\frac{\langle x, z \rangle}{\sigma^2}\right) = e^{-\lVert x \rVert^2 / 2\sigma^2}\, e^{-\lVert z \rVert^2 / 2\sigma^2} \sum_{k=0}^{\infty} \frac{1}{k!}\left(\frac{\langle x, z \rangle}{\sigma^2}\right)^k.$$
The implicit $\phi(x)$ contains monomials of every degree — infinite-dimensional. We could never form $\phi(x)$ explicitly, but $k(x, z)$ is computed in $O(d)$.
A polynomial-of-degree-2 kernel for 2D inputs has the explicit feature map $\phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)$. What is $\langle \phi(x), \phi(z) \rangle$ as a single closed-form expression?
Computing the inner product term by term:
$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2.$$
Six multiplications collapse into one inner product plus one square. This is exactly the polynomial kernel of degree 2 (with $c = 0$).
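A two-line verification of the identity (a sketch; the specific vectors are arbitrary):

```python
import numpy as np

phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(phi(x) @ phi(z), (x @ z) ** 2)   # both 1.0: (1*3 + 2*(-1))^2 = 1
```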
Given valid kernels $k_1$ and $k_2$, list four operations that produce another valid kernel.
Any of:
- $k_1(x, z) + k_2(x, z)$
- $c\, k_1(x, z)$ for $c > 0$
- $f(x)\, k_1(x, z)\, f(z)$ for any function $f$
- $p(k_1(x, z))$ for a polynomial $p$ with non-negative coefficients
These composition rules let you build complex kernels (Gaussian, polynomial-with-bias) without verifying Mercer’s condition from scratch.
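A sketch of the closure rules as kernel builders; the base kernels, bandwidth, and coefficients are illustrative choices, not part of the card:

```python
import numpy as np

# Two known-valid base kernels (illustrative choices).
def linear(x, z):
    return x @ z

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Closure rules: each builder returns another valid kernel.
def k_sum(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)

def k_scale(k1, c):          # c > 0
    return lambda x, z: c * k1(x, z)

def k_warp(k1, f):           # f(x) * k1(x, z) * f(z) for any function f
    return lambda x, z: f(x) * k1(x, z) * f(z)

def k_poly(k1, coeffs):      # polynomial of k1 with non-negative coefficients
    return lambda x, z: sum(a * k1(x, z) ** i for i, a in enumerate(coeffs))

# Example: (<x, z> + 1)^2, a polynomial-with-bias kernel built from the linear one.
poly2 = k_poly(linear, coeffs=[1.0, 2.0, 1.0])
mixed = k_sum(rbf, poly2)    # still a valid kernel by the sum rule

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly2(x, z), (x @ z + 1.0) ** 2)   # both 0.25
print(mixed(x, z))
```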
Comparison
For an SVM with $n$ training examples and a polynomial-of-degree-3 kernel for 100-dimensional inputs, how many variables does the primal vs the dual have?
- Primal: $\binom{100 + 3}{3} = 176{,}851$ — one per dimension of the implicit degree-3 polynomial feature space.
- Dual: $n$ — one per training example.
When $n \ll 176{,}851$, the dual is dramatically smaller. Plus, the dual depends only on inner products, so the kernel trick avoids ever forming $\phi(x)$ explicitly.
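The arithmetic as a sketch (the value of $n$ below is an assumption chosen only for illustration):

```python
from math import comb

d, p = 100, 3
n = 5_000                      # assumed number of training examples

primal_vars = comb(d + p, p)   # monomials of degree <= 3 over R^100
dual_vars = n

print(primal_vars, dual_vars)  # 176851 vs 5000
```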