THE CRUX: We left week 3 with an SVM that works — but it sits in a $\phi$-space that's potentially huge (or infinite-dimensional). How do we actually solve it without paying that cost? And once we know the answer, how does it change our view of "what kind of similarity matters"?

The fix is to dualise the SVM optimisation. The primal asks for a hyperplane in $\phi$-space; the dual reformulates this in terms of how training points relate to each other through inner products $\phi(x_i)^\top \phi(x_j)$. Once it’s all about inner products, we can replace them with a kernel function $k(x_i, x_j)$ that returns the same value computed entirely in the original space — never forming $\phi(x)$. That’s the kernel trick. It transforms expressive but expensive embeddings into cheap function calls, opens the door to infinite-dimensional feature spaces (Gaussian kernels), and lets us define similarity directly on objects without numeric features (strings, trees, graphs).


Week 3 set up the SVM primal:

$$\min_{w,\,b} \; \tfrac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\left(w^\top \phi(x_i) + b\right) \ge 1, \quad i = 1, \dots, n.$$

This is a quadratic program — convex, with a unique global optimum, solvable by off-the-shelf QP solvers. Job done, in principle. But two problems hide behind the formulation:

  1. The embedding may be enormous. With 100 inputs and a degree-3 polynomial expansion, $\phi(x)$ has hundreds of thousands of components ($\binom{103}{3} = 176{,}851$ monomials up to degree 3), and the QP has that many variables. With a Gaussian-style kernel, $\phi$ is infinite-dimensional and we can’t compute it at all.
  2. The optimisation operates on the wrong currency. The primal works with the parameters $(w, b)$, and $w$ lives in $\phi$-space. To say anything about a new test point $x$ we still need $\phi(x)$ — same cost.

This week’s payoff: a different formulation of the same problem that depends on data only through inner products $\phi(x_i)^\top \phi(x_j)$. Once we have that, we can replace each inner product with a kernel function evaluated in the original space, and never touch $\phi$ again.
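To make the blow-up concrete: a quick count of monomial features for a polynomial expansion (a minimal sketch; the number of monomials up to degree $d$ in $m$ variables is $\binom{m+d}{d}$):

```python
from math import comb

def poly_feature_count(m: int, d: int) -> int:
    """Number of monomials of degree <= d in m variables: C(m + d, d)."""
    return comb(m + d, d)

print(poly_feature_count(100, 3))   # 176851 features for a 100-dim input, degree 3
```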

Part 1: Lagrange Relaxation

The standard tool for handling inequality constraints is the Lagrangian. Write the problem as minimising $f(w)$ subject to $g_i(w) \le 0$. Instead of forbidding regions where $g_i(w) > 0$, we fold the constraints into the objective via non-negative multipliers $\alpha_i$:

$$\mathcal{L}(w, \alpha) = f(w) + \sum_i \alpha_i \, g_i(w)$$

When a constraint is violated ($g_i(w) > 0$), the term $\alpha_i g_i(w)$ inflates the objective — the optimiser is pushed away from infeasible $w$. When satisfied ($g_i(w) \le 0$), the term is non-positive and at the optimum it’s zero (the multiplier “switches off”).

But naive Lagrangian relaxation has a problem: with fixed $\alpha$, the penalty might be too small to enforce the constraint. The fix is to maximise over $\alpha$:

$$\min_w \, \max_{\alpha \ge 0} \, \mathcal{L}(w, \alpha)$$

If a constraint is violated, the inner max sends $\alpha_i \to \infty$, making $\mathcal{L}$ infinite — the outer min refuses to land there. If all constraints are satisfied, the inner max settles at $\alpha_i = 0$ (or wherever $g_i(w) = 0$), giving $\mathcal{L} = f(w)$. The minimax exactly reproduces the original constrained primal.
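A minimal numerical sanity check on a toy problem of my own choosing: minimise $f(x) = x^2$ subject to $x \ge 1$, whose constrained optimum is $x^* = 1$ with value 1. Evaluating $\mathcal{L}$ on a grid shows min-max and, previewing the next section, max-min agreeing:

```python
import numpy as np

# Toy problem: minimise f(x) = x^2 subject to g(x) = 1 - x <= 0.
# Lagrangian: L(x, a) = x^2 + a * (1 - x), with a >= 0.
xs = np.linspace(-2.0, 3.0, 2001)
alphas = np.linspace(0.0, 10.0, 2001)
L = xs[:, None]**2 + alphas[None, :] * (1.0 - xs[:, None])

primal = L.max(axis=1).min()   # min over x of max over a
dual   = L.min(axis=0).max()   # max over a of min over x
print(primal, dual)            # both ~1.0: the saddle point sits at x = 1, a = 2
```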

The Dual: Swapping Min and Max

The minimax requires solving a constrained max inside an unconstrained min — still awkward. The dual swaps the order:

$$\max_{\alpha \ge 0} \, \min_w \, \mathcal{L}(w, \alpha)$$

Now the inner min is unconstrained. We can attack it with $\nabla_w \mathcal{L} = 0$, which often produces a closed form for $w$ in terms of $\alpha$. Substituting back gives a problem purely in $\alpha$.

Two duality theorems govern when this is safe:

  • Weak duality (always): $\max_{\alpha \ge 0} \min_w \mathcal{L} \;\le\; \min_w \max_{\alpha \ge 0} \mathcal{L}$. The dual gives a lower bound on the primal.
  • Strong duality (when the problem is convex and Slater’s condition holds — both true for the SVM): the inequality is an equality. Dual and primal have the same optimum.

When strong duality holds, the optimum is a saddle point of $\mathcal{L}$: a min along the $w$ direction, a max along the $\alpha$ direction. Walking down-then-up or up-then-down both arrive at the same point.

KKT: When Are We Optimal?

Stationarity ($\nabla f = 0$) is necessary and sufficient for unconstrained convex optimisation. With inequalities, we need more. The KKT conditions state that a primal-dual pair $(w^*, \alpha^*)$ is jointly optimal iff:

  1. Stationarity. $\nabla_w \mathcal{L}(w^*, \alpha^*) = 0$.
  2. Complementary slackness. $\alpha_i^* \, g_i(w^*) = 0$ for all $i$: either the multiplier is zero or the constraint is active.
  3. Primal feasibility. $g_i(w^*) \le 0$ for all $i$.
  4. Dual feasibility. $\alpha_i^* \ge 0$ for all $i$.

Complementary slackness is the structural reason SVMs are sparse — and we’ll see why shortly.
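Continuing the toy problem above (again my example, not part of the original derivation): the saddle point is $(x^*, \alpha^*) = (1, 2)$ and all four conditions check out:

```python
# KKT check for f(x) = x^2, g(x) = 1 - x at the saddle point (x*, a*) = (1, 2).
x_star, a_star = 1.0, 2.0
print(2 * x_star - a_star)        # stationarity: dL/dx = 2x - a = 0
print(a_star * (1.0 - x_star))    # complementary slackness: a * g(x) = 0
print(1.0 - x_star <= 0.0)        # primal feasibility: g(x) <= 0
print(a_star >= 0.0)              # dual feasibility: a >= 0
```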

Part 2: SVM in Dual Form

Apply this machinery to the SVM primal. First, rewrite the constraint as $1 - y_i(w^\top \phi(x_i) + b) \le 0$. The Lagrangian:

$$\mathcal{L}(w, b, \alpha) = \tfrac{1}{2}\|w\|^2 + \sum_{i=1}^n \alpha_i \left(1 - y_i\left(w^\top \phi(x_i) + b\right)\right)$$

The minimax is:

$$\max_{\alpha \ge 0} \, \min_{w,\,b} \, \mathcal{L}(w, b, \alpha)$$

Take the inner min by setting partials to zero:

$$\frac{\partial \mathcal{L}}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^n \alpha_i y_i \phi(x_i), \qquad \frac{\partial \mathcal{L}}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^n \alpha_i y_i = 0.$$

Substituting $w = \sum_i \alpha_i y_i \phi(x_i)$ back into $\mathcal{L}$ and using the second relation, $w$ and $b$ both vanish from the optimisation, leaving:

$$\max_\alpha \; \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j y_i y_j \, \phi(x_i)^\top \phi(x_j)$$

subject to $\alpha_i \ge 0$ and $\sum_{i=1}^n \alpha_i y_i = 0$.

This is the SVM dual problem. Three things to notice:

  1. The unknowns are now $\alpha \in \mathbb{R}^n$, not $(w, b)$. The dual has $n$ variables (one per training example), independent of $\dim \phi(x)$. When $\dim \phi(x) \gg n$, the dual is much smaller.
  2. The data appears only through inner products $\phi(x_i)^\top \phi(x_j)$. This is the opening for the kernel trick.
  3. The objective is concave in $\alpha$ — still a QP, still solvable.
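A minimal sketch of solving this dual numerically, assuming a hand-made four-point dataset and a linear kernel; a general-purpose solver stands in for a dedicated QP routine:

```python
import numpy as np
from scipy.optimize import minimize

# Tiny linearly separable dataset; labels in {-1, +1}.
X = np.array([[2.0, 2.0], [2.5, 1.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T                                  # Gram matrix (linear kernel)

def neg_dual(a):                             # scipy minimises, so negate the dual
    return -(a.sum() - 0.5 * (a * y) @ K @ (a * y))

res = minimize(neg_dual, np.zeros(len(y)),
               bounds=[(0.0, None)] * len(y),                        # a_i >= 0
               constraints={"type": "eq", "fun": lambda a: a @ y})   # sum_i a_i y_i = 0
alpha = res.x
print(np.round(alpha, 3))   # near-zero entries: non-support-vectors drop out
```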

Sparsity from Complementary Slackness

KKT’s complementary slackness says, for each training point:

$$\alpha_i \left(1 - y_i\left(w^\top \phi(x_i) + b\right)\right) = 0$$

Either:

  • $\alpha_i = 0$: the point sits strictly outside the margin envelope ($y_i(w^\top \phi(x_i) + b) > 1$). It contributes nothing to the dual sum — and nothing to $w$.
  • $\alpha_i > 0$: the point is on the margin — a support vector. These have $y_i(w^\top \phi(x_i) + b) = 1$.

So at the optimum, only support vectors have non-zero multipliers. Predictions only need the support vectors — every other training point can be deleted. Sparsity is structural, not heuristic.

Predictions in the Dual

Substituting $w = \sum_i \alpha_i y_i \phi(x_i)$ into the decision function $f(x) = w^\top \phi(x) + b$:

$$f(x) = \sum_{i \in S} \alpha_i y_i \, \phi(x_i)^\top \phi(x) + b$$

where $S$ is the set of support-vector indices. For $b$, use the fact that any support vector $x_s$ satisfies $y_s f(x_s) = 1$, solve for $b$, and average over all support vectors for numerical stability:

$$b = \frac{1}{|S|} \sum_{s \in S} \left( y_s - \sum_{i \in S} \alpha_i y_i \, \phi(x_i)^\top \phi(x_s) \right)$$

Predict by checking the sign of $f(x)$.

Notice: everything depends on $\phi$ only through pairwise inner products.
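Continuing the numerical sketch above (same toy variables), recovering $b$ and predicting a new point takes a few lines:

```python
# Continuing the dual-solve sketch: recover b from the support vectors, then predict.
S = alpha > 1e-6                                  # support-vector mask
b = np.mean(y[S] - ((alpha * y) @ K)[S])          # average y_s - sum_i a_i y_i k(x_i, x_s)
f = lambda x_new: (alpha * y) @ (X @ x_new) + b   # f(x) = sum_i a_i y_i k(x_i, x) + b
print(np.sign(f(np.array([1.5, 1.5]))))           # expected: +1 (positive side)
```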

Part 3: The Kernel Trick

Define a kernel function $k(x, x') = \phi(x)^\top \phi(x')$. Replace every inner product in the dual with $k$:

$$\max_\alpha \; \sum_{i=1}^n \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_i \alpha_i y_i = 0$$

Now $\phi$ is implicit. The algorithm needs only $k(x_i, x_j)$, which we can often evaluate in time linear in the original input dimension — independent of $\dim \phi(x)$.

The Worked Example

Let $\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$ — a 6-dimensional polynomial-of-degree-2 embedding for 2D inputs. Computing the inner product:

$$\phi(x)^\top \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 z_1 + 2 x_2 z_2 + 1 = \left(x^\top z + 1\right)^2$$

Six multiplications in feature space collapse into one inner product plus one square in the original space. Generalising:

$$k(x, z) = \left(x^\top z + 1\right)^d$$

is the [[polynomial-kernel|polynomial kernel of degree $d$]] — it corresponds to the embedding of all monomials up to degree $d$. A 100-dimensional input with $d = 3$ implies $\binom{103}{3} = 176{,}851$ feature dimensions, but the kernel computes in about 100 multiplications (one inner product) plus a cube.
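A quick numeric check of this identity (a sketch; the explicit embedding is only buildable because $d = 2$ and the input is 2D):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 embedding of a 2D input (6 dimensions)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2, np.sqrt(2) * x1, np.sqrt(2) * x2, 1.0])

rng = np.random.default_rng(0)
x, z = rng.normal(size=2), rng.normal(size=2)
print(phi(x) @ phi(z))        # inner product in feature space
print((x @ z + 1.0) ** 2)     # kernel in the original space -- same number
```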

Mercer’s Condition: Which Functions Are Valid Kernels?

We’ve been assuming there’s a $\phi$ such that $k(x, x') = \phi(x)^\top \phi(x')$. Not every “similarity-shaped” function admits such a $\phi$. The criterion is Mercer’s condition:

For any finite set of points $x_1, \dots, x_n$, the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ must be:

  1. Symmetric: $K = K^\top$.
  2. Positive semidefinite: $c^\top K c \ge 0$ for all $c \in \mathbb{R}^n$.

If both hold, $k$ corresponds to an inner product in some Hilbert space — Mercer’s theorem guarantees the existence of $\phi$ without requiring us to compute it. Algorithmically, positive semidefiniteness keeps the SVM dual a convex QP; without it, the optimisation can wander to nonsense.
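Mercer’s condition can’t be verified exhaustively by computation, but any finite sample gives a necessary check. A sketch, with a deliberately invalid “kernel” for contrast (the negative squared distance yields a trace-zero Gram matrix, which must have a negative eigenvalue):

```python
import numpy as np

def looks_mercer(k, X, tol=1e-9):
    """Necessary (not sufficient) check: Gram matrix on a sample is symmetric PSD."""
    K = np.array([[k(a, b) for b in X] for a in X])
    symmetric = np.allclose(K, K.T)
    psd = np.linalg.eigvalsh((K + K.T) / 2).min() >= -tol
    return symmetric and psd

X = np.random.default_rng(1).normal(size=(20, 3))
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))   # valid kernel
neg_sq = lambda a, b: -np.sum((a - b) ** 2)        # not a valid kernel
print(looks_mercer(rbf, X), looks_mercer(neg_sq, X))   # True False
```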

Building Kernels Without Verifying Mercer

Verifying Mercer’s condition from scratch is hard. Instead, build kernels by composition. Given valid kernels $k_1, k_2$ and a constant $c > 0$, the following are also valid:

  • $k_1 + k_2$ (sums)
  • $c\,k_1$ (positive scaling)
  • $k_1 \cdot k_2$ (products)
  • $p(k_1)$ for any polynomial $p$ with non-negative coefficients
  • $\exp(k_1)$
  • $f(x)\,k_1(x, x')\,f(x')$ for any function $f$

The polynomial kernel is built this way: start from $k(x, x') = x^\top x'$ (valid, since it is a literal inner product), apply $k \mapsto (k + 1)^d$ (a polynomial with non-negative coefficients), done.

The Gaussian (RBF) Kernel

$$k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right)$$

The implicit $\phi$ is infinite-dimensional — the exponential’s Taylor series contains monomials of all degrees. We could never compute $\phi$ explicitly, but the kernel trick sidesteps that: evaluating $k(x, x')$ is one subtraction, one squared norm, one exponential.

The validity proof chains the composition rules from $k_1(x, x') = x^\top x'$:

  1. $(x^\top x' / \sigma^2)^m$ valid for every $m$ (polynomial of $x^\top x'$ with non-negative coefficients).
  2. $\exp(x^\top x' / \sigma^2)$ valid (a limit of sums of valid kernels, or directly the exponential rule).
  3. Multiply by $f(x)\,f(x')$ with $f(x) = \exp(-\|x\|^2 / 2\sigma^2)$ → exactly the Gaussian kernel.

The Gaussian is the default non-linear kernel: maximally expressive, no strong structural assumptions. It pairs naturally with SVM’s margin-maximisation — even with infinite-dimensional features, maximising margin keeps the boundary regularised.
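A short demonstration with sklearn’s SVC, assuming a synthetic concentric-circles dataset of my own making; a large C approximates the hard margin studied this week:

```python
import numpy as np
from sklearn.svm import SVC

# Two concentric rings: inner ring is class 0, outer ring is class 1.
rng = np.random.default_rng(0)
radius = np.r_[rng.uniform(0, 1, 100), rng.uniform(2, 3, 100)]
angle = rng.uniform(0, 2 * np.pi, 200)
X = np.c_[radius * np.cos(angle), radius * np.sin(angle)]
y = np.r_[np.zeros(100), np.ones(100)]

clf = SVC(kernel="rbf", gamma=1.0, C=1e6).fit(X, y)   # large C ~ hard margin
print(clf.score(X, y))       # separable by a radial boundary the RBF kernel can draw
print(clf.support_.size)     # only a fraction of the 200 points are support vectors
```

No linear kernel could separate these rings in the raw 2D space; the RBF kernel’s implicit embedding makes the radial boundary a “hyperplane”.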

ASIDE — Kernels for non-numeric data

Once we accept that $\phi$ doesn’t have to be constructed explicitly, kernels can be defined directly on objects without numeric features. The all-subsequence kernel for strings counts how many subsequences two strings share (computable in $O(|s|\,|t|)$ via dynamic programming). The all-subtree kernel for parse trees counts shared subtrees. The set kernel counts shared subsets. Each of these implies an embedding into a high-dimensional space — but we never realise it, just evaluate the kernel.
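A minimal sketch of the counting dynamic programme (my implementation of the unweighted variant; practical string kernels add a decay weight per gap):

```python
def all_subsequence_kernel(s: str, t: str) -> int:
    """Count matching pairs of (not necessarily contiguous) subsequences of s and t.
    O(|s| * |t|) dynamic programme; includes the empty subsequence."""
    m, n = len(s), len(t)
    C = [[1] * (n + 1) for _ in range(m + 1)]   # row/column 0: only the empty match
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Pairs not using both s[i-1] and t[j-1], plus new pairs when they agree.
            C[i][j] = C[i - 1][j] + C[i][j - 1] - C[i - 1][j - 1]
            if s[i - 1] == t[j - 1]:
                C[i][j] += C[i - 1][j - 1]
    return C[m][n]

print(all_subsequence_kernel("gatta", "cata"))   # similarity without numeric features
```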

Part 4: When to Pick Which Kernel

| Kernel | When | Caveat |
|---|---|---|
| Linear $x^\top x'$ | High-dim input, linearly separable | Can’t capture non-linearity |
| Polynomial $(x^\top x' + 1)^d$ | Polynomial structure suspected (vision) | Sensitive to $d$; high degree overfits |
| Gaussian / RBF | Default for non-linear; no strong prior | Needs careful $\sigma$; sensitive to feature scale |
| Custom (string, tree, graph) | Domain-specific structure | Must verify Mercer or compose validly |

The sklearn SVC default is RBF for good reason — it’s the kernel that makes the fewest assumptions about the data.

What Could Go Wrong

  • Overfitting. The Gaussian kernel maps to an infinite-dimensional space; without margin maximisation it would memorise the training data. Hard-margin SVM with the wrong $\sigma$ still overfits. Soft-margin SVM (next week) introduces slack to mitigate this.
  • Non-Mercer kernels. A function that “looks like similarity” but fails positive semidefiniteness yields a non-convex dual, with no guarantee of finding the optimum and no margin interpretation. Stick to validated building blocks.
  • Feature scaling. Both polynomial and Gaussian kernels depend on $x^\top x'$ or $\|x - x'\|$ — the largest-scale feature will dominate. Standardise inputs first (see the sketch below).
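A sketch of the scaling failure mode on synthetic data of my own construction, where the signal lives in a small-scale feature and the noise in a large-scale one:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.c_[rng.normal(size=200), 1000.0 * rng.normal(size=200)]
y = (X[:, 0] > 0).astype(int)     # the label depends only on the small feature

raw = SVC(kernel="rbf").fit(X, y).score(X, y)
scaled = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y).score(X, y)
print(raw, scaled)   # the unscaled fit is dominated by the large-scale noise feature
```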

Concepts Introduced This Week

  • lagrangian — the Lagrangian function and Lagrange duality; how minimax/maxmin let us swap a constrained primal for an unconstrained-inner dual.
  • kkt-conditions — Karush–Kuhn–Tucker optimality conditions for convex constrained optimisation; complementary slackness is what produces SVM’s sparsity.
  • kernel-trick — replace inner products with a kernel function computed in the original space.
  • mercers-condition — symmetric + positive-semidefinite Gram matrix tests whether a candidate function is a valid kernel.
  • gaussian-kernel — Radial Basis Function kernel; infinite-dimensional embedding; default choice for non-linear SVM.
  • polynomial-kernel — finite-dimensional embedding of all monomials up to degree $d$.

Connections

  • Builds on week-03: takes the SVM primal (a QP in $\phi$-space) and dualises it; the support-vector intuition becomes a structural KKT consequence rather than a mere observation.
  • Builds on non-linear-transformation: kernels generalise basis expansion — same boundary shape, no need to ever materialise $\phi(x)$.
  • Sets up later weeks: soft-margin SVM (relaxes the hard-margin constraint to handle non-separable data); regularisation and bias-variance (overfitting risk for high-capacity kernels); generalisation theory (why margin maximisation regularises infinite-dimensional embeddings).

Open Questions

  • What if the data isn’t separable even in the kernel-induced space? (Soft-margin SVMs with slack variables — next week.)
  • How do we choose the kernel and its hyperparameters in practice? (Cross-validation; later weeks formalise this as model selection.)
  • Why does maximising margin still give a well-behaved boundary even when the implicit $\phi$ is infinite-dimensional? (VC dimension / generalisation bounds, in weeks 8–10.)