TARGET DECK MachineLearning::Week-04
Lagrangian Duality
What is the Lagrangian of a constrained optimisation problem?
For minimising $f(w)$ subject to $g_i(w) \le 0$ for $i = 1, \dots, n$:
$$\mathcal{L}(w, \alpha) = f(w) + \sum_{i=1}^{n} \alpha_i g_i(w), \qquad \alpha_i \ge 0.$$
Each multiplier $\alpha_i$ folds an inequality constraint into the objective. When a constraint is violated ($g_i(w) > 0$), the term $\alpha_i g_i(w)$ inflates $\mathcal{L}$, pushing the optimiser away.
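A minimal worked example (illustrative, not from the card): minimise $f(w) = w^2$ subject to $g_1(w) = 1 - w \le 0$, i.e. $w \ge 1$. The Lagrangian is
$$\mathcal{L}(w, \alpha_1) = w^2 + \alpha_1 (1 - w), \qquad \alpha_1 \ge 0.$$
At an infeasible point such as $w = 0$ we have $g_1(0) = 1 > 0$, so increasing $\alpha_1$ inflates $\mathcal{L}$ without bound — exactly the penalty behaviour described above.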
What is the difference between the primal and dual problems?
- Primal: $\min_w \max_{\alpha \ge 0} \mathcal{L}(w, \alpha)$ — outer min, inner max.
- Dual: $\max_{\alpha \ge 0} \min_w \mathcal{L}(w, \alpha)$ — outer max, inner min.
Swapping the order makes the inner optimisation over $w$ unconstrained, so we can solve it in closed form, substitute back, and get a problem purely in $\alpha$.
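Continuing the illustrative toy problem above: the inner $\min_w$ of $w^2 + \alpha_1(1 - w)$ is unconstrained, so setting the derivative to zero gives $w = \alpha_1/2$. Substituting back yields the dual function
$$g(\alpha_1) = \frac{\alpha_1^2}{4} + \alpha_1\!\left(1 - \frac{\alpha_1}{2}\right) = \alpha_1 - \frac{\alpha_1^2}{4},$$
a problem purely in $\alpha_1$, maximised at $\alpha_1 = 2$ with value $1$ — matching the primal optimum $w = 1$, $f(w) = 1$.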
What are weak duality and strong duality?
- Weak duality (always holds): $\max_{\alpha \ge 0} \min_w \mathcal{L}(w, \alpha) \;\le\; \min_w \max_{\alpha \ge 0} \mathcal{L}(w, \alpha)$. The dual gives a lower bound on the primal.
- Strong duality (holds when the problem is convex and Slater’s condition is satisfied — both true for SVM): equality. Dual and primal have the same optimum.
For SVMs, strong duality means we can solve the dual instead of the primal without losing anything.
What are the four KKT conditions for a primal-dual optimum?
A pair $(w^*, \alpha^*)$ is jointly optimal iff:
- Stationarity: $\nabla_w \mathcal{L}(w^*, \alpha^*) = 0$
- Complementary slackness: $\alpha_i^* \, g_i(w^*) = 0$ for all $i$ — either the multiplier is zero or the constraint is active
- Primal feasibility: $g_i(w^*) \le 0$ for all $i$
- Dual feasibility: $\alpha_i^* \ge 0$ for all $i$
Complementary slackness is what produces SVM’s sparsity.
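As a sanity check on the illustrative toy problem above, the candidate $(w^*, \alpha_1^*) = (1, 2)$ satisfies all four conditions:
$$\nabla_w \mathcal{L} = 2w^* - \alpha_1^* = 0, \qquad \alpha_1^*\, g_1(w^*) = 2 \cdot 0 = 0, \qquad g_1(w^*) = 0 \le 0, \qquad \alpha_1^* = 2 \ge 0.$$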
SVM Dual
Write the SVM dual problem.
$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \langle x_i, x_j \rangle$$
subject to $\alpha_i \ge 0$ for all $i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$. Three things to notice: (1) the variables are now $\alpha_1, \dots, \alpha_n$ — one per training example, independent of the feature dimension $d$; (2) the data appears only through inner products $\langle x_i, x_j \rangle$; (3) the objective is concave in $\alpha$, so it is still a QP.
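A minimal numerical sketch (not from the card) of solving this dual directly; the toy data, the `scipy` SLSQP solver choice, and the rounding are assumptions made purely for illustration:

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data (assumed for illustration).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

G = (y[:, None] * y[None, :]) * (X @ X.T)   # G_ij = y_i y_j <x_i, x_j>

def neg_dual(alpha):
    # Negative dual objective: minimising this maximises the dual.
    return -(alpha.sum() - 0.5 * alpha @ G @ alpha)

res = minimize(
    neg_dual,
    x0=np.zeros(n),
    method="SLSQP",
    bounds=[(0.0, None)] * n,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
print(np.round(res.x, 3))   # ~[0.25, 0, 0.25, 0] for this toy data
```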
How does the dual recover from the multipliers ?
Setting $\nabla_w \mathcal{L} = 0$ gives:
$$w^* = \sum_{i=1}^{n} \alpha_i^* y_i x_i.$$
The optimal weights are a linear combination of the training feature vectors, weighted by their multipliers and labels. By complementary slackness, only support vectors have $\alpha_i^* > 0$ — so $w^*$ depends only on the support vectors.
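In code the recovery is a single weighted sum; a sketch reusing the hypothetical toy data and multipliers from the dual sketch above:

```python
import numpy as np

# Toy data and solved multipliers from the dual sketch (illustration only).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

w = X.T @ (alpha * y)   # w* = sum_i alpha_i y_i x_i
print(w)                # [0.5 0.5]
```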
Why does KKT complementary slackness force SVM solutions to be sparse?
The condition $\alpha_i^* \, g_i(w^*) = 0$ requires, for each $i$, either $\alpha_i^* = 0$ or $g_i(w^*) = 0$, i.e. $y_i(\langle w^*, x_i \rangle + b^*) = 1$ (point on the margin). So:
- Points strictly outside the margin: $\alpha_i^* = 0$, contribute nothing.
- Points on the margin: $\alpha_i^* > 0$ — these are the support vectors.
Sparsity is a structural consequence of the KKT conditions, not a heuristic.
How do you compute SVM predictions in dual form?
$$\hat{y}(x) = \operatorname{sign}\!\left(\sum_{i \in S} \alpha_i^* y_i \langle x_i, x \rangle + b^*\right),$$
where $S$ is the index set of support vectors. The bias is recovered from any support vector ($b^* = y_i - \langle w^*, x_i \rangle$); average over all SVs for stability. Predictions depend only on inner products — the opening for the kernel trick.
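A self-contained sketch of dual-form prediction; the data and multipliers are the same hypothetical toy values as above, and only the formulas mirror the card:

```python
import numpy as np

# Hypothetical toy data and multipliers (illustration only).
X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
alpha = np.array([0.25, 0.0, 0.25, 0.0])

S = np.flatnonzero(alpha > 1e-8)        # support-vector indices (alpha_i > 0)
# b* = y_i - sum_{j in S} alpha_j y_j <x_j, x_i>, averaged over SVs for stability.
b = np.mean([y[i] - np.sum(alpha[S] * y[S] * (X[S] @ X[i])) for i in S])

def predict(x_new):
    # sign( sum_{i in S} alpha_i y_i <x_i, x_new> + b* )
    return np.sign(np.sum(alpha[S] * y[S] * (X[S] @ x_new)) + b)

print(predict(np.array([2.0, 1.5])), predict(np.array([-1.0, -2.0])))   # 1.0 -1.0
```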
Kernel Trick
What is the kernel trick?
Replace every inner product $\langle x_i, x_j \rangle$ in the SVM dual with a kernel function that returns the same value as an inner product of embedded points, computed entirely in the original input space:
$$k(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle.$$
Now $\phi$ is implicit — the algorithm needs only $k$, often computable in $O(d)$ even when $\dim \phi(x)$ is huge or infinite. This lets us use very high (even infinite) dimensional embeddings essentially for free.
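A sketch of how this changes the code: the kernelised Gram matrix plugs straight into the dual QP objective. The polynomial kernel and the toy arrays are illustrative assumptions:

```python
import numpy as np

def kernel(a, b, degree=2):
    # Polynomial kernel standing in for <phi(a), phi(b)> (assumed choice).
    return (a @ b) ** degree

X = np.array([[1.0, 1.0], [3.0, 2.0], [-1.0, -1.0], [-2.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Every <x_i, x_j> in the dual becomes k(x_i, x_j).
K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
G = (y[:, None] * y[None, :]) * K   # replaces y_i y_j <x_i, x_j> in the QP

print(K)
```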
What is Mercer's condition, and what does it guarantee?
A function $k(x, z)$ is a valid kernel — i.e., corresponds to an inner product in some Hilbert space — iff for every finite set of points $x_1, \dots, x_n$:
- Symmetry: $k(x_i, x_j) = k(x_j, x_i)$
- Positive semidefiniteness: the Gram matrix $K$ with $K_{ij} = k(x_i, x_j)$ satisfies $v^\top K v \ge 0$ for all $v \in \mathbb{R}^n$
Mercer’s theorem guarantees the existence of an embedding $\phi$ such that $k(x, z) = \langle \phi(x), \phi(z) \rangle$ without requiring you to construct $\phi$ explicitly. Algorithmically, PSD keeps the SVM dual a convex QP.
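A quick numerical sanity check of the two conditions on a random sample of points (a sketch; the sample size, RBF bandwidth, and tolerance are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                     # 50 random 3-D points

# Gram matrix for a Gaussian kernel with assumed bandwidth sigma = 1.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)

symmetric = np.allclose(K, K.T)
psd = np.all(np.linalg.eigvalsh(K) >= -1e-10)    # eigenvalues >= 0 up to tolerance
print(symmetric, psd)                            # expected: True True
```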
What is the formula for the polynomial kernel, and what does it implicitly compute?
$$k(x, z) = (\langle x, z \rangle + c)^p$$
Corresponds to an explicit embedding of all monomials of degree up to $p$ (with $c = 0$, monomials of degree exactly $p$). For $x, z \in \mathbb{R}^d$ with $p$ small, the implicit feature space has $\binom{d+p}{p} = O(d^p)$ dimensions — but the kernel is computed in $O(d)$ regardless. Massive speed-up.
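A small sketch contrasting the $O(d)$ kernel evaluation with the size of the implicit feature space (the vectors are placeholders for illustration):

```python
import numpy as np
from math import comb

d, p = 100, 3
x, z = np.ones(d), np.ones(d)

k_value = (x @ z + 1.0) ** p   # O(d) work: one inner product, one power
dim_phi = comb(d + p, p)       # number of monomials of degree <= p

print(k_value, dim_phi)        # dim_phi = 176851 for d=100, p=3
```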
What is the formula for the Gaussian (RBF) kernel, and why is its implicit feature space infinite-dimensional?
$$k(x, z) = \exp\!\left(-\frac{\lVert x - z \rVert^2}{2\sigma^2}\right)$$
Expanding the exponential as a Taylor series produces all monomial degrees:
$$k(x, z) = e^{-\lVert x \rVert^2 / 2\sigma^2}\, e^{-\lVert z \rVert^2 / 2\sigma^2}\, \exp\!\left(\frac{\langle x, z \rangle}{\sigma^2}\right) = e^{-\lVert x \rVert^2 / 2\sigma^2}\, e^{-\lVert z \rVert^2 / 2\sigma^2} \sum_{k=0}^{\infty} \frac{1}{k!}\left(\frac{\langle x, z \rangle}{\sigma^2}\right)^k.$$
The implicit $\phi(x)$ contains monomials of every degree — infinite-dimensional. We could never form $\phi(x)$ explicitly, but $k(x, z)$ is computed in $O(d)$.
A polynomial-of-degree-2 kernel for 2D inputs has the explicit feature map $\phi(x) = (x_1^2, \; x_2^2, \; \sqrt{2}\, x_1 x_2)$. What is $\langle \phi(x), \phi(z) \rangle$ as a single closed-form expression?
Computing the inner product term by term:
$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\, x_1 x_2 z_1 z_2 = (x_1 z_1 + x_2 z_2)^2 = \langle x, z \rangle^2.$$
Six multiplications collapse into one inner product plus one square. This is exactly the polynomial kernel of degree 2 (with $c = 0$).
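A two-line verification of the identity (a sketch; the specific vectors are arbitrary):

```python
import numpy as np

phi = lambda v: np.array([v[0] ** 2, v[1] ** 2, np.sqrt(2) * v[0] * v[1]])
x, z = np.array([1.0, 2.0]), np.array([3.0, -1.0])

print(phi(x) @ phi(z), (x @ z) ** 2)   # both 1.0: (1*3 + 2*(-1))^2 = 1
```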
Given valid kernels $k_1$ and $k_2$, list four operations that produce another valid kernel.
Any of:
- $k_1(x, z) + k_2(x, z)$
- $c\, k_1(x, z)$ for $c > 0$
- $f(x)\, k_1(x, z)\, f(z)$ for any function $f$
- $p(k_1(x, z))$ for a polynomial $p$ with non-negative coefficients
These composition rules let you build complex kernels (Gaussian, polynomial-with-bias) without verifying Mercer’s condition from scratch.
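A sketch of the closure rules as kernel builders; the base kernels, bandwidth, and coefficients are illustrative choices, not part of the card:

```python
import numpy as np

# Two known-valid base kernels (illustrative choices).
def linear(x, z):
    return x @ z

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

# Closure rules: each builder returns another valid kernel.
def k_sum(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)

def k_scale(k1, c):          # c > 0
    return lambda x, z: c * k1(x, z)

def k_warp(k1, f):           # f(x) * k1(x, z) * f(z) for any function f
    return lambda x, z: f(x) * k1(x, z) * f(z)

def k_poly(k1, coeffs):      # polynomial of k1 with non-negative coefficients
    return lambda x, z: sum(a * k1(x, z) ** i for i, a in enumerate(coeffs))

# Example: (<x, z> + 1)^2, a polynomial-with-bias kernel built from the linear one.
poly2 = k_poly(linear, coeffs=[1.0, 2.0, 1.0])
mixed = k_sum(rbf, poly2)    # still a valid kernel by the sum rule

x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(poly2(x, z), (x @ z + 1.0) ** 2)   # both 0.25
print(mixed(x, z))
```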
Comparison
For an SVM with $n$ training examples and a polynomial-of-degree-3 kernel for 100-dimensional inputs, how many variables does the primal vs the dual have?
- Primal: $\binom{100 + 3}{3} = 176{,}851$ — one per dimension of the implicit degree-3 polynomial feature space.
- Dual: $n$ — one per training example.
When $n \ll 176{,}851$, the dual is dramatically smaller. Plus, the dual depends only on inner products, so the kernel trick avoids ever forming $\phi(x)$ explicitly.
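The arithmetic as a sketch (the value of $n$ below is an assumption chosen only for illustration):

```python
from math import comb

d, p = 100, 3
n = 5_000                      # assumed number of training examples

primal_vars = comb(d + p, p)   # monomials of degree <= 3 over R^100
dual_vars = n

print(primal_vars, dual_vars)  # 176851 vs 5000
```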