A decomposition method for the soft-margin SVM dual: instead of solving the full $N$-variable QP at once, repeatedly pick a pair of Lagrange multipliers $(a^{(i)}, a^{(j)})$, solve the two-variable subproblem analytically, and iterate until the KKT conditions are met within tolerance. Sidesteps the memory and time cost of an off-the-shelf QP solver.
Why Decomposition
The soft-margin SVM dual

$$\max_{a} \; \sum_{n=1}^{N} a^{(n)} - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a^{(n)} a^{(m)} y^{(n)} y^{(m)} \, k\bigl(x^{(n)}, x^{(m)}\bigr)$$

subject to $0 \le a^{(n)} \le C$ and $\sum_n a^{(n)} y^{(n)} = 0$ is a convex QP with $N$ variables. Generic QP solvers store the Gram matrix ($N \times N$, often dense) and run in $O(N^3)$ time, infeasible past tens of thousands of examples.
SMO observes that we can restrict attention to a small subset of variables at a time, fix the rest, and solve the resulting subproblem cheaply. With exactly two variables, the subproblem has a closed-form solution.
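To make the objects concrete, here is a minimal numpy sketch of the dual objective and the feasibility test (the function names are mine, purely illustrative):

```python
import numpy as np

def dual_objective(a, y, K):
    """W(a) = sum_n a_n - (1/2) sum_{n,m} a_n a_m y_n y_m k(x_n, x_m)."""
    v = a * y
    return a.sum() - 0.5 * v @ K @ v

def is_feasible(a, y, C, atol=1e-9):
    """Box constraint 0 <= a_n <= C and equality constraint sum_n a_n y_n = 0."""
    return bool(np.all(a >= -atol) and np.all(a <= C + atol) and abs(a @ y) <= atol)
```

SMO's job is to increase `dual_objective` while every iterate stays `is_feasible`.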
Why Two, Not One
The equality constraint $\sum_n a^{(n)} y^{(n)} = 0$ couples all multipliers. If you change a single $a^{(i)}$ while keeping everything else fixed, the constraint forces:

$$a^{(i)} y^{(i)} = -\sum_{n \ne i} a^{(n)} y^{(n)}$$

So $a^{(i)}$ is pinned. Updating one multiplier alone is impossible without violating feasibility. Two is the minimum number that can change while preserving $\sum_n a^{(n)} y^{(n)} = 0$: a change in $a^{(i)}$ can be absorbed by a compensating change in $a^{(j)}$.
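A toy numerical check of this argument (the numbers are arbitrary):

```python
import numpy as np

y = np.array([+1.0, -1.0, +1.0])
a = np.array([0.3, 0.5, 0.2])        # feasible: a @ y = 0.3 - 0.5 + 0.2 = 0

a[0] += 0.1                          # move one multiplier alone...
assert abs(a @ y) > 0                # ...and the equality constraint breaks

a[1] -= 0.1 * y[0] * y[1]            # compensating change in a second multiplier
assert abs(a @ y) < 1e-12            # feasible again
```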
The Algorithm
```
Initialise a^(n) = 0 for all n.          # feasible: satisfies both constraints
Repeat until convergence:
    Select pair (i, j) heuristically (see below).
    Compute candidate a^(j,new) by the analytic update.
    Clip a^(j,new) to the feasible interval [L, H].
    Set a^(i,new) so that ζ = a^(i) y^(i) + a^(j) y^(j) is preserved.
```
Initialisation. $a^{(n)} = 0$ for all $n$ trivially satisfies $0 \le a^{(n)} \le C$ and $\sum_n a^{(n)} y^{(n)} = 0$.

Convergence. Stop when no training example violates the KKT conditions within a specified tolerance $\epsilon$.
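A minimal Python sketch of this outer loop. Here `select_j` and `take_step` are stand-ins for the pair-selection heuristic and the analytic update covered below; they are assumptions of this sketch, passed in as callables:

```python
import numpy as np

def smo(K, y, C, select_j, take_step, tol=1e-3, max_iter=10_000, seed=0):
    """Decomposition loop: repeat two-variable steps until no KKT violators."""
    rng = np.random.default_rng(seed)
    N = len(y)
    a, b = np.zeros(N), 0.0                      # a = 0 is feasible
    for _ in range(max_iter):
        E = (a * y) @ K + b - y                  # errors E_n = f(x_n) - y_n
        margin = y * (E + y)                     # y_n * f(x_n)
        violators = [n for n in range(N)
                     if (a[n] < C - tol and margin[n] < 1 - tol)
                     or (a[n] > tol and margin[n] > 1 + tol)]
        if not violators:
            break                                # KKT holds within tol: optimum
        i = int(rng.choice(violators))           # first-choice heuristic goes here
        j = select_j(i, E, a, C)                 # second-choice heuristic
        a, b = take_step(i, j, a, b, K, y, C)    # analytic update + clip + bias
    return a, b
```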
The Two-Variable Update
Fix all $a^{(n)}$ for $n \ne i, j$. The equality constraint becomes:

$$a^{(i)} y^{(i)} + a^{(j)} y^{(j)} = -\sum_{n \ne i, j} a^{(n)} y^{(n)} = \zeta$$

so $a^{(i)}$ is determined by $a^{(j)}$. Substitute into the dual objective and differentiate w.r.t. $a^{(j)}$. The closed-form update is:

$$a^{(j,\text{new})} = a^{(j)} + \frac{y^{(j)} \bigl(E^{(i)} - E^{(j)}\bigr)}{\eta}, \qquad \eta = k\bigl(x^{(i)}, x^{(i)}\bigr) + k\bigl(x^{(j)}, x^{(j)}\bigr) - 2\,k\bigl(x^{(i)}, x^{(j)}\bigr)$$

where $E^{(n)} = f(x^{(n)}) - y^{(n)}$ is the prediction error on example $n$ (using the current $a$-values).
Intuition. The denominator is $\eta = \bigl\| \phi(x^{(i)}) - \phi(x^{(j)}) \bigr\|^2$, the squared distance between the two examples in feature space. The numerator scales with the disagreement between the two errors. Examples that are far apart and have very different errors produce the largest moves; nearby points or points already in agreement produce small ones.
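In code, the analytic step is a few lines (a sketch; `K` is a precomputed Gram matrix and `b` the current bias):

```python
import numpy as np

def unclipped_update(i, j, a, b, K, y):
    """Candidate a_j before clipping: a_j + y_j * (E_i - E_j) / eta."""
    f = (a * y) @ K + b                       # decision values on the training set
    E = f - y                                 # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # squared feature-space distance
    return a[j] + y[j] * (E[i] - E[j]) / eta
```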
Clipping to the Box
The unclipped update may push $a^{(j)}$ outside $[0, C]$, or push $a^{(i)}$ (computed from $\zeta$) outside the box. Define:
- If $y^{(i)} \ne y^{(j)}$ (the line $a^{(j)} - a^{(i)} = \text{const}$): $L = \max\bigl(0,\, a^{(j)} - a^{(i)}\bigr)$, $H = \min\bigl(C,\, C + a^{(j)} - a^{(i)}\bigr)$.
- If $y^{(i)} = y^{(j)}$ (the line $a^{(i)} + a^{(j)} = \text{const}$): $L = \max\bigl(0,\, a^{(i)} + a^{(j)} - C\bigr)$, $H = \min\bigl(C,\, a^{(i)} + a^{(j)}\bigr)$.

Clip the candidate:

$$a^{(j,\text{new})} \leftarrow \min\bigl(H, \max\bigl(L, a^{(j,\text{new})}\bigr)\bigr)$$

Then recover $a^{(i,\text{new})}$ from the equality $a^{(i,\text{new})} y^{(i)} + a^{(j,\text{new})} y^{(j)} = \zeta$, i.e. $a^{(i,\text{new})} = a^{(i)} + y^{(i)} y^{(j)} \bigl(a^{(j)} - a^{(j,\text{new})}\bigr)$.
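A sketch of the clipping logic; it mirrors the two cases above:

```python
def clip_and_recover(i, j, a_j_cand, a, y, C):
    """Clip the candidate a_j to [L, H], then restore the equality constraint."""
    if y[i] != y[j]:                                      # line a_j - a_i = const
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:                                                 # line a_i + a_j = const
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    a_j_new = min(H, max(L, a_j_cand))
    a_i_new = a[i] + y[i] * y[j] * (a[j] - a_j_new)       # preserves zeta
    return a_i_new, a_j_new
```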
How to Pick the Pair
Selection is heuristic — convergence is guaranteed regardless, but speed depends heavily on the choice.
Selecting $i$ (the first multiplier). Alternate between two strategies:
- Pick randomly among all examples that violate the KKT conditions (within tolerance $\epsilon$).
- Pick randomly among non-bound multipliers (those with $0 < a^{(n)} < C$, i.e. margin support vectors) that violate KKT.
Strategy 2 keeps the search focused on the active set; strategy 1 occasionally checks bound multipliers in case they should leave the bound.
Selecting $j$ (the second multiplier). Try in order until a positive improvement in the dual objective is observed:
- Maximum step heuristic. Pick the $j$ that maximises $|E^{(i)} - E^{(j)}|$; this approximates the largest update to $a^{(j)}$ (since the numerator of the SMO update is $y^{(j)}(E^{(i)} - E^{(j)})$).
- Iterate over non-bound multipliers ($0 < a^{(n)} < C$) until improvement.
- Iterate over the entire training set until improvement.
- If still no improvement, replace $i$ and try again.

The errors $E^{(n)}$ are typically cached and updated incrementally, so the heuristic is cheap.
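A sketch of the second-choice rule with the error cache as a plain array; the "until improvement" retry loops are elided here, leaving only the maximum-step rule and a fallback:

```python
def select_j(i, E, a, C):
    """Prefer the non-bound j maximising |E_i - E_j|; else fall back to any j."""
    non_bound = [n for n in range(len(a)) if 0.0 < a[n] < C and n != i]
    pool = non_bound if non_bound else [n for n in range(len(a)) if n != i]
    return max(pool, key=lambda n: abs(E[i] - E[n]))
```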
Convergence
The KKT conditions for the soft-margin SVM, in terms of the multipliers $a^{(n)}$ and margins $y^{(n)} f(x^{(n)})$:
| Condition | Meaning |
|---|---|
| $a^{(n)} = 0 \Rightarrow y^{(n)} f(x^{(n)}) \ge 1$ | Non-SV: outside the margin |
| $0 < a^{(n)} < C \Rightarrow y^{(n)} f(x^{(n)}) = 1$ | Margin SV: on the margin exactly |
| $a^{(n)} = C \Rightarrow y^{(n)} f(x^{(n)}) \le 1$ | Bound SV: on, inside, or violating the margin |
SMO checks each example against these (within tolerance $\epsilon$). When no example violates, the algorithm has found the optimum.
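As a predicate (a sketch; `margin` is $y^{(n)} f(x^{(n)})$ under the current model):

```python
def kkt_violated(a_n, margin, C, tol=1e-3):
    """True if (a_n, margin) violates the table above by more than tol."""
    if a_n < tol:                       # a_n ~ 0: need margin >= 1
        return margin < 1.0 - tol
    if a_n > C - tol:                   # a_n ~ C: need margin <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol      # non-bound: need margin == 1
```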
Worked Example
Three points: $x^{(1)} = (1, 0)$, $y^{(1)} = +1$; $x^{(2)} = (0, 1)$, $y^{(2)} = +1$; $x^{(3)} = (-1, -1)$, $y^{(3)} = -1$. Current $a^{(1)} = 0.6$, $a^{(2)} = 0.4$, $a^{(3)} = 1.0$ (feasible: $0.6 + 0.4 - 1.0 = 0$), linear kernel ($k(x, z) = x^\top z$). Pretend $C = 1$ and $b = 0$. Update the pair $(1, 2)$.

Errors: $f(x^{(1)}) = 1.6$ and $f(x^{(2)}) = 1.4$, so $E^{(1)} = 0.6$ and $E^{(2)} = 0.4$.

Kernels: $k(x^{(1)}, x^{(1)}) = 1$, $k(x^{(2)}, x^{(2)}) = 1$, $k(x^{(1)}, x^{(2)}) = 0$.

Denominator: $\eta = 1 + 1 - 2 \cdot 0 = 2$.

Clip. Since $y^{(1)} = y^{(2)}$, $[L, H] = [\max(0,\, 0.6 + 0.4 - 1),\, \min(1,\, 0.6 + 0.4)] = [0, 1]$. The candidate $a^{(2,\text{new})} = 0.4 + (0.6 - 0.4)/2 = 0.5$ lies inside $[L, H]$, so set $a^{(2,\text{new})} = 0.5$.

Recover $a^{(1,\text{new})}$: $a^{(1,\text{new})} = a^{(1)} + y^{(1)} y^{(2)} \bigl(a^{(2)} - a^{(2,\text{new})}\bigr) = 0.6 - 0.1 = 0.5$.

The pair becomes $(0.5, 0.5)$: multiplier mass shifted from example 1 to example 2.
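A few lines of numpy reproduce the arithmetic (array names are mine):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
a = np.array([0.6, 0.4, 1.0])
C, b = 1.0, 0.0

K = X @ X.T                                   # linear-kernel Gram matrix
E = (a * y) @ K + b - y                       # E[0] = 0.6, E[1] = 0.4
eta = K[0, 0] + K[1, 1] - 2 * K[0, 1]         # eta = 2
cand = a[1] + y[1] * (E[0] - E[1]) / eta      # candidate a_2 = 0.5
L, H = max(0.0, a[0] + a[1] - C), min(C, a[0] + a[1])   # [0, 1] since y_1 == y_2
a2 = min(H, max(L, cand))                     # no clipping needed
a1 = a[0] + y[0] * y[1] * (a[1] - a2)         # 0.5
print(round(a1, 6), round(a2, 6))             # -> 0.5 0.5
```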
What Could Go Wrong
- Numerical instability when $x^{(i)} \approx x^{(j)}$ in feature space. The denominator $\eta$ approaches zero, blowing up the update. Handle by skipping the pair (a minimal guard is sketched after this list).
- Bad pair selection. The algorithm still converges, but lazy heuristics (e.g., picking both indices at random) can be orders of magnitude slower than the maximum-step heuristic.
- Tolerance too tight. Setting $\epsilon$ too small means pairs keep flipping with no real progress. The standard tolerance is around $\epsilon = 10^{-3}$.
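A minimal guard for the first pitfall (the threshold value is a made-up default):

```python
def safe_eta(K, i, j, eps=1e-12):
    """Return eta for the pair, or None to signal 'skip this pair'."""
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    return eta if eta > eps else None
```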
Connections
- Solves soft-margin-svm, and hard-margin as well (set $C = \infty$, which drops the upper bound).
- Uses kkt-conditions — the optimality and stopping test.
- Why two-variable subproblems are special — the equality constraint rules out one-variable updates; two is the minimum that preserves feasibility and admits a closed-form analytic update. Larger subsets would need numerical sub-solvers.
- Alternatives — interior-point methods, gradient projection, coordinate descent on the primal hinge-loss form. SMO dominates for kernel SVMs because of the analytic update; for linear SVMs, primal solvers like LIBLINEAR are typically faster.