A decomposition method for the soft-margin SVM dual: instead of solving the full $N$-variable QP at once, repeatedly pick a pair of Lagrange multipliers $(a^{(i)}, a^{(j)})$, solve the two-variable subproblem analytically, and iterate until the KKT conditions are met within tolerance. Sidesteps the memory and time cost of an off-the-shelf QP solver.
Why Decomposition
The soft-margin SVM dual

$$\max_{a} \; \sum_{n=1}^{N} a^{(n)} - \frac{1}{2} \sum_{n=1}^{N} \sum_{m=1}^{N} a^{(n)} a^{(m)} y^{(n)} y^{(m)} \, k\bigl(x^{(n)}, x^{(m)}\bigr)$$

subject to $0 \le a^{(n)} \le C$ and $\sum_n a^{(n)} y^{(n)} = 0$ is a convex QP with $N$ variables. Generic QP solvers store the Gram matrix ($N \times N$, often dense) and run in $O(N^3)$ time, infeasible past tens of thousands of examples.
SMO observes that we can restrict attention to a small subset of variables at a time, fix the rest, and solve the resulting subproblem cheaply. With exactly two variables, the subproblem has a closed-form solution.
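To make the objects concrete, here is a minimal numpy sketch of the dual objective and the feasibility test (the function names are mine, purely illustrative):

```python
import numpy as np

def dual_objective(a, y, K):
    """W(a) = sum_n a_n - (1/2) sum_{n,m} a_n a_m y_n y_m k(x_n, x_m)."""
    v = a * y
    return a.sum() - 0.5 * v @ K @ v

def is_feasible(a, y, C, atol=1e-9):
    """Box constraint 0 <= a_n <= C and equality constraint sum_n a_n y_n = 0."""
    return bool(np.all(a >= -atol) and np.all(a <= C + atol) and abs(a @ y) <= atol)
```

SMO's job is to increase `dual_objective` while every iterate stays `is_feasible`.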
Why Two, Not One
The equality constraint $\sum_n a^{(n)} y^{(n)} = 0$ couples all multipliers. If you change a single $a^{(i)}$ while keeping everything else fixed, the constraint forces:

$$a^{(i)} y^{(i)} = -\sum_{n \ne i} a^{(n)} y^{(n)}$$

So $a^{(i)}$ is pinned. Updating one multiplier alone is impossible without violating feasibility. Two is the minimum number that can change while preserving $\sum_n a^{(n)} y^{(n)} = 0$: a change in $a^{(i)}$ can be absorbed by a compensating change in $a^{(j)}$.
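A toy numerical check of this argument (the numbers are arbitrary):

```python
import numpy as np

y = np.array([+1.0, -1.0, +1.0])
a = np.array([0.3, 0.5, 0.2])        # feasible: a @ y = 0.3 - 0.5 + 0.2 = 0

a[0] += 0.1                          # move one multiplier alone...
assert abs(a @ y) > 0                # ...and the equality constraint breaks

a[1] -= 0.1 * y[0] * y[1]            # compensating change in a second multiplier
assert abs(a @ y) < 1e-12            # feasible again
```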
The Algorithm
```
Initialise a^(n) = 0 for all n.          # feasible: satisfies both constraints
Repeat until convergence:
    Select pair (i, j) heuristically (see below).
    Compute candidate a^(j,new) by the analytic update.
    Clip a^(j,new) to the feasible interval [L, H].
    Set a^(i,new) so that ζ = a^(i) y^(i) + a^(j) y^(j) is preserved.
```
Initialisation. $a^{(n)} = 0$ for all $n$ trivially satisfies $0 \le a^{(n)} \le C$ and $\sum_n a^{(n)} y^{(n)} = 0$.

Convergence. Stop when no training example violates the KKT conditions within a specified tolerance $\epsilon$.
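A minimal Python sketch of this outer loop. Here `select_j` and `take_step` are stand-ins for the pair-selection heuristic and the analytic update covered below; they are assumptions of this sketch, passed in as callables:

```python
import numpy as np

def smo(K, y, C, select_j, take_step, tol=1e-3, max_iter=10_000, seed=0):
    """Decomposition loop: repeat two-variable steps until no KKT violators."""
    rng = np.random.default_rng(seed)
    N = len(y)
    a, b = np.zeros(N), 0.0                      # a = 0 is feasible
    for _ in range(max_iter):
        E = (a * y) @ K + b - y                  # errors E_n = f(x_n) - y_n
        margin = y * (E + y)                     # y_n * f(x_n)
        violators = [n for n in range(N)
                     if (a[n] < C - tol and margin[n] < 1 - tol)
                     or (a[n] > tol and margin[n] > 1 + tol)]
        if not violators:
            break                                # KKT holds within tol: optimum
        i = int(rng.choice(violators))           # first-choice heuristic goes here
        j = select_j(i, E, a, C)                 # second-choice heuristic
        a, b = take_step(i, j, a, b, K, y, C)    # analytic update + clip + bias
    return a, b
```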
The Two-Variable Update
Fix all $a^{(n)}$ for $n \ne i, j$. The equality constraint becomes:

$$a^{(i)} y^{(i)} + a^{(j)} y^{(j)} = -\sum_{n \ne i, j} a^{(n)} y^{(n)} = \zeta$$

so $a^{(i)}$ is determined by $a^{(j)}$. Substitute into the dual objective and differentiate w.r.t. $a^{(j)}$. The closed-form update is:

$$a^{(j,\text{new})} = a^{(j)} + \frac{y^{(j)} \bigl(E^{(i)} - E^{(j)}\bigr)}{\eta}, \qquad \eta = k\bigl(x^{(i)}, x^{(i)}\bigr) + k\bigl(x^{(j)}, x^{(j)}\bigr) - 2\,k\bigl(x^{(i)}, x^{(j)}\bigr)$$

where $E^{(n)} = f(x^{(n)}) - y^{(n)}$ is the prediction error on example $n$ (using the current $a$-values).
Intuition. The denominator is $\eta = \bigl\| \phi(x^{(i)}) - \phi(x^{(j)}) \bigr\|^2$, the squared distance between the two examples in feature space. The numerator scales with the disagreement between the two errors. Examples that are far apart and have very different errors produce the largest moves; nearby points or points already in agreement produce small ones.
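In code, the analytic step is a few lines (a sketch; `K` is a precomputed Gram matrix and `b` the current bias):

```python
import numpy as np

def unclipped_update(i, j, a, b, K, y):
    """Candidate a_j before clipping: a_j + y_j * (E_i - E_j) / eta."""
    f = (a * y) @ K + b                       # decision values on the training set
    E = f - y                                 # prediction errors
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]   # squared feature-space distance
    return a[j] + y[j] * (E[i] - E[j]) / eta
```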
Clipping to the Box
The unclipped update may push $a^{(j)}$ outside $[0, C]$, or push $a^{(i)}$ (computed from $\zeta$) outside the box. Define:
- If $y^{(i)} \ne y^{(j)}$ (the line $a^{(j)} - a^{(i)} = \text{const}$): $L = \max\bigl(0,\, a^{(j)} - a^{(i)}\bigr)$, $H = \min\bigl(C,\, C + a^{(j)} - a^{(i)}\bigr)$.
- If $y^{(i)} = y^{(j)}$ (the line $a^{(i)} + a^{(j)} = \text{const}$): $L = \max\bigl(0,\, a^{(i)} + a^{(j)} - C\bigr)$, $H = \min\bigl(C,\, a^{(i)} + a^{(j)}\bigr)$.

Clip the candidate:

$$a^{(j,\text{new})} \leftarrow \min\bigl(H, \max\bigl(L, a^{(j,\text{new})}\bigr)\bigr)$$

Then recover $a^{(i,\text{new})}$ from the equality $a^{(i,\text{new})} y^{(i)} + a^{(j,\text{new})} y^{(j)} = \zeta$, i.e. $a^{(i,\text{new})} = a^{(i)} + y^{(i)} y^{(j)} \bigl(a^{(j)} - a^{(j,\text{new})}\bigr)$.
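A sketch of the clipping logic; it mirrors the two cases above:

```python
def clip_and_recover(i, j, a_j_cand, a, y, C):
    """Clip the candidate a_j to [L, H], then restore the equality constraint."""
    if y[i] != y[j]:                                      # line a_j - a_i = const
        L, H = max(0.0, a[j] - a[i]), min(C, C + a[j] - a[i])
    else:                                                 # line a_i + a_j = const
        L, H = max(0.0, a[i] + a[j] - C), min(C, a[i] + a[j])
    a_j_new = min(H, max(L, a_j_cand))
    a_i_new = a[i] + y[i] * y[j] * (a[j] - a_j_new)       # preserves zeta
    return a_i_new, a_j_new
```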
How to Pick the Pair
Selection is heuristic — convergence is guaranteed regardless, but speed depends heavily on the choice.
Selecting $i$ (the first multiplier). Alternate between two strategies:
- Pick randomly among all examples that violate the KKT conditions (within tolerance $\epsilon$).
- Pick randomly among non-bound multipliers (those with $0 < a^{(n)} < C$, i.e. margin support vectors) that violate KKT.
Strategy 2 keeps the search focused on the active set; strategy 1 occasionally checks bound multipliers in case they should leave the bound.
Selecting $j$ (the second multiplier). Try in order until a positive improvement in the dual objective is observed:
- Maximum step heuristic. Pick the $j$ that maximises $|E^{(i)} - E^{(j)}|$; this approximates the largest update to $a^{(j)}$ (since the numerator of the SMO update is $y^{(j)}(E^{(i)} - E^{(j)})$).
- Iterate over non-bound multipliers ($0 < a^{(n)} < C$) until improvement.
- Iterate over the entire training set until improvement.
- If still no improvement, replace $i$ and try again.

The errors $E^{(n)}$ are typically cached and updated incrementally, so the heuristic is cheap.
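A sketch of the second-choice rule with the error cache as a plain array; the "until improvement" retry loops are elided here, leaving only the maximum-step rule and a fallback:

```python
def select_j(i, E, a, C):
    """Prefer the non-bound j maximising |E_i - E_j|; else fall back to any j."""
    non_bound = [n for n in range(len(a)) if 0.0 < a[n] < C and n != i]
    pool = non_bound if non_bound else [n for n in range(len(a)) if n != i]
    return max(pool, key=lambda n: abs(E[i] - E[n]))
```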
Convergence
The KKT conditions for the soft-margin SVM, in terms of the multipliers $a^{(n)}$ and margins $y^{(n)} f(x^{(n)})$:
| Condition | Meaning |
|---|---|
| $a^{(n)} = 0 \Rightarrow y^{(n)} f(x^{(n)}) \ge 1$ | Non-SV: outside the margin |
| $0 < a^{(n)} < C \Rightarrow y^{(n)} f(x^{(n)}) = 1$ | Margin SV: on the margin exactly |
| $a^{(n)} = C \Rightarrow y^{(n)} f(x^{(n)}) \le 1$ | Bound SV: on, inside, or violating the margin |
SMO checks each example against these (within tolerance $\epsilon$). When no example violates, the algorithm has found the optimum.
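As a predicate (a sketch; `margin` is $y^{(n)} f(x^{(n)})$ under the current model):

```python
def kkt_violated(a_n, margin, C, tol=1e-3):
    """True if (a_n, margin) violates the table above by more than tol."""
    if a_n < tol:                       # a_n ~ 0: need margin >= 1
        return margin < 1.0 - tol
    if a_n > C - tol:                   # a_n ~ C: need margin <= 1
        return margin > 1.0 + tol
    return abs(margin - 1.0) > tol      # non-bound: need margin == 1
```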
Worked Example
Three points: $x^{(1)} = (1, 0)$, $y^{(1)} = +1$; $x^{(2)} = (0, 1)$, $y^{(2)} = +1$; $x^{(3)} = (-1, -1)$, $y^{(3)} = -1$. Current $a^{(1)} = 0.6$, $a^{(2)} = 0.4$, $a^{(3)} = 1.0$ (feasible: $0.6 + 0.4 - 1.0 = 0$), linear kernel ($k(x, z) = x^\top z$). Pretend $C = 1$ and $b = 0$. Update the pair $(1, 2)$.

Errors: $f(x^{(1)}) = 1.6$ and $f(x^{(2)}) = 1.4$, so $E^{(1)} = 0.6$ and $E^{(2)} = 0.4$.

Kernels: $k(x^{(1)}, x^{(1)}) = 1$, $k(x^{(2)}, x^{(2)}) = 1$, $k(x^{(1)}, x^{(2)}) = 0$.

Denominator: $\eta = 1 + 1 - 2 \cdot 0 = 2$.

Clip. Since $y^{(1)} = y^{(2)}$, $[L, H] = [\max(0,\, 0.6 + 0.4 - 1),\, \min(1,\, 0.6 + 0.4)] = [0, 1]$. The candidate $a^{(2,\text{new})} = 0.4 + (0.6 - 0.4)/2 = 0.5$ lies inside $[L, H]$, so set $a^{(2,\text{new})} = 0.5$.

Recover $a^{(1,\text{new})}$: $a^{(1,\text{new})} = a^{(1)} + y^{(1)} y^{(2)} \bigl(a^{(2)} - a^{(2,\text{new})}\bigr) = 0.6 - 0.1 = 0.5$.

The pair becomes $(0.5, 0.5)$: multiplier mass shifted from example 1 to example 2.
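A few lines of numpy reproduce the arithmetic (array names are mine):

```python
import numpy as np

X = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y = np.array([+1.0, +1.0, -1.0])
a = np.array([0.6, 0.4, 1.0])
C, b = 1.0, 0.0

K = X @ X.T                                   # linear-kernel Gram matrix
E = (a * y) @ K + b - y                       # E[0] = 0.6, E[1] = 0.4
eta = K[0, 0] + K[1, 1] - 2 * K[0, 1]         # eta = 2
cand = a[1] + y[1] * (E[0] - E[1]) / eta      # candidate a_2 = 0.5
L, H = max(0.0, a[0] + a[1] - C), min(C, a[0] + a[1])   # [0, 1] since y_1 == y_2
a2 = min(H, max(L, cand))                     # no clipping needed
a1 = a[0] + y[0] * y[1] * (a[1] - a2)         # 0.5
print(round(a1, 6), round(a2, 6))             # -> 0.5 0.5
```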
What Could Go Wrong
- Numerical instability when $x^{(i)} \approx x^{(j)}$ in feature space. The denominator $\eta$ approaches zero, blowing up the update. Handle by skipping the pair (a minimal guard is sketched after this list).
- Bad pair selection. The algorithm still converges, but lazy heuristics (e.g., picking both indices at random) can be orders of magnitude slower than the maximum-step heuristic.
- Tolerance too tight. Setting $\epsilon$ too small means pairs keep flipping with no real progress. The standard tolerance is around $\epsilon = 10^{-3}$.
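A minimal guard for the first pitfall (the threshold value is a made-up default):

```python
def safe_eta(K, i, j, eps=1e-12):
    """Return eta for the pair, or None to signal 'skip this pair'."""
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]
    return eta if eta > eps else None
```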
Connections
- Solves soft-margin-svm, and hard-margin as well (set $C = \infty$, which drops the upper bound).
- Uses kkt-conditions — the optimality and stopping test.
- Why two-variable subproblems are special — the equality constraint rules out one-variable updates; two is the minimum that preserves feasibility and admits a closed-form analytic update. Larger subsets would need numerical sub-solvers.
- Alternatives — interior-point methods, gradient projection, coordinate descent on the primal hinge-loss form. SMO dominates for kernel SVMs because of the analytic update; for linear SVMs, primal solvers like LIBLINEAR are typically faster.