TARGET DECK MachineLearning::Week-05
Soft-Margin SVM
Why do we need a soft-margin SVM?
The hard-margin constraint $y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1$ for all $i$ requires every point to be correctly classified beyond the margin. If even one point can’t satisfy that — because classes overlap, or noise has shifted a point to the wrong side — the constraint set is empty and there’s no solution. Real data usually has overlap or label noise, so hard-margin is too brittle.
Write the soft-margin SVM primal problem.
$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \;\; \tfrac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \ge 1 - \xi_i, \;\; \xi_i \ge 0 \;\; \forall i$$
Each training example gets a personal violation budget $\xi_i$. The hyperparameter $C$ trades margin width against violation tolerance.
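A minimal numpy sketch of this objective, assuming a linear model; the names `X`, `y`, `w`, `b` are hypothetical. It uses the fact (next card) that at the optimum each $\xi_i$ equals the hinge loss:

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin primal value: 0.5*||w||^2 + C * sum(xi_i).

    At the optimum each xi_i equals the hinge loss
    max(0, 1 - y_i * (w.x_i + b)), so slack is computed directly.
    """
    margins = y * (X @ w + b)            # y_i * f(x_i) per example
    xi = np.maximum(0.0, 1.0 - margins)  # per-example violation budget
    return 0.5 * (w @ w) + C * xi.sum()
```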
What is a slack variable $\xi_i$, and how does its value encode a point's position?
$\xi_i$ is the per-example margin-violation budget. At the optimum, $\xi_i = \max\!\big(0,\; 1 - y_i(\mathbf{w}^\top \mathbf{x}_i + b)\big)$ — exactly the hinge loss. Position by value (see the sketch after this list):
- $\xi_i = 0$: outside or on the margin envelope
- $0 < \xi_i < 1$: inside the margin, correctly classified
- $\xi_i = 1$: exactly on the decision boundary
- $\xi_i > 1$: misclassified — wrong side of the boundary
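An illustrative helper mapping a slack value to its position; the tolerance `tol` is an added assumption for floating-point comparisons:

```python
def slack_position(xi, tol=1e-9):
    """Map a slack value xi_i to the position it encodes."""
    if xi <= tol:
        return "outside or on the margin envelope"
    if xi < 1.0 - tol:
        return "inside the margin, correctly classified"
    if xi <= 1.0 + tol:
        return "exactly on the decision boundary"
    return "misclassified"
```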
What does the hyperparameter $C$ control in soft-margin SVM, and what happens at the extremes?
$C$ sets the price of slack:
- Large $C$: slack is expensive → narrow margin, behaves like hard-margin (few violations tolerated, risk of overfitting).
- Small $C$: slack is cheap → wide margin, many margin-violators tolerated (many support vectors; risk of underfitting — in the extreme case, the model collapses to “always predict the majority class”).
$C$ is the bias-variance trade-off knob, typically chosen by cross-validation.
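A hedged sketch of that cross-validation sweep, assuming scikit-learn is available; the data and $C$ grid are placeholders:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Placeholder data: two overlapping classes with +/-1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=200) > 0, 1, -1)

for C in [0.01, 0.1, 1.0, 10.0, 100.0]:
    scores = cross_val_score(SVC(kernel="linear", C=C), X, y, cv=5)
    print(f"C={C:7.2f}  mean CV accuracy={scores.mean():.3f}")
```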
What is the soft-margin SVM dual, and how does it differ from the hard-margin dual?
$$\max_{\boldsymbol{\alpha}} \;\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, k(\mathbf{x}_i, \mathbf{x}_j)$$
subject to $0 \le \alpha_i \le C$ for all $i$ and $\sum_{i=1}^{n} \alpha_i y_i = 0$.
The only difference from hard-margin: the box constraint $0 \le \alpha_i \le C$ (hard-margin requires only $\alpha_i \ge 0$). Slack $\xi_i$ and its multipliers vanish in the dual derivation; the entire effect of slack collapses to one upper bound.
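A numpy sketch of evaluating this dual and checking feasibility, with a hypothetical Gram matrix `K`:

```python
import numpy as np

def dual_objective(alpha, y, K):
    """Soft-margin dual value: sum(alpha) - 0.5 * alpha' Q alpha,
    with Q_ij = y_i * y_j * K_ij (K is the kernel Gram matrix)."""
    Q = np.outer(y, y) * K
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

def is_dual_feasible(alpha, y, C, tol=1e-8):
    """Box constraint 0 <= alpha_i <= C plus sum_i alpha_i y_i = 0."""
    in_box = np.all(alpha >= -tol) and np.all(alpha <= C + tol)
    return bool(in_box and abs(alpha @ y) <= tol)
```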
What are the three categories of training point in a soft-margin SVM, indexed by $\alpha_i$ and $\xi_i$?
| Type | $\alpha_i$ | $\xi_i$ | Position |
| --- | --- | --- | --- |
| Not a SV | $\alpha_i = 0$ | $\xi_i = 0$ | Strictly outside margin |
| Margin SV | $0 < \alpha_i < C$ | $\xi_i = 0$ | Exactly on margin: $y_i f(\mathbf{x}_i) = 1$ |
| Bound SV | $\alpha_i = C$ | $\xi_i \ge 0$ | On margin / inside / misclassified |

Only margin SVs can be used to compute $b$ — bound SVs have $\xi_i > 0$ in general.
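An illustrative categoriser by $\alpha_i$ alone; the tolerance `tol` is an added floating-point assumption:

```python
import numpy as np

def categorise(alpha, C, tol=1e-8):
    """Label each training point by its multiplier alpha_i."""
    labels = np.empty(alpha.shape, dtype=object)
    labels[alpha <= tol] = "not a SV"
    labels[(alpha > tol) & (alpha < C - tol)] = "margin SV"
    labels[alpha >= C - tol] = "bound SV"
    return labels
```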
Why is sparsity weakened (but preserved) in soft-margin SVM compared to hard-margin?
Hard-margin: only points exactly on the margin have $\alpha_i > 0$ — typically a tiny fraction. Soft-margin: every margin violator also has $\alpha_i > 0$ (at the bound $\alpha_i = C$), expanding the active set. But points comfortably outside the margin still have $\alpha_i = 0$ and contribute nothing to predictions. So sparsity weakens (more support vectors) but doesn’t vanish.
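To see the weakened-but-preserved sparsity numerically, a sketch counting support vectors as $C$ varies, assuming scikit-learn; the overlapping synthetic data is a placeholder:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.5, (100, 2)),
               rng.normal(+1.0, 1.5, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

for C in [0.01, 1.0, 100.0]:
    n_sv = SVC(kernel="linear", C=C).fit(X, y).support_.size
    print(f"C={C:7.2f}  support vectors: {n_sv}/{len(y)}")
```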
Sequential Minimal Optimization
What is Sequential Minimal Optimization (SMO), and why update two multipliers at a time rather than one?
SMO solves the SVM dual by picking a small subset of multipliers, optimising over them analytically, and iterating. The minimum subset size is two because of the equality constraint $\sum_i \alpha_i y_i = 0$: picking just one multiplier with everything else fixed pins it in place — no movement possible. Two multipliers can move while preserving the constraint.
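A minimal sketch of why the paired step works: moving $\alpha_j$ by $\delta$ forces $\alpha_i$ to move by $-y_i y_j \delta$, keeping $\sum_k \alpha_k y_k$ unchanged (all names hypothetical):

```python
import numpy as np

def paired_step(alpha, y, i, j, delta):
    """Move alpha_j by delta and compensate with alpha_i so that
    sum_k alpha_k * y_k stays exactly zero (y_i, y_j in {-1, +1})."""
    alpha = alpha.copy()
    alpha[j] += delta
    alpha[i] -= y[i] * y[j] * delta  # y_j*delta - y_i*(y_i*y_j*delta) = 0
    return alpha
```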
What is the analytic update in SMO for the second multiplier $\alpha_j$?
$$\alpha_j^{\text{new}} = \alpha_j + \frac{y_j\,(E_i - E_j)}{\eta}, \qquad \eta = k(\mathbf{x}_i, \mathbf{x}_i) + k(\mathbf{x}_j, \mathbf{x}_j) - 2\,k(\mathbf{x}_i, \mathbf{x}_j)$$
where $E_k = f(\mathbf{x}_k) - y_k$ is the prediction error. The denominator $\eta = \|\phi(\mathbf{x}_i) - \phi(\mathbf{x}_j)\|^2$ is the squared distance between the two examples in feature space. The numerator scales with error disagreement.
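A direct transcription of this step (hypothetical scalar arguments, kernel entries `K_ii`, `K_jj`, `K_ij`):

```python
def smo_step_unclipped(alpha_j, y_j, E_i, E_j, K_ii, K_jj, K_ij):
    """Analytic SMO update for the second multiplier, before clipping."""
    eta = K_ii + K_jj - 2.0 * K_ij  # squared feature-space distance
    if eta <= 0.0:                  # degenerate pair: skipped in practice
        return alpha_j
    return alpha_j + y_j * (E_i - E_j) / eta
```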
After computing the analytic SMO update, why must the result be clipped, and what is the clipping interval?
Because the candidate update may push $\alpha_j$ out of $[0, C]$ or push the paired $\alpha_i$ out of its box. The feasible interval $[L, H]$ depends on whether the labels match:
- Same labels ($y_i = y_j$): $L = \max(0,\; \alpha_i + \alpha_j - C)$, $H = \min(C,\; \alpha_i + \alpha_j)$
- Different labels ($y_i \ne y_j$): $L = \max(0,\; \alpha_j - \alpha_i)$, $H = \min(C,\; C + \alpha_j - \alpha_i)$
Then clip: $\alpha_j^{\text{new}} \leftarrow \min\!\big(H, \max(L, \alpha_j^{\text{new}})\big)$, and recover $\alpha_i$ from the equality constraint.
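Continuing the sketch above, the clip-and-recover step (names hypothetical):

```python
def smo_clip_and_recover(alpha_i, alpha_j, y_i, y_j, alpha_j_cand, C):
    """Clip the candidate alpha_j to [L, H], then recover alpha_i
    from the conserved quantity alpha_i*y_i + alpha_j*y_j."""
    if y_i == y_j:
        L, H = max(0.0, alpha_i + alpha_j - C), min(C, alpha_i + alpha_j)
    else:
        L, H = max(0.0, alpha_j - alpha_i), min(C, C + alpha_j - alpha_i)
    alpha_j_new = min(H, max(L, alpha_j_cand))
    alpha_i_new = alpha_i + y_i * y_j * (alpha_j - alpha_j_new)
    return alpha_i_new, alpha_j_new
```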
What is the maximum-step heuristic for choosing the second multiplier in SMO?
After picking $\alpha_i$, choose $\alpha_j$ to maximise $|E_i - E_j|$ — the largest disagreement between cached prediction errors. This approximates the largest update size and accelerates convergence. SMO converges for any pair-selection rule, but heuristics matter hugely — choosing both multipliers at random can be 10×–100× slower.
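The heuristic in numpy, assuming a cached error array `E`:

```python
import numpy as np

def choose_second(i, E):
    """Max-step heuristic: pick j maximising |E_i - E_j| over cached errors."""
    gaps = np.abs(E[i] - E)
    gaps[i] = -np.inf  # never pair a multiplier with itself
    return int(np.argmax(gaps))
```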
Restate the SVM KKT optimality conditions in terms of $\alpha_i$ alone.

| Condition | Geometric meaning |
| --- | --- |
| $\alpha_i = 0 \;\Rightarrow\; y_i f(\mathbf{x}_i) \ge 1$ | Outside margin; not a SV |
| $0 < \alpha_i < C \;\Rightarrow\; y_i f(\mathbf{x}_i) = 1$ | On margin; margin SV |
| $\alpha_i = C \;\Rightarrow\; y_i f(\mathbf{x}_i) \le 1$ | On / inside / violating margin; bound SV |

When every example satisfies its KKT condition within tolerance, the dual is at its optimum.
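A sketch of that stopping test, with cached decision values `f_x` and a hypothetical tolerance:

```python
import numpy as np

def kkt_violations(alpha, y, f_x, C, tol=1e-3):
    """Boolean mask of examples violating their KKT condition;
    f_x holds the decision values f(x_i)."""
    m = y * f_x  # functional margins y_i * f(x_i)
    v_zero  = (alpha <= tol) & (m < 1.0 - tol)
    v_free  = (alpha > tol) & (alpha < C - tol) & (np.abs(m - 1.0) > tol)
    v_bound = (alpha >= C - tol) & (m > 1.0 + tol)
    return v_zero | v_free | v_bound
```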
Practical Comparison
Why is the soft-margin SVM dual still a convex quadratic program despite the added slack?
Because in the dual, slack disappears entirely — only the box constraint $0 \le \alpha_i \le C$ remains. The dual objective is concave (a negative-semidefinite quadratic form in $\boldsymbol{\alpha}$ via the kernel Gram matrix), and the constraints are linear. Standard convex QP. Strong duality still holds because the primal is convex and strictly feasible (any $\mathbf{w}, b$ with $\xi_i$ large enough), so Slater’s condition applies.
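A quick numerical sanity check under the assumption of a PSD kernel: $Q_{ij} = y_i y_j k(\mathbf{x}_i, \mathbf{x}_j)$ equals $DKD$ with $D = \operatorname{diag}(y)$, so it inherits positive semidefiniteness from the Gram matrix and $-\tfrac{1}{2}\boldsymbol{\alpha}^\top Q \boldsymbol{\alpha}$ is concave:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.choice([-1.0, 1.0], size=50)

K = X @ X.T             # linear-kernel Gram matrix: PSD by construction
Q = np.outer(y, y) * K  # Q = D K D with D = diag(y), so Q is PSD too
print(np.linalg.eigvalsh(Q).min() >= -1e-9)  # True up to round-off
```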
Why use only margin support vectors (and not bound SVs) to compute the bias $b$?
A margin SV has $0 < \alpha_i < C$, which forces $\xi_i = 0$ via complementary slackness, which pins the point exactly on the margin: $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1$. Solving for $b$ gives the correct value: $b = y_i - \mathbf{w}^\top \mathbf{x}_i$. Bound SVs ($\alpha_i = C$) have $\xi_i \ge 0$, so $y_i(\mathbf{w}^\top \mathbf{x}_i + b) = 1 - \xi_i \ne 1$ in general — substituting gives the wrong $b$. Average over all margin SVs for numerical stability.
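A kernelised sketch of the averaged bias (names hypothetical), using $f(\mathbf{x}_i) = \sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x}_i) + b$ and $y_i f(\mathbf{x}_i) = 1$ on margin SVs:

```python
import numpy as np

def bias_from_margin_svs(alpha, y, K, C, tol=1e-8):
    """Average b over margin SVs (0 < alpha_i < C), where
    y_i * f(x_i) = 1 holds exactly at the optimum."""
    margin = (alpha > tol) & (alpha < C - tol)
    f_without_b = K[margin] @ (alpha * y)  # sum_j alpha_j y_j k(x_j, x_i)
    return float(np.mean(y[margin] - f_without_b))
```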