An extension of the hard-margin SVM that permits training points to sit inside the margin or even on the wrong side of the decision boundary, paying a per-example penalty for each violation. A hyperparameter $C$ controls the trade-off between margin width and total violation.
Why Hard-Margin Isn’t Enough
The hard-margin SVM solves:

$$\min_{\mathbf{w},\,b}\ \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y^{(i)}\left(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b\right) \ge 1 \quad \forall i$$
Two failure modes:
- Non-separable data. If no hyperplane in $\phi$-space separates the classes, the constraint set is empty and the problem has no feasible solution.
- Overfitting on barely-separable data. Even when the data is separable, a single noisy or mislabelled point near the boundary forces the hyperplane into a contorted position with a tiny margin. With kernels (especially Gaussian), $\phi$-space is high- or infinite-dimensional, and some separating boundary almost always exists — but it’s the wrong one.
The fix: stop demanding perfect separation. Allow each example a controlled amount of margin violation.
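A quick way to see both failure modes in practice is to fit the same data twice with scikit-learn, once with a very large penalty (approximating hard margin) and once with a moderate one. The dataset, the outlier’s position, and the parameter values below are purely illustrative:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two separable blobs plus one deliberately mislabelled point.
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=0)
X = np.vstack([X, [[0.0, 2.0]]])   # illustrative outlier position
y = np.append(y, 0)                # given the "wrong" label on purpose

# Very large C approximates hard margin: the boundary bends to classify the outlier.
hard_ish = SVC(kernel="rbf", C=1e6).fit(X, y)

# Moderate C treats the outlier as a paid-for margin violation instead.
soft = SVC(kernel="rbf", C=1.0).fit(X, y)

print(hard_ish.n_support_, soft.n_support_)  # smaller C generally keeps more support vectors
```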
The Primal Formulation
Introduce a slack variable $\xi_i \ge 0$ for each training example, relaxing the margin constraint:

$$y^{(i)}\left(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b\right) \ge 1 - \xi_i, \qquad \xi_i \ge 0$$
Penalise total slack in the objective:

$$\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\ \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i$$
Three things changed from hard-margin:
- The constraint right-hand side is now $1 - \xi_i$ instead of $1$ — points can violate the margin by an amount $\xi_i$.
- The objective gains $C\sum_i \xi_i$ — total violation is penalised.
- The optimisation now has $n$ extra variables ($\xi_1, \dots, \xi_n$).
The margin is now simply $1/\|\mathbf{w}\|$: there is no longer a “closest training point” definition because some points might be inside the margin.
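To connect the formulation to code: at the optimum each slack takes its smallest feasible value, $\xi_i = \max\!\left(0,\; 1 - y^{(i)}(\mathbf{w}^\top\mathbf{x}^{(i)} + b)\right)$, so for a linear kernel the primal objective can be evaluated directly for any candidate $(\mathbf{w}, b)$. A minimal sketch (the function name and toy numbers are mine):

```python
import numpy as np

def primal_objective(w, b, X, y, C):
    """Soft-margin primal objective for a linear kernel.

    At the optimum each slack takes its smallest feasible value,
    xi_i = max(0, 1 - y_i * (w @ x_i + b)), i.e. the hinge loss.
    """
    margins = y * (X @ w + b)
    xi = np.maximum(0.0, 1.0 - margins)          # per-example slack
    return 0.5 * np.dot(w, w) + C * np.sum(xi)

# Toy usage with made-up numbers (purely illustrative):
X = np.array([[2.0, 1.0], [-1.0, -1.5], [0.2, 0.1]])
y = np.array([1, -1, 1])
print(primal_objective(np.array([1.0, 0.5]), -0.2, X, y, C=1.0))
```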
The Hyperparameter $C$
$C$ trades off margin width against violation tolerance:
| Regime | Behaviour | Risk |
|---|---|---|
| $C \to \infty$ | Slacks heavily penalised; recovers hard-margin behaviour | Overfit, especially with high-capacity kernels |
| Large $C$ | Few/small violations; narrow margin | Overfit, may not generalise |
| Small $C$ | Many violations allowed; wide margin | Underfit, tolerates too many misclassifications |
| $C \to 0$ | Slack is free; margin maximised at all costs | Boundary collapses (any classification works) |

Set $C$ via cross-validation. The sklearn SVC default is $C = 1.0$.
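A minimal sketch of choosing $C$ by cross-validation with scikit-learn (the grid and dataset are illustrative, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# Logarithmic grid over C, scored by 5-fold cross-validation.
grid = GridSearchCV(SVC(kernel="rbf"), {"C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```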
The Dual Formulation
Repeat the Lagrangian / KKT machinery from week 4, this time with two sets of multipliers — $\alpha_i \ge 0$ for the margin constraint and $\mu_i \ge 0$ for $\xi_i \ge 0$:

$$\mathcal{L}(\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\alpha}, \boldsymbol{\mu}) = \tfrac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \xi_i - \sum_i \alpha_i\left[y^{(i)}\left(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b\right) - 1 + \xi_i\right] - \sum_i \mu_i \xi_i$$
Setting partials to zero:
- $\partial\mathcal{L}/\partial\mathbf{w} = 0 \;\Rightarrow\; \mathbf{w} = \sum_i \alpha_i y^{(i)}\phi(\mathbf{x}^{(i)})$ — same as hard-margin.
- $\partial\mathcal{L}/\partial b = 0 \;\Rightarrow\; \sum_i \alpha_i y^{(i)} = 0$ — same as hard-margin.
- $\partial\mathcal{L}/\partial \xi_i = 0 \;\Rightarrow\; C - \alpha_i - \mu_i = 0$, i.e. $\mu_i = C - \alpha_i$ — new.
Combining $\mu_i = C - \alpha_i$ with $\mu_i \ge 0$ and $\alpha_i \ge 0$ gives the box constraint:

$$0 \le \alpha_i \le C$$
Substituting back, $\xi_i$ and $\mu_i$ vanish entirely from the objective — and we get exactly the same dual as hard-margin, but with $\alpha_i$ now upper-bounded by $C$:

$$\max_{\boldsymbol{\alpha}}\ \sum_i \alpha_i - \tfrac{1}{2}\sum_i\sum_j \alpha_i \alpha_j\, y^{(i)} y^{(j)}\, k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$$
subject to $0 \le \alpha_i \le C$ and $\sum_i \alpha_i y^{(i)} = 0$.
The kernel trick still applies — only inner products $k(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$ appear. The dual is almost identical to the hard-margin one. The single difference, $\alpha_i \le C$, captures the entire effect of the slack relaxation.
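To make the dual concrete, it can be handed directly to a generic QP solver. A minimal sketch, assuming CVXOPT is installed, with a linear kernel and toy data (function name and data are mine); real implementations use SMO instead, as discussed below:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_soft_margin_dual(X, y, C):
    """Solve the soft-margin dual as a QP: min (1/2) a'Pa + q'a
    subject to 0 <= a_i <= C and sum_i a_i y_i = 0 (linear kernel)."""
    n = X.shape[0]
    K = X @ X.T                                      # Gram matrix (linear kernel)
    P = matrix(np.outer(y, y) * K)                   # P_ij = y_i y_j k(x_i, x_j)
    q = matrix(-np.ones(n))                          # maximise sum(a) -> minimise -sum(a)
    G = matrix(np.vstack([-np.eye(n), np.eye(n)]))   # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
    A = matrix(y.reshape(1, -1).astype(float))       # equality constraint y'a = 0
    b = matrix(0.0)
    solvers.options["show_progress"] = False
    sol = solvers.qp(P, q, G, h, A, b)
    return np.array(sol["x"]).ravel()                # the alphas

# Toy usage (illustrative data):
X = np.array([[2.0, 2.0], [1.5, 1.8], [-1.0, -1.2], [-2.0, -1.8]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(np.round(solve_soft_margin_dual(X, y, C=1.0), 4))
```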
KKT Conditions and Support Vector Categories
The complementary slackness conditions for soft-margin SVM are:

$$\alpha_i\left[y^{(i)}\left(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b\right) - 1 + \xi_i\right] = 0, \qquad \mu_i\,\xi_i = 0$$
These partition training points into three categories based on the value of $\alpha_i$:
| Type | Geometric position | $\alpha_i$ | $\xi_i$ |
|---|---|---|---|
| Not a support vector | Strictly outside the margin envelope: $y^{(i)}(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b) > 1$ | $\alpha_i = 0$ | $\xi_i = 0$ |
| Margin support vector | Exactly on the margin: $y^{(i)}(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b) = 1$ | $0 < \alpha_i < C$ | $\xi_i = 0$ |
| Bound (clipped) support vector | On the margin ($\xi_i = 0$), inside the margin ($0 < \xi_i < 1$), on the decision boundary ($\xi_i = 1$), or misclassified ($\xi_i > 1$) | $\alpha_i = C$ | $\xi_i \ge 0$ |
Why $0 < \alpha_i < C$ pins the point exactly on the margin
If $\alpha_i < C$ then $\mu_i = C - \alpha_i > 0$, so the second complementary-slackness condition $\mu_i \xi_i = 0$ forces $\xi_i = 0$. And $\alpha_i > 0$ forces the first condition’s bracket to vanish: $y^{(i)}(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b) - 1 + \xi_i = 0$, i.e., with $\xi_i = 0$, $y^{(i)}(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b) = 1$. The point sits exactly on the margin.
The support-vector population in soft-margin SVM is larger than in hard-margin: every margin-violating example contributes ($\alpha_i = C$), in addition to the on-margin ones. Sparsity weakens but is preserved — examples comfortably outside the margin still have $\alpha_i = 0$.
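With a fitted scikit-learn `SVC` you can recover these categories: `dual_coef_` stores $y^{(i)}\alpha_i$ for the support vectors, so an absolute value equal to $C$ flags a bound support vector and anything strictly between $0$ and $C$ is a margin support vector. A sketch (dataset and tolerance choices are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_clusters_per_class=1, flip_y=0.1, random_state=0)

C = 1.0
clf = SVC(kernel="rbf", C=C).fit(X, y)

# dual_coef_ holds y_i * alpha_i for the support vectors, so |dual_coef_| = alpha_i.
alpha = np.abs(clf.dual_coef_).ravel()
bound = np.isclose(alpha, C)      # alpha_i = C: inside the margin or misclassified
margin = ~bound                   # 0 < alpha_i < C: exactly on the margin
print(f"{margin.sum()} margin SVs, {bound.sum()} bound SVs, "
      f"{len(X) - len(alpha)} non-SVs with alpha_i = 0")
```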
Predictions
Identical in form to the hard-margin dual:

$$f(\mathbf{x}) = \operatorname{sign}\left(\sum_i \alpha_i y^{(i)}\, k(\mathbf{x}^{(i)}, \mathbf{x}) + b\right)$$
For $b$, average over the margin support vectors only (those with $0 < \alpha_i < C$, where $y^{(i)}(\mathbf{w}^\top\phi(\mathbf{x}^{(i)}) + b) = 1$ holds exactly):

$$b = \frac{1}{|\mathcal{M}|}\sum_{i \in \mathcal{M}}\left(y^{(i)} - \sum_j \alpha_j y^{(j)}\, k(\mathbf{x}^{(j)}, \mathbf{x}^{(i)})\right)$$

where $\mathcal{M} = \{\, i : 0 < \alpha_i < C \,\}$. Don’t use bound support vectors ($\alpha_i = C$) — their $\xi_i > 0$ in general, so they’d give wrong values for $b$.
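A minimal sketch of recovering $b$ and forming predictions from the dual solution, assuming a linear kernel and an `alphas` vector produced by a dual solver such as the QP sketch above (the names and tolerance are mine):

```python
import numpy as np

def bias_from_margin_svs(alphas, X, y, C, tol=1e-6):
    """Estimate b by averaging over margin support vectors (0 < alpha_i < C),
    for which y_i * (w . x_i + b) = 1 holds exactly (linear kernel, y in {-1, +1})."""
    K = X @ X.T                                    # Gram matrix
    margin = (alphas > tol) & (alphas < C - tol)   # free (margin) support vectors
    if not margin.any():
        raise ValueError("no margin support vectors; see 'What Could Go Wrong'")
    # For each margin SV i: b = y_i - sum_j alpha_j y_j k(x_j, x_i); then average.
    b_vals = y[margin] - (alphas * y) @ K[:, margin]
    return b_vals.mean()

def decision_value(x_new, alphas, X, y, b):
    """f(x) = sum_i alpha_i y_i k(x_i, x) + b; its sign is the predicted class."""
    return (alphas * y) @ (X @ x_new) + b
```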
What Could Go Wrong
- Wrong $C$. Too small and the model underfits (everything gets absorbed into slack); too large and we’re back to overfitting via hard-margin behaviour. Always cross-validate.
- No margin support vectors. If every support vector is at $\alpha_i = C$, the formula above for $b$ has no inputs. In practice this is rare but possible; pick any of them and accept the resulting $b$ (or add small regularisation).
- Unscaled features. Same as hard-margin — kernels depend on inner products $\mathbf{x}^\top\mathbf{x}'$ or distances $\|\mathbf{x} - \mathbf{x}'\|$. Standardise before training (see the pipeline sketch below).
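With scikit-learn, one way to bake the scaling in is a pipeline, so standardisation is fitted on training folds only during cross-validation and applied automatically at predict time; a minimal sketch with illustrative data:

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)

# The scaler lives inside the pipeline, so it is re-fit on each training fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0)).fit(X, y)
```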
How It’s Solved
The soft-margin dual is a convex QP with $n$ variables and a linear equality constraint plus box constraints — solvable in principle by off-the-shelf QP solvers, but memory and time costs make this prohibitive for large $n$. The standard approach is Sequential Minimal Optimization (SMO), which decomposes the problem into two-variable subproblems, each solvable analytically.
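For intuition, here is a minimal sketch of the analytic two-multiplier update at the heart of SMO, omitting pair selection, the error cache, and the bias update (the function name and signature are mine, not a library API):

```python
import numpy as np

def smo_pair_update(i, j, alpha, y, K, C, errors):
    """One analytic SMO step on the pair (alpha_i, alpha_j).

    errors[k] = f(x_k) - y_k under the current alphas. Returns updated
    (alpha_i, alpha_j) with sum_k alpha_k y_k preserved and both
    multipliers kept inside the box [0, C].
    """
    if y[i] != y[j]:                                   # bounds of the feasible segment
        L, H = max(0.0, alpha[j] - alpha[i]), min(C, C + alpha[j] - alpha[i])
    else:
        L, H = max(0.0, alpha[i] + alpha[j] - C), min(C, alpha[i] + alpha[j])
    eta = K[i, i] + K[j, j] - 2.0 * K[i, j]            # curvature along the segment
    if eta <= 0 or L == H:
        return alpha[i], alpha[j]                      # skip degenerate pairs
    a_j = alpha[j] + y[j] * (errors[i] - errors[j]) / eta
    a_j = float(np.clip(a_j, L, H))                    # respect the box constraint
    a_i = alpha[i] + y[i] * y[j] * (alpha[j] - a_j)    # keep sum_k alpha_k y_k fixed
    return a_i, a_j
```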
Connections
- Builds on support-vector-machine — relaxes the hard-margin constraint without changing the dual structure.
- Uses slack-variables — the per-example penalty mechanism.
- Solved by sequential-minimal-optimization — analytic two-multiplier updates.
- Complements kernel-trick — kernels handle non-linearity; slack handles non-separability. In practice you nearly always combine them: RBF kernel + soft margin is the default in sklearn’s SVC.