slack-variables

A non-negative scalar $ξ^{(n)} \geq 0$ attached to each training example, measuring how far that example slips inside or across the SVM margin envelope. Slack variables convert the hard “everyone outside the margin” constraint into a budgeted penalty — and turn an infeasible separation problem into a tractable one.

The Constraint Modification

The hard-margin SVM requires every example to satisfy:

$y^{(n)} h (x^{(n)}) \geq 1$

If the data are not linearly separable in $ϕ$-space, no $(w, b)$ satisfies all of these constraints at once, so the optimisation has no feasible point. Slack relaxes the right-hand side by a non-negative amount:

$y^{(n)} h (x^{(n)}) \geq 1 - ξ^{(n)}, \quad ξ^{(n)} \geq 0$

Each $ξ^{(n)}$ is a per-example quota for “how much we’re willing to bend the margin rule for this point.”

Geometric Interpretation

The value of $ξ^{(n)}$ encodes the position of point $n$ relative to the margin and decision boundary:

$ξ^{(n)}$	Geometric meaning
$ξ^{(n)} = 0$	Correctly classified, on or outside the margin (constraint $y h \geq 1$ holds)
$ξ^{(n)} \in (0, 1)$	Correctly classified, but inside the margin envelope
$ξ^{(n)} = 1$	Sits exactly on the decision boundary ( $h (x^{(n)}) = 0$ )
$ξ^{(n)} > 1$	Misclassified — on the wrong side of the decision boundary

A useful identity: at the optimum, $ξ^{(n)} = \max (0, 1 - y^{(n)} h (x^{(n)}))$, which is the hinge loss of example $n$. If the example already satisfies the margin, $ξ^{(n)} = 0$; otherwise, $ξ^{(n)}$ measures the shortfall.
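The mapping from slack value to geometric case can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names `slack` and `region` and the example scores are hypothetical, and the exact floating-point comparisons only work because the chosen values are exactly representable.

```python
def slack(y, score):
    """Slack of one example: y is the label in {-1, +1}, score is h(x)."""
    return max(0.0, 1.0 - y * score)


def region(xi):
    """Map a slack value to its geometric case.

    Exact equality tests are for illustration only; real scores would
    need a tolerance.
    """
    if xi == 0:
        return "on or outside the margin"
    if xi < 1:
        return "inside the margin, correctly classified"
    if xi == 1:
        return "on the decision boundary"
    return "misclassified"


# Hypothetical (label, score) pairs chosen to hit each case once.
examples = [(+1, 2.0), (+1, 0.5), (-1, 0.0), (-1, 0.5)]
for y, score in examples:
    xi = slack(y, score)
    print(f"y={y:+d}  h(x)={score:4.1f}  slack={xi:.1f}  {region(xi)}")
```

The last pair has label $-1$ but a positive score, so $y h = -0.5$ and the slack is $1.5$, which lands in the misclassified case.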

Where Slack Enters the Objective

The slack-augmented soft-margin SVM objective is:

$\arg\min_{w, b, ξ} \; \frac{1}{2} ∥ w ∥^{2} + C \sum_{n = 1}^{N} ξ^{(n)}$

The sum $\sum_{n} ξ^{(n)}$ is the total margin violation across the training set, weighted by the hyperparameter $C$ . Note: we minimise the sum of values, not the count of non-zero slacks. A misclassified example with $ξ = 1.5$ costs more than two margin-grazing examples with $ξ = 0.4$ each.
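The difference between penalising the sum and penalising the count can be checked with the numbers from the paragraph above. This is a small sketch; the helper names and the value `C = 1.0` are assumptions made for illustration.

```python
def total_slack(slacks):
    """Sum of slack values: the quantity the soft-margin objective penalises."""
    return sum(slacks)


def violation_count(slacks):
    """Number of non-zero slacks: the quantity the objective does NOT use."""
    return sum(1 for xi in slacks if xi > 0)


one_misclassified = [1.5]   # a single example with slack 1.5
two_grazing = [0.4, 0.4]    # two examples just inside the margin

C = 1.0  # hypothetical penalty weight
for name, slacks in [("one misclassified", one_misclassified),
                     ("two grazing", two_grazing)]:
    penalty = C * total_slack(slacks)
    print(f"{name}: penalty={penalty:.1f}, count={violation_count(slacks)}")
```

By count, the single misclassified point looks cheaper (one violation against two); by sum, it is the more expensive configuration, which is what the objective actually sees.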

Slack on every example, even those that don't need it

Every training point gets its own $ξ^{(n)}$ in the formulation — including ones that are happily outside the margin. At the optimum those have $ξ^{(n)} = 0$ , but they are still variables in the problem. The slack is never added selectively to violators; it is allocated to all and pinned to zero where the data permits.

Why Slack Always Hits the Constraint With Equality

For any $C > 0$, every example with $ξ^{(n)} > 0$ at the optimum satisfies the margin constraint with equality:

$y^{(n)} h (x^{(n)}) = 1 - ξ^{(n)}$

This follows from KKT complementary slackness; see soft-margin-svm for the derivation. The practical upshot: $ξ^{(n)}$ is not just an upper bound on the violation; it equals the violation exactly.
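The argument can be sketched as follows, assuming the standard Lagrangian of the soft-margin problem. The multipliers $α^{(n)} \geq 0$ (for the margin constraints) and $μ^{(n)} \geq 0$ (for $ξ^{(n)} \geq 0$) are notation introduced here for illustration and may not match soft-margin-svm.

$\mathcal{L} = \frac{1}{2} ∥ w ∥^{2} + C \sum_{n} ξ^{(n)} - \sum_{n} α^{(n)} [ y^{(n)} h (x^{(n)}) - 1 + ξ^{(n)} ] - \sum_{n} μ^{(n)} ξ^{(n)}$

Stationarity with respect to $ξ^{(n)}$ gives:

$C - α^{(n)} - μ^{(n)} = 0$

Complementary slackness gives:

$μ^{(n)} ξ^{(n)} = 0, \quad α^{(n)} [ y^{(n)} h (x^{(n)}) - 1 + ξ^{(n)} ] = 0$

If $ξ^{(n)} > 0$, the first condition forces $μ^{(n)} = 0$, so stationarity gives $α^{(n)} = C > 0$, and the second condition then forces $y^{(n)} h (x^{(n)}) = 1 - ξ^{(n)}$.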

Why Slack Doesn’t Run Away

If slack carried no penalty, the optimiser could inflate every $ξ^{(n)}$ until all constraints held trivially, which would let $∥ w ∥$ shrink to zero. The hyperparameter $C$ prevents this:

Large $C$ : each unit of slack is expensive → the optimiser tolerates fewer/smaller margin violations → narrow margin, may overfit
Small $C$ : slack is cheap → many points allowed inside or across the margin → wide margin, may underfit

See soft-margin-svm for the full $C$ -tradeoff discussion.
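The effect of $C$ can be illustrated by scoring two hand-picked candidate solutions on a toy 1-D dataset. This is a sketch under stated assumptions: the data, the two candidates, and the function name `objective` are all hypothetical, and the candidates are not the output of an optimiser. With $h(x) = w x + b$, the margin half-width is $1 / |w|$.

```python
def objective(w, b, xs, ys, C):
    """Soft-margin objective 0.5*w^2 + C*sum(slack) for 1-D inputs."""
    slacks = [max(0.0, 1.0 - y * (w * x + b)) for x, y in zip(xs, ys)]
    return 0.5 * w * w + C * sum(slacks)


# Hypothetical toy data: two negatives on the left, two positives on the right.
xs = [-2.0, -0.5, 0.5, 2.0]
ys = [-1, -1, 1, 1]

narrow = (2.0, 0.0)  # half-width 0.5: no point violates the margin
wide = (0.5, 0.0)    # half-width 2.0: the two inner points sit inside the margin

for C in (0.1, 10.0):
    j_narrow = objective(*narrow, xs, ys, C)
    j_wide = objective(*wide, xs, ys, C)
    preferred = "wide" if j_wide < j_narrow else "narrow"
    print(f"C={C}: narrow={j_narrow:.3f}, wide={j_wide:.3f}, preferred={preferred}")
```

The narrow candidate always costs $2.0$, while the wide one costs $0.125 + 1.5 C$, so the wide margin is preferred only when $C < 1.25$. Cheap slack favours the wide margin; expensive slack favours the narrow one.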

Active-Recall Questions

What does $ξ^{(n)} = 0.4$ tell you about example $n$ ?

The example is correctly classified but lies inside the margin envelope. The margin constraint is satisfied with equality at $y h = 0.6$ , so the point sits 40% of the way from the margin towards the decision boundary, on the correct side.

A misclassified example always has larger slack than a correctly-classified one inside the margin. True or false?

True. Misclassified means $y h < 0$, which forces $ξ = 1 - y h > 1$. Correctly-classified-but-inside-margin means $0 < y h < 1$, which gives $ξ = 1 - y h \in (0, 1)$. So every misclassified example has $ξ > 1$, while every correctly classified example inside the margin has $ξ < 1$.

Slack also ranks misclassified examples among themselves: the one deeper into the wrong territory (more negative $y h$) has the larger slack. A barely-misclassified point can have $ξ$ only slightly above 1.

Why minimise $\sum_{n} ξ^{(n)}$ rather than $| \{ n : ξ^{(n)} > 0 \} |$ ?

The count is non-convex and combinatorial: minimising the number of violations is NP-hard in general. The sum is a convex surrogate that still assigns a worse objective to more violation while keeping the problem a tractable QP. As a side effect, the sum penalises severity, not just occurrence: one badly misclassified point can cost more than several margin-grazing ones.

Connections

Component of soft-margin-svm: slack variables are the structural mechanism that makes SVM work on non-separable data.
Hinge loss: $ξ^{(n)} = \max (0, 1 - y^{(n)} h (x^{(n)}))$ is the per-example hinge loss. Dividing the objective by $C N$ shows that soft-margin SVM is equivalent to minimising mean hinge loss plus L2 regularisation $λ ∥ w ∥^{2}$ with $λ = 1 / (2 C N)$.
Why slack ≠ misclassification indicator: examples with $0 < ξ < 1$ are correctly classified; only $ξ > 1$ implies misclassification. Counting non-zero slacks therefore overcounts errors.
