A function $k(x, x')$ is a valid kernel iff it corresponds to an inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle$ in some feature space. Mercer’s condition is the test: for any finite set of points, the Gram matrix must be symmetric and positive semidefinite.
Statement
Given any finite set of points $x_1, \dots, x_n$, build the Gram (kernel) matrix:

$$K_{ij} = k(x_i, x_j), \quad i, j = 1, \dots, n$$
Mercer’s condition requires that for every such finite set:
- Symmetric: $K_{ij} = K_{ji}$, i.e. $k(x_i, x_j) = k(x_j, x_i)$.
- Positive semidefinite: $c^\top K c = \sum_{i,j} c_i c_j K_{ij} \ge 0$ for all $c \in \mathbb{R}^n$.
If both hold, $k$ corresponds to the inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for some feature map $\phi$ (which Mercer’s theorem constructs but doesn’t require us to compute).
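Mercer’s condition quantifies over *every* finite point set, so no computation can verify it exhaustively, but a spot check on one random sample can quickly falsify a bad candidate. A minimal numpy sketch, assuming the Gaussian kernel; `gamma` and the sample size are illustrative choices:

```python
# Spot check of Mercer's condition on one random sample. Passing this check
# does not prove validity (Mercer quantifies over ALL finite sets), but
# failing it definitively rules a candidate out.
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random points in R^3
n = len(X)

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to float noise
```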
Why These Conditions
Inner products must be symmetric ($\langle u, v \rangle = \langle v, u \rangle$) and produce non-negative self-products ($\langle u, u \rangle \ge 0$). Mercer’s condition demands the kernel inherit those properties at every scale: for any sample of points, the implied geometry is consistent with an inner product in a Hilbert space.
Algorithmically, positive semidefiniteness is what keeps the SVM dual a convex QP. If $K$ has a negative eigenvalue, the dual objective is no longer concave and the optimisation can wander to nonsense.
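To see the breakage concretely, here is a toy sketch (the indefinite matrix $Q$ is hand-picked, not derived from any dataset): along the eigenvector of a negative eigenvalue, the dual objective, ignoring the box constraints, grows without bound.

```python
# Ignoring the box constraints, the SVM dual maximises
#     W(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha,   Q_ij = y_i y_j K_ij.
# Along an eigenvector v with eigenvalue lam < 0,
#     W(t * v) = t * sum(v) + 0.5 * |lam| * t**2,
# which grows without bound as t increases: the objective is not concave.
import numpy as np

Q = np.array([[0.0, -1.0],
              [-1.0, 0.0]])            # symmetric but indefinite (eigenvalues -1, 1)
lam, V = np.linalg.eigh(Q)
v = V[:, 0]                            # eigenvector for the negative eigenvalue

def dual_objective(alpha):
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

for t in [1.0, 10.0, 100.0]:
    print(t, dual_objective(t * v))    # grows like 0.5 * t**2
```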
Kernel Composition Rules
Verifying Mercer’s condition from scratch is often hard. Instead, build kernels by composition: starting from validated kernels $k_1$ and $k_2$, the following are also valid:
| Rule | Valid kernel |
|---|---|
| Scaling by a constant | $c\,k_1(x, x')$ for any $c \ge 0$ |
| Multiplying by any function on each side | $f(x)\,k_1(x, x')\,f(x')$ for any function $f$ |
| Polynomial of $k_1$ with non-negative coefficients | $P(k_1(x, x'))$, $P$ with coefficients $\ge 0$ |
| Exponential of a kernel | $\exp(k_1(x, x'))$ |
| Sum of kernels | $k_1(x, x') + k_2(x, x')$ |
| Product of kernels | $k_1(x, x')\,k_2(x, x')$ |
The Gaussian kernel can be derived as a chain of these rules starting from the linear kernel (see gaussian-kernel for the proof).
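The rules can also be spot-checked numerically. A sketch, assuming a small random sample (the 0.3 scale is an arbitrary choice to keep the exponential well-conditioned): chain scaling, sum, product, and exponential starting from the linear kernel, and confirm every intermediate Gram matrix stays PSD.

```python
# Numerical spot check of the composition rules: each derived Gram matrix
# should remain positive semidefinite, up to floating-point round-off.
import numpy as np

rng = np.random.default_rng(1)
X = 0.3 * rng.normal(size=(15, 4))

K_lin = X @ X.T                        # linear kernel: k(x, x') = x . x'
K_prod = K_lin * K_lin                 # product rule (elementwise = kernel product)
K_sum = 2.0 * K_lin + K_prod           # scaling + sum rules
K_exp = np.exp(K_sum)                  # exponential rule (elementwise)

for name, K in [("linear", K_lin), ("product", K_prod),
                ("scaled sum", K_sum), ("exp", K_exp)]:
    print(name, np.linalg.eigvalsh(K).min())   # all >= 0 up to round-off
```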
Active Recall
A friend proposes using $k(x, x') = -\|x - x'\|^2$ as a kernel because it captures "similarity" (closer points get larger values, since the values are less negative). Is it a valid kernel?
No. Take any two distinct points $x_1 \neq x_2$. The Gram matrix is $K = \begin{pmatrix} 0 & -d \\ -d & 0 \end{pmatrix}$ where $d = \|x_1 - x_2\|^2 > 0$. This has eigenvalues $d$ and $-d$; the negative eigenvalue means $K$ is not positive semidefinite, so Mercer’s condition fails. The function is “similarity-shaped” but doesn’t correspond to any inner product geometry.
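The same counterexample, checked numerically (the two points below are arbitrary; any distinct pair gives one positive and one negative eigenvalue):

```python
# The 2x2 Gram matrix of the "negative squared distance" function is
# indefinite for any pair of distinct points, so Mercer's condition fails.
import numpy as np

def not_a_kernel(x, z):
    return -np.sum((x - z) ** 2)       # -||x - z||^2: similarity-shaped, invalid

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
K = np.array([[not_a_kernel(x1, x1), not_a_kernel(x1, x2)],
              [not_a_kernel(x2, x1), not_a_kernel(x2, x2)]])

print(np.linalg.eigvalsh(K))           # [-25.  25.]: indefinite
```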
Why does verifying Mercer's condition matter — couldn't we just plug any function into the SVM dual and see what happens?
Two failure modes. Mathematically, a non-Mercer function may produce a Gram matrix with negative eigenvalues; the dual objective is then non-concave, the optimisation problem is non-convex, and there are no guarantees of finding the global optimum. Geometrically, the resulting decision function doesn’t correspond to a hyperplane in any feature space, so the margin interpretation evaporates and predictions can be wildly wrong even on the training set. The “trick” only works when the kernel really does compute an inner product.
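For concreteness, here is a hedged sketch of what “plugging it in anyway” looks like, using scikit-learn’s precomputed-kernel interface with the invalid $-\|x - x'\|^2$ from above (the toy data and labels are invented for illustration; the solver may still terminate, but its output carries no optimality guarantee):

```python
# "Plugging it in anyway": fit an SVM on an indefinite Gram matrix. The
# underlying QP is non-convex, so whatever the solver returns is not
# guaranteed to be a global optimum and has no margin interpretation.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0])                                    # toy labels

K = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # indefinite Gram matrix

clf = SVC(kernel="precomputed").fit(K, y)               # runs without guarantees
print("training accuracy:", clf.score(K, y))            # can be poor even here
```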
Related
- kernel-trick — Mercer’s condition is the gate that determines which functions can be used as kernels
- gaussian-kernel — proof of validity uses the composition rules
- polynomial-kernel — derived from a polynomial of the linear kernel (rule 3)