A function $k(x, x')$ is a valid kernel iff it corresponds to an inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle$ in some feature space. Mercer’s condition is the test: for any finite set of points, the Gram matrix must be symmetric and positive semidefinite.
Statement
Given any finite set of points $x_1, \dots, x_n$, build the Gram (kernel) matrix:

$$K_{ij} = k(x_i, x_j), \quad i, j = 1, \dots, n$$
Mercer’s condition requires that for every such finite set:
- Symmetric: $K_{ij} = K_{ji}$, i.e. $k(x_i, x_j) = k(x_j, x_i)$.
- Positive semidefinite: $c^\top K c = \sum_{i,j} c_i c_j K_{ij} \ge 0$ for all $c \in \mathbb{R}^n$.
If both hold, $k$ corresponds to the inner product $k(x, x') = \langle \phi(x), \phi(x') \rangle$ for some feature map $\phi$ (which Mercer’s theorem constructs but doesn’t require us to compute).
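Mercer’s condition quantifies over *every* finite point set, so no computation can verify it exhaustively, but a spot check on one random sample can quickly falsify a bad candidate. A minimal numpy sketch, assuming the Gaussian kernel; `gamma` and the sample size are illustrative choices:

```python
# Spot check of Mercer's condition on one random sample. Passing this check
# does not prove validity (Mercer quantifies over ALL finite sets), but
# failing it definitively rules a candidate out.
import numpy as np

def gaussian_kernel(x, z, gamma=0.5):
    return np.exp(-gamma * np.sum((x - z) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 random points in R^3
n = len(X)

# Gram matrix K_ij = k(x_i, x_j)
K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

print("symmetric:", np.allclose(K, K.T))
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 up to float noise
```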
Why These Conditions
Inner products must be symmetric ($\langle u, v \rangle = \langle v, u \rangle$) and produce non-negative self-products ($\langle u, u \rangle \ge 0$). Mercer’s condition demands the kernel inherit those properties at every scale: for any sample of points, the implied geometry is consistent with an inner product in a Hilbert space.
Algorithmically, positive semidefiniteness is what keeps the SVM dual a convex QP. If $K$ has a negative eigenvalue, the dual objective is no longer concave and the optimisation can wander to nonsense.
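To see the breakage concretely, here is a toy sketch (the indefinite matrix $Q$ is hand-picked, not derived from any dataset): along the eigenvector of a negative eigenvalue, the dual objective, ignoring the box constraints, grows without bound.

```python
# Ignoring the box constraints, the SVM dual maximises
#     W(alpha) = sum(alpha) - 0.5 * alpha^T Q alpha,   Q_ij = y_i y_j K_ij.
# Along an eigenvector v with eigenvalue lam < 0,
#     W(t * v) = t * sum(v) + 0.5 * |lam| * t**2,
# which grows without bound as t increases: the objective is not concave.
import numpy as np

Q = np.array([[0.0, -1.0],
              [-1.0, 0.0]])            # symmetric but indefinite (eigenvalues -1, 1)
lam, V = np.linalg.eigh(Q)
v = V[:, 0]                            # eigenvector for the negative eigenvalue

def dual_objective(alpha):
    return alpha.sum() - 0.5 * alpha @ Q @ alpha

for t in [1.0, 10.0, 100.0]:
    print(t, dual_objective(t * v))    # grows like 0.5 * t**2
```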
Kernel Composition Rules
Verifying Mercer’s condition from scratch is often hard. Instead, build kernels by composition: starting from validated kernels $k_1$ and $k_2$, the following are also valid:
| Rule | Valid kernel |
|---|---|
| Scaling by a constant | $c\,k_1(x, x')$ for any $c \ge 0$ |
| Multiplying by any function on each side | $f(x)\,k_1(x, x')\,f(x')$ for any function $f$ |
| Polynomial of $k_1$ with non-negative coefficients | $P(k_1(x, x'))$, $P$ with coefficients $\ge 0$ |
| Exponential of a kernel | $\exp(k_1(x, x'))$ |
| Sum of kernels | $k_1(x, x') + k_2(x, x')$ |
| Product of kernels | $k_1(x, x')\,k_2(x, x')$ |
The Gaussian kernel can be derived as a chain of these rules starting from the linear kernel (see gaussian-kernel for the proof).
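The rules can also be spot-checked numerically. A sketch, assuming a small random sample (the 0.3 scale is an arbitrary choice to keep the exponential well-conditioned): chain scaling, sum, product, and exponential starting from the linear kernel, and confirm every intermediate Gram matrix stays PSD.

```python
# Numerical spot check of the composition rules: each derived Gram matrix
# should remain positive semidefinite, up to floating-point round-off.
import numpy as np

rng = np.random.default_rng(1)
X = 0.3 * rng.normal(size=(15, 4))

K_lin = X @ X.T                        # linear kernel: k(x, x') = x . x'
K_prod = K_lin * K_lin                 # product rule (elementwise = kernel product)
K_sum = 2.0 * K_lin + K_prod           # scaling + sum rules
K_exp = np.exp(K_sum)                  # exponential rule (elementwise)

for name, K in [("linear", K_lin), ("product", K_prod),
                ("scaled sum", K_sum), ("exp", K_exp)]:
    print(name, np.linalg.eigvalsh(K).min())   # all >= 0 up to round-off
```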
Active Recall
A friend proposes using $k(x, x') = -\|x - x'\|^2$ as a kernel because it captures "similarity" (closer points get larger values, since the values are less negative). Is it a valid kernel?
No. Take any two distinct points $x_1 \neq x_2$. The Gram matrix is $K = \begin{pmatrix} 0 & -d \\ -d & 0 \end{pmatrix}$ where $d = \|x_1 - x_2\|^2 > 0$. This has eigenvalues $d$ and $-d$; the negative eigenvalue means $K$ is not positive semidefinite, so Mercer’s condition fails. The function is “similarity-shaped” but doesn’t correspond to any inner product geometry.
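The same counterexample, checked numerically (the two points below are arbitrary; any distinct pair gives one positive and one negative eigenvalue):

```python
# The 2x2 Gram matrix of the "negative squared distance" function is
# indefinite for any pair of distinct points, so Mercer's condition fails.
import numpy as np

def not_a_kernel(x, z):
    return -np.sum((x - z) ** 2)       # -||x - z||^2: similarity-shaped, invalid

x1, x2 = np.array([0.0, 0.0]), np.array([3.0, 4.0])
K = np.array([[not_a_kernel(x1, x1), not_a_kernel(x1, x2)],
              [not_a_kernel(x2, x1), not_a_kernel(x2, x2)]])

print(np.linalg.eigvalsh(K))           # [-25.  25.]: indefinite
```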
Why does verifying Mercer's condition matter — couldn't we just plug any function into the SVM dual and see what happens?
Two failure modes. Mathematically, a non-Mercer function may produce a Gram matrix with negative eigenvalues; the dual objective is then non-concave, the optimisation problem is non-convex, and there are no guarantees of finding the global optimum. Geometrically, the resulting decision function doesn’t correspond to a hyperplane in any feature space, so the margin interpretation evaporates and predictions can be wildly wrong even on the training set. The “trick” only works when the kernel really does compute an inner product.
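For concreteness, here is a hedged sketch of what “plugging it in anyway” looks like, using scikit-learn’s precomputed-kernel interface with the invalid $-\|x - x'\|^2$ from above (the toy data and labels are invented for illustration; the solver may still terminate, but its output carries no optimality guarantee):

```python
# "Plugging it in anyway": fit an SVM on an indefinite Gram matrix. The
# underlying QP is non-convex, so whatever the solver returns is not
# guaranteed to be a global optimum and has no margin interpretation.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 2))
y = np.sign(X[:, 0])                                    # toy labels

K = -((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # indefinite Gram matrix

clf = SVC(kernel="precomputed").fit(K, y)               # runs without guarantees
print("training accuracy:", clf.score(K, y))            # can be poor even here
```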
Related
- kernel-trick — Mercer’s condition is the gate that determines which functions can be used as kernels
- gaussian-kernel — proof of validity uses the composition rules
- polynomial-kernel — derived from a polynomial of the linear kernel (rule 3)