Also called the Radial Basis Function (RBF) kernel. Defined as k(x, x′) = exp(−‖x − x′‖² / (2σ²)). The implicit feature map is infinite-dimensional, so we could never compute it explicitly — but with the kernel trick we don’t have to. Its expressiveness and lack of strong assumptions make it the default kernel for non-linear SVMs.

Definition

k(x, x′) = exp(−‖x − x′‖² / (2σ²))

The output decays from 1 (when x = x′) towards 0 (as the distance grows). The bandwidth σ controls how quickly: small σ → sharp peak (only very close points are similar); large σ → broad peak (more points contribute).
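The definition can be sketched in a few lines of NumPy. The function name `gaussian_kernel` is ours, not from any library; σ = 1 is the default bandwidth.

```python
import numpy as np

def gaussian_kernel(x, x_prime, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - x'||^2 / (2 sigma^2))."""
    sq_dist = np.sum((np.asarray(x, float) - np.asarray(x_prime, float)) ** 2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

# Identical points -> similarity exactly 1; distance drives the value toward 0.
print(gaussian_kernel([0.0, 0.0], [0.0, 0.0]))   # 1.0
print(gaussian_kernel([0.0, 0.0], [3.0, 4.0]))   # exp(-25/2), tiny

# Smaller sigma -> sharper decay at the same distance.
print(gaussian_kernel([0.0], [1.0], sigma=0.5) < gaussian_kernel([0.0], [1.0], sigma=2.0))  # True
```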

Why It’s Infinite-Dimensional

Expand the kernel as a Taylor series (taking σ = 1):

exp(−‖x − x′‖² / 2) = exp(−‖x‖² / 2) · exp(−‖x′‖² / 2) · Σ_{d=0}^{∞} (xᵀx′)^d / d!

Each term (xᵀx′)^d corresponds to a polynomial kernel of degree d, which has its own finite-dimensional embedding. The Gaussian kernel is an infinite weighted sum of all of them — its embedding has one dimension per monomial of every degree, infinitely many. The kernel computation, of course, is just one exponential.
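The expansion can be checked numerically: summing the polynomial-kernel terms (with the per-point scale factors) should reproduce the closed-form kernel. A minimal sketch, assuming σ = 1 and two arbitrary points:

```python
import math
import numpy as np

x, xp = np.array([0.5, -1.0]), np.array([0.2, 0.3])

# Closed form, sigma = 1: exp(-||x - x'||^2 / 2)
gauss = np.exp(-np.sum((x - xp) ** 2) / 2.0)

# Rebuild it from polynomial-kernel terms (x.x')^d / d!, scaled by the
# per-point factors exp(-||x||^2 / 2) * exp(-||x'||^2 / 2).
scale = np.exp(-(x @ x) / 2.0) * np.exp(-(xp @ xp) / 2.0)
series = scale * sum((x @ xp) ** d / math.factorial(d) for d in range(20))

print(abs(gauss - series))  # ~0: the truncated series already matches
```

Twenty terms suffice here because the factorial in the denominator makes the series converge extremely fast.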

Validity (Sketch of Mercer Proof)

Starting from the linear kernel (valid by inspection: k(x, x′) = xᵀx′ is an inner product, so any Gram matrix it produces is positive semi-definite), apply composition rules:

  1. Σ_{d=0}^{D} (xᵀx′)^d / d! is valid (rule: polynomial with non-negative coefficients).
  2. exp(xᵀx′) is valid (rule: exponential of a kernel; or rule: sum of valid kernels, taking the limit D → ∞).
  3. Multiply by f(x)f(x′) with f(x) = exp(−‖x‖² / 2) — still valid (rule: if k is valid, then f(x) · k(x, x′) · f(x′) is valid for any function f).

The result is exactly exp(−‖x‖² / 2) · exp(xᵀx′) · exp(−‖x′‖² / 2) = exp(−‖x − x′‖² / 2), the Gaussian kernel (taking σ = 1 for brevity).
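Validity can also be spot-checked empirically: a Mercer kernel must yield a positive semi-definite Gram matrix for any point set, so all eigenvalues should be non-negative (up to floating-point noise). A sketch with random points:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))  # 30 arbitrary points in R^4

# Gram matrix K_ij = exp(-||x_i - x_j||^2 / 2), sigma = 1.
sq = np.sum(X ** 2, axis=1)
K = np.exp(-(sq[:, None] + sq[None, :] - 2 * X @ X.T) / 2.0)

# Mercer validity in practice: K is symmetric PSD, so no eigenvalue
# dips below zero (beyond numerical noise).
eig = np.linalg.eigvalsh(K)
print(eig.min() >= -1e-10)  # True
```

This is evidence, not a proof — the proof is the composition argument above — but it is a useful sanity check when building custom kernels.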

Practical Behaviour

  • Default for non-linear SVM. Used when no domain-specific structure suggests a different kernel.
  • Sensitive to σ. Too small → kernel sees only nearest neighbours, behaves like 1-NN, overfits. Too large → kernel is nearly constant, every point looks similar to every other, underfits. Cross-validate.
  • Sensitive to feature scaling. Distances are dominated by the largest-scale feature. Standardise inputs first.
  • Universal approximator. With enough support vectors and the right σ, the Gaussian-kernel SVM can approximate any decision boundary — at the cost of greater overfitting risk.
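The two failure modes of the bandwidth show up directly in the Gram matrix: a tiny σ drives it toward the identity (1-NN-like behaviour), a huge σ toward the all-ones matrix (everything looks the same). A small sketch, with the helper name `mean_off_diagonal_similarity` being ours:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
sq = np.sum(X ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # squared pairwise distances

def mean_off_diagonal_similarity(sigma):
    K = np.exp(-d2 / (2 * sigma ** 2))
    return K[~np.eye(len(K), dtype=bool)].mean()

# Tiny sigma: off-diagonal similarities collapse toward 0 (1-NN-like, overfits).
# Huge sigma: every entry is near 1, a nearly constant kernel (underfits).
print(mean_off_diagonal_similarity(0.01))   # ~0
print(mean_off_diagonal_similarity(100.0))  # ~1
```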

On Real Data

Two point clouds that are not linearly separable, fit with a Gaussian-kernel SVM. The kernel produces curved, multi-modal boundaries that linear or low-degree-polynomial kernels can’t reach; support vectors (red circles) cluster near the boundary as expected.

Connection to Local Methods

The decision function f(x) = Σᵢ αᵢ yᵢ k(x, xᵢ) + b is a weighted combination of Gaussian bumps, one per support vector. In that sense, RBF SVM is structurally similar to a parametric form of kernel density classification: it asks how close the test point is to each support vector, weighted by class label and learned multiplier. Locality is built in.
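The bump picture can be sketched directly. Everything here is illustrative: `rbf_decision` is our own name, the dual coefficients fold the label into the multiplier (αᵢ·yᵢ), and the two support vectors are hypothetical, one per class.

```python
import numpy as np

def rbf_decision(x, support_vectors, dual_coefs, b, sigma=1.0):
    """f(x) = sum_i (alpha_i * y_i) * exp(-||x - x_i||^2 / (2 sigma^2)) + b."""
    d2 = np.sum((support_vectors - x) ** 2, axis=1)
    return float(dual_coefs @ np.exp(-d2 / (2 * sigma ** 2)) + b)

sv = np.array([[0.0, 0.0], [3.0, 3.0]])   # one support vector per class
coefs = np.array([+1.0, -1.0])            # alpha_i * y_i (illustrative values)

# The sign of f flips as the test point moves from one bump to the other:
print(rbf_decision(np.array([0.1, 0.0]), sv, coefs, b=0.0) > 0)  # near the +1 bump
print(rbf_decision(np.array([2.9, 3.0]), sv, coefs, b=0.0) < 0)  # near the -1 bump
```

Far from every support vector, all the bumps decay toward zero and f(x) falls back to the bias b — exactly the locality the paragraph above describes.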

Active Recall