A “trick” because we get the benefit of working in a high-dimensional embedding (sometimes infinite-dimensional) without ever computing it. Whenever an algorithm depends on the data only through inner products $x^\top z$, we can replace those inner products with a kernel function that returns the same value, computed entirely in the original space.

What is a Kernel Function?
A kernel function takes two points in the original input space and returns a real number:

$$k(x, z) = \phi(x)^\top \phi(z)$$

It is the inner product of the two points after they’ve been mapped into some feature space by $\phi$. Conceptually, $k(x, z)$ measures similarity: how aligned the two vectors are in the embedded space.
The trick: we don’t have to compute $\phi(x)$ to evaluate $k(x, z)$. For many useful $\phi$, there’s a closed-form expression for $k$ that uses only the original-space coordinates.
Worked Example: Polynomial Kernel of Degree 2
Take $x, z \in \mathbb{R}^3$ and the embedding:

$$\phi(x) = \left( x_1^2,\; x_2^2,\; x_3^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1 x_3,\; \sqrt{2}\,x_2 x_3 \right)$$

(The $\sqrt{2}$ factors are chosen to simplify the algebra below.) Computing the inner product explicitly:

$$\phi(x)^\top \phi(z) = x_1^2 z_1^2 + x_2^2 z_2^2 + x_3^2 z_3^2 + 2 x_1 x_2 z_1 z_2 + 2 x_1 x_3 z_1 z_3 + 2 x_2 x_3 z_2 z_3$$

This factors as $(x_1 z_1 + x_2 z_2 + x_3 z_3)^2$. So:

$$k(x, z) = \phi(x)^\top \phi(z) = (x^\top z)^2$$
Two operations in the original space (one inner product, one square) replace six multiplications in the 6-dimensional feature space. As the polynomial degree grows, the saving compounds.
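A quick numerical check of the identity; a minimal sketch assuming NumPy, with the embedding written out explicitly just to confirm the two routes agree:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 embedding of R^3 into R^6 (sqrt(2) cross terms)."""
    x1, x2, x3 = x
    return np.array([x1**2, x2**2, x3**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1 * x3,
                     np.sqrt(2) * x2 * x3])

def k_poly2(x, z):
    """Same value, computed in the original space: one inner product, one square."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0, 3.0])
z = np.array([0.5, -1.0, 2.0])
assert np.isclose(phi(x) @ phi(z), k_poly2(x, z))  # both give 20.25
```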
Why the Dual Lets Us Use Kernels
Recall the SVM dual:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i^\top x_j \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C$$

The data appears only through pairwise inner products $x_i^\top x_j$. Replacing each with $k(x_i, x_j)$ makes the entire optimisation independent of $\phi$:

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, k(x_i, x_j)$$

Same story for prediction:

$$f(x) = \operatorname{sign}\!\left( \sum_{i \in S} \alpha_i y_i \, k(x_i, x) + b \right)$$

where $S$ is the set of support-vector indices. Train and predict, all without forming $\phi(x)$ once.
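To make the substitution concrete, here is a minimal sketch (NumPy assumed; names like `gram_matrix` are hypothetical, not a specific library’s API) of the two kernelised pieces: the Gram matrix a dual solver consumes, and the support-vector prediction rule. The solver that produces the $\alpha_i$ and $b$ is elided.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j): the only place the data enters the dual."""
    n = len(X)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

def predict(x, sv_x, sv_y, sv_alpha, b, kernel):
    """sign(sum over support vectors of alpha_i * y_i * k(x_i, x) + b)."""
    s = sum(a * y * kernel(xi, x) for a, y, xi in zip(sv_alpha, sv_y, sv_x))
    return np.sign(s + b)
```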
Kernel Functions Without an Explicit $\phi$
The deeper insight: we don’t need to design $\phi$ first. We can write down a similarity function $k$ directly, as long as it corresponds to some valid inner product in some feature space. The condition that decides which functions qualify is Mercer’s condition (see mercers-condition).
This unlocks two things:
- Infinite-dimensional embeddings: the RBF kernel $k(x, z) = \exp\!\left(-\|x - z\|^2 / 2\sigma^2\right)$ corresponds to an infinite-dimensional $\phi$, which we’d never be able to compute explicitly.
- Non-numeric data: kernels can be defined directly on strings, trees, graphs, sets, or anything else for which a similarity function exists. The “feature space” is implicit, never realised (see the sketch below).
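To illustrate the non-numeric case, here is a minimal sketch of a bigram spectrum kernel on strings. It is a valid kernel because it equals the inner product of bigram-count vectors, yet those vectors are never materialised over the full bigram alphabet:

```python
from collections import Counter

def bigram_kernel(s, t):
    """Inner product of bigram-count vectors, computed sparsely."""
    cs = Counter(s[i:i + 2] for i in range(len(s) - 1))
    ct = Counter(t[i:i + 2] for i in range(len(t) - 1))
    return sum(c * ct[bg] for bg, c in cs.items())  # missing bigrams count as 0

print(bigram_kernel("banana", "bandana"))  # -> 7 (shared "ba", "an", "na")
```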
When the Kernel Trick Pays Off
In the primal, computation per training example is $O(D)$, where $D$ is the dimension of the feature space $\phi$ maps into. In the dual with a kernel, it’s $O(d)$ per kernel evaluation (with $d$ the input dimension), times $O(n^2)$ pairs. Dual is preferred when:
- $D \gg nd$ (high-dimensional or infinite embedding, modest training set), or
- $\phi$ is implicit (no closed form), or
- the input space is non-numeric.
When $D$ is small relative to $n$ (e.g., low-degree polynomial, large dataset), the primal can actually be cheaper.
Active Recall
The Gaussian kernel corresponds to an infinite-dimensional embedding. Why is this not a computational disaster?
Because we never compute the embedding. The kernel is evaluated entirely in the original $x$-space: one subtraction, one squared norm, one exponential. The infinite dimensionality is structural (the Taylor series of $\exp$ contains arbitrarily high-order monomials), but invisible to the algorithm. We get the expressive power of an infinite feature space at the cost of evaluating a single 2-argument function.
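Spelled out as code (a sketch assuming NumPy, with the bandwidth $\sigma$ left as a free parameter):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian/RBF kernel: no phi anywhere in sight."""
    diff = x - z                               # one subtraction
    sq_norm = np.dot(diff, diff)               # one squared norm
    return np.exp(-sq_norm / (2 * sigma**2))   # one exponential
```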
Can any function I think captures "similarity" be used as a kernel?
No. The function must correspond to an inner product in some embedding; equivalently, the Gram matrix $K$ with entries $K_{ij} = k(x_i, x_j)$ must be symmetric and positive semidefinite for any finite set of inputs (Mercer’s condition). Otherwise the dual SVM optimisation may be ill-posed (non-convex), produce nonsense predictions, or break the geometric guarantees. In practice we either construct $k$ from validated building blocks (linear, polynomial, Gaussian) using composition rules, or verify Mercer’s condition directly.
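A cheap empirical check on a finite sample follows from this: build the Gram matrix and test symmetry and (numerically) non-negative eigenvalues. Passing on one sample proves nothing; failing refutes the candidate. A sketch assuming NumPy, with a hypothetical tolerance for floating-point noise:

```python
import numpy as np

def looks_like_valid_kernel(kernel, X, tol=1e-9):
    """Necessary check on one sample: Gram matrix symmetric and PSD."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    symmetric = np.allclose(K, K.T)
    psd = np.all(np.linalg.eigvalsh(K) >= -tol)  # eigvalsh assumes symmetry
    return symmetric and psd
```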
The dual SVM's prediction is $f(x) = \operatorname{sign}\!\left( \sum_{i=1}^{n} \alpha_i y_i \, k(x_i, x) + b \right)$. Why does the prediction time depend on the number of support vectors, not the number of training examples?
Because of complementary slackness: at the optimum, $\alpha_i = 0$ for all non-support vectors. Their contribution to the sum vanishes regardless of $x$. Storing only the support vectors (with their $\alpha_i$ and $y_i$) is sufficient to make any future prediction. This is a key efficiency: an SVM trained on a large dataset typically retains only a small fraction of the points as support vectors, making prediction dramatically faster than re-evaluating against the full training set.
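In code, “storing only the support vectors” is just a filter on the optimiser’s output; a sketch assuming NumPy arrays, with a hypothetical numerical tolerance in place of exact zero:

```python
import numpy as np

def compress_model(X, y, alpha, tol=1e-8):
    """Keep only the points with alpha_i > 0; the rest never affect f(x)."""
    keep = alpha > tol  # non-support vectors have alpha_i = 0 at the optimum
    return X[keep], y[keep], alpha[keep]
```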
Related
- lagrangian — duality is what produces a kernel-friendly optimisation
- mercers-condition — when a candidate function is a valid kernel
- gaussian-kernel — most popular kernel; infinite-dimensional embedding
- polynomial-kernel — generalises the worked example above
- support-vector-machine — the canonical algorithm where the trick lives