A “trick” because we get the benefit of working in a high-dimensional embedding (sometimes infinite-dimensional) without ever computing it. Whenever an algorithm depends on the data only through inner products $\langle x_i, x_j \rangle$, we can replace those inner products with a kernel function $K(x_i, x_j)$ that returns the same value, computed entirely in the original space.

What is a Kernel Function?

A kernel function takes two points in the original input space and returns a real number:

$$K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$$

It is the inner product of the two points after they’ve been mapped into some feature space by $\phi$: $K(x, z) = \langle \phi(x), \phi(z) \rangle$. Conceptually, $K(x, z)$ measures similarity: how aligned the two vectors are in the embedded space.

The trick: we don’t have to compute $\phi(x)$ or $\phi(z)$ to evaluate $K(x, z)$. For many useful $\phi$, there’s a closed-form expression for $K$ that uses only the original-space coordinates.

Worked Example: Polynomial Kernel of Degree 2

Take $x, z \in \mathbb{R}^2$ and the embedding:

$$\phi(x) = \left(x_1^2,\; x_2^2,\; \sqrt{2}\,x_1 x_2,\; \sqrt{2}\,x_1,\; \sqrt{2}\,x_2,\; 1\right)$$

(The $\sqrt{2}$ factors are chosen to simplify the algebra below.) Computing the inner product explicitly:

$$\langle \phi(x), \phi(z) \rangle = x_1^2 z_1^2 + x_2^2 z_2^2 + 2\,x_1 x_2 z_1 z_2 + 2\,x_1 z_1 + 2\,x_2 z_2 + 1$$

This factors as $(x^\top z)^2 + 2\,x^\top z + 1 = (1 + x^\top z)^2$. So:

$$K(x, z) = (1 + x^\top z)^2 = \langle \phi(x), \phi(z) \rangle$$

Two operations in the original space (one inner product, one square) replace six multiplications in the 6-dimensional feature space. As the polynomial degree grows, the saving compounds.
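
A quick numerical sanity check, as a minimal NumPy sketch (the helpers `phi_explicit` and `poly2_kernel` are just illustrative names for the two routes above): both give the same number.

```python
import numpy as np

def phi_explicit(x):
    """Explicit 6-dimensional degree-2 feature map for x in R^2 (the embedding above)."""
    x1, x2 = x
    return np.array([x1**2, x2**2,
                     np.sqrt(2) * x1 * x2,
                     np.sqrt(2) * x1,
                     np.sqrt(2) * x2,
                     1.0])

def poly2_kernel(x, z):
    """Same inner product, computed in the original space as (1 + x.z)^2."""
    return (1.0 + x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

print(phi_explicit(x) @ phi_explicit(z))   # inner product in the feature space
print(poly2_kernel(x, z))                  # kernel in the input space -> same value
```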

Why the Dual Lets Us Use Kernels

Recall the SVM dual:

$$\max_{\alpha} \;\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j \,\langle x_i, x_j \rangle \quad \text{s.t.}\;\; 0 \le \alpha_i \le C,\;\; \sum_{i=1}^{n} \alpha_i y_i = 0$$

The data appears only through pairwise inner products $\langle x_i, x_j \rangle$. Replacing each with $K(x_i, x_j)$ makes the entire optimisation independent of $\phi$:

$$\max_{\alpha} \;\sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j \, K(x_i, x_j)$$

Same story for prediction:

$$f(x) = \operatorname{sign}\!\left(\sum_{i \in S} \alpha_i y_i \, K(x_i, x) + b\right)$$

where $S$ is the set of support-vector indices. Train and predict, all without forming $\phi(x)$ once.
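
As a minimal sketch of what this buys in practice (assuming scikit-learn and the degree-2 polynomial kernel from the worked example): we can train an SVM from the Gram matrix alone via `kernel="precomputed"`, then reproduce its decision function straight from the dual form, since `dual_coef_` stores the products $\alpha_i y_i$ for the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)     # XOR-like labelling, not linearly separable

def poly2_gram(A, B):
    """Gram matrix of the degree-2 polynomial kernel (1 + a.b)^2."""
    return (1.0 + A @ B.T) ** 2

K = poly2_gram(X, X)                            # n x n kernel matrix: all the SVM needs
clf = SVC(kernel="precomputed", C=1.0).fit(K, y)

# Prediction uses only kernel values against the support vectors:
X_new = rng.normal(size=(5, 2))
K_new = poly2_gram(X_new, X)                    # kernel values of new points vs. training points
decision = K_new[:, clf.support_] @ clf.dual_coef_.ravel() + clf.intercept_
print(np.sign(decision))                        # matches clf.predict(K_new)
```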

Kernel Functions Without an Explicit $\phi$

The deeper insight: we don’t need to design $\phi$ first. We can write down a similarity function $K(x, z)$ directly, as long as it corresponds to a valid inner product in some feature space. The condition that decides which functions qualify is Mercer’s condition (see mercers-condition).

This unlocks two things:

  1. Infinite-dimensional embeddings: the RBF kernel $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$ corresponds to an infinite-dimensional $\phi$, which we’d never be able to compute explicitly.
  2. Non-numeric data: kernels can be defined directly on strings, trees, graphs, sets, or anything else for which a valid similarity function exists. The “feature space” is implicit, never realised (a small string-kernel sketch follows this list).
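
To make point 2 concrete, here is a tiny sketch of a k-spectrum kernel on strings (the helper name `spectrum_kernel` is illustrative): the implicit feature space has one dimension per possible length-k substring, but we only ever compare counts of the substrings that actually occur.

```python
from collections import Counter

def spectrum_kernel(s, t, k=3):
    """k-spectrum kernel: inner product of length-k substring count vectors.

    The implicit feature space is indexed by every possible length-k string,
    but only substrings that actually occur contribute to the sum.
    """
    cs = Counter(s[i:i + k] for i in range(len(s) - k + 1))
    ct = Counter(t[i:i + k] for i in range(len(t) - k + 1))
    return sum(cs[sub] * ct[sub] for sub in cs.keys() & ct.keys())

print(spectrum_kernel("kernel trick", "kernel methods"))   # number of shared 3-grams (with multiplicity)
print(spectrum_kernel("abcabc", "abc"))
```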

When the Kernel Trick Pays Off

In the primal, computation per training example is $O(D)$, where $D$ is the dimension of the feature space. In the dual with a kernel, it’s $O(d)$ per kernel evaluation, where $d$ is the input dimension, times the $O(n^2)$ pairs of training examples. The dual is preferred when:

  • $D \gg n d$ (high-dimensional or infinite embedding, modest training set), or
  • $\phi$ is implicit (no closed form), or
  • the input space is non-numeric.

When $D$ is small relative to $n d$ (e.g., a low-degree polynomial on a large dataset), the primal can actually be cheaper.
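
A rough back-of-envelope illustration of the two regimes (just instantiating the cost model above with constants ignored; `poly_feature_dim` counts the monomials of degree at most p):

```python
from math import comb

def poly_feature_dim(d, p):
    """Dimension D of the explicit polynomial feature space of degree p over R^d."""
    return comb(d + p, p)               # number of monomials of degree <= p

for d, p, n in [(10, 2, 1_000_000),     # low input dimension, large dataset
                (1_000, 5, 10_000)]:    # high input dimension, modest dataset
    D = poly_feature_dim(d, p)
    primal = n * D                      # ~ one pass over the explicit features
    dual = n * n * d                    # ~ filling the n x n kernel matrix
    print(f"d={d}, p={p}, n={n}: D={D}, primal~{primal:.1e}, dual~{dual:.1e}")
```

The first case favours the primal, the second the dual, matching the conditions listed above.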

Active Recall