— the kernel whose feature map is all monomials in the inputs of total degree $\le d$, with appropriate weights. Useful when you suspect the data has polynomial structure, especially in computer vision, where polynomial features model image-feature interactions.
Definition and Embedding
For $d = 2$ and $x \in \mathbb{R}^2$, the implicit feature map is:

$$\phi(x) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right)$$

— a 6-dimensional vector that contains every monomial of degree $\le 2$, weighted to make the inner product factor cleanly. The kernel computes that inner product as $K(x, x') = (x^\top x' + 1)^2$, replacing six multiplications with two operations in the input space.
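To see the factoring concretely, here is a quick numerical check of the identity (a minimal NumPy sketch; `phi` is an illustrative helper spelling out the six weighted monomials):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input: all monomials up to degree 2."""
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1]])

x  = np.array([0.5, -1.2])
xp = np.array([2.0,  0.3])

explicit = phi(x) @ phi(xp)      # inner product in the 6-d feature space
kernel   = (x @ xp + 1.0) ** 2   # same number, computed in the 2-d input space

assert np.isclose(explicit, kernel)
```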
For general $d$ and $n$-dimensional inputs, the embedding has $\binom{n+d}{d}$ dimensions — combinatorial in $n$ and $d$. The kernel evaluation, by contrast, is always one inner product plus one exponentiation.
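The growth of that dimension is easy to tabulate with Python's `math.comb` (a small illustration):

```python
from math import comb

# Embedding dimension comb(n + d, d) for n-dimensional input, degree d
for n, d in [(2, 2), (10, 3), (100, 5)]:
    print(f"n={n:>3}, d={d}: {comb(n + d, d):,} feature dimensions")
# n=  2, d=2: 6 feature dimensions
# n= 10, d=3: 286 feature dimensions
# n=100, d=5: 96,560,646 feature dimensions
```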
Validity (Why It’s a Kernel)
Apply the composition rules: the linear kernel $k(x, x') = x^\top x'$ is valid. Adding the constant $1$ (a degenerate kernel: $k(x, x') = 1$ always) keeps it valid. Raising a valid kernel to a non-negative integer power $d$ keeps it valid (rule: a polynomial with non-negative coefficients applied to a valid kernel is valid). So $K(x, x') = (x^\top x' + 1)^d$ is valid for any $d \in \mathbb{N}$.
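An empirical spot-check of validity, not a proof: build a Gram matrix from random points and confirm it is positive semi-definite (a NumPy sketch; the tolerance is an assumption to absorb floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # 50 random points in R^3
d = 3

K = (X @ X.T + 1.0) ** d         # Gram matrix of the polynomial kernel
eigvals = np.linalg.eigvalsh(K)  # eigvalsh: K is symmetric

# All eigenvalues should be >= 0, up to floating-point noise
assert eigvals.min() > -1e-8 * np.abs(eigvals).max()
```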
When to Use
- Polynomial structure suspected: image features (where products of pixel intensities encode texture), interactions between numeric variables.
- Moderate degree: $d = 2$ or $3$ is common. Higher degrees overfit aggressively in high-dimensional input.
- As a sanity-check baseline before reaching for a Gaussian kernel; a minimal sketch follows this list.
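A baseline sketch with scikit-learn's `SVC` (the dataset and hyperparameters are illustrative only; `coef0=1` matches the $(x^\top x' + 1)^d$ form used above):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale first: the table below notes the polynomial kernel is scale-sensitive
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="poly", degree=3, coef0=1.0))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```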
The trade-off versus the Gaussian kernel:
| | Polynomial kernel | Gaussian kernel |
|---|---|---|
| Embedding dim | Finite, $\binom{n+d}{d}$ | Infinite |
| Hyperparameters | Degree $d$ | Bandwidth $\sigma$ |
| Boundary shape | Polynomial in $x$ | Any smooth shape |
| Built-in regularisation | Yes (degree caps complexity) | No (need explicit regularisation) |
| Sensitive to feature scale | Yes | Yes (more so) |
Active Recall
For 100-dimensional input, the polynomial kernel of degree 5 has roughly $\binom{105}{5} \approx 10^8$ implicit feature dimensions. How is this different in cost from running an SVM in that explicit feature space?
In the primal, you’d need to materialise every $\phi(x_n)$ — $N \times 10^8$ numbers — and solve a QP with $\sim 10^8$ variables. Memory and time both blow up. With the kernel, you only ever evaluate $(x^\top x' + 1)^5$ — one 100-dim inner product plus one exponentiation. The dual QP has $N$ variables (one per training point), independent of polynomial degree. The kernel trick converts what would be a $\binom{n+d}{d}$-dimensional optimisation into an $N$-dimensional one.
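The arithmetic behind that answer, as a short sketch (`n`, `d` are the values from the question; `N` is an assumed training-set size):

```python
from math import comb
import numpy as np

n, d, N = 100, 5, 1000       # input dim, degree, assumed training-set size

print(comb(n + d, d))        # 96560646 -- primal QP variables
print(N)                     # 1000     -- dual QP variables

rng = np.random.default_rng(0)
x, xp = rng.normal(size=n), rng.normal(size=n)
k = (x @ xp + 1.0) ** d      # one 100-dim inner product + one exponentiation
```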
Related
- kernel-trick — what makes the polynomial embedding tractable
- non-linear-transformation — explicit polynomial basis expansion is the primal version
- gaussian-kernel — alternative non-linear kernel; richer but unbounded in expressiveness
- mercers-condition — validity proof relies on the composition rules