— the kernel whose feature map is all monomials in the inputs of total degree $\le d$, with appropriate weights. Useful when you suspect the data has polynomial structure, especially in computer vision, where polynomial features model image-feature interactions.
Definition and Embedding
For $d = 2$ and $x \in \mathbb{R}^2$, the implicit feature map is:

$$\phi(x) = \left(1,\ \sqrt{2}\,x_1,\ \sqrt{2}\,x_2,\ x_1^2,\ x_2^2,\ \sqrt{2}\,x_1 x_2\right)$$

— a 6-dimensional vector that contains every monomial of degree $\le 2$, weighted to make the inner product factor cleanly. The kernel computes that inner product as $K(x, x') = (x^\top x' + 1)^2$, replacing six multiplications with two operations in the input space.
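To see the factoring concretely, here is a quick numerical check of the identity (a minimal NumPy sketch; `phi` is an illustrative helper spelling out the six weighted monomials):

```python
import numpy as np

def phi(x):
    """Explicit degree-2 feature map for 2-d input: all monomials up to degree 2."""
    return np.array([1.0,
                     np.sqrt(2) * x[0], np.sqrt(2) * x[1],
                     x[0]**2, x[1]**2,
                     np.sqrt(2) * x[0] * x[1]])

x  = np.array([0.5, -1.2])
xp = np.array([2.0,  0.3])

explicit = phi(x) @ phi(xp)      # inner product in the 6-d feature space
kernel   = (x @ xp + 1.0) ** 2   # same number, computed in the 2-d input space

assert np.isclose(explicit, kernel)
```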
For general $d$ and $n$-dimensional inputs, the embedding has $\binom{n+d}{d}$ dimensions — combinatorial in $n$ and $d$. The kernel evaluation, by contrast, is always one inner product plus one exponentiation.
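The growth of that dimension is easy to tabulate with Python's `math.comb` (a small illustration):

```python
from math import comb

# Embedding dimension comb(n + d, d) for n-dimensional input, degree d
for n, d in [(2, 2), (10, 3), (100, 5)]:
    print(f"n={n:>3}, d={d}: {comb(n + d, d):,} feature dimensions")
# n=  2, d=2: 6 feature dimensions
# n= 10, d=3: 286 feature dimensions
# n=100, d=5: 96,560,646 feature dimensions
```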
Validity (Why It’s a Kernel)
Apply the composition rules: the linear kernel $k(x, x') = x^\top x'$ is valid. Adding the constant $1$ (a degenerate kernel: $k(x, x') = 1$ always) keeps it valid. Raising a valid kernel to a non-negative integer power $d$ keeps it valid (rule: a polynomial with non-negative coefficients applied to a valid kernel is valid). So $K(x, x') = (x^\top x' + 1)^d$ is valid for any $d \in \mathbb{N}$.
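An empirical spot-check of validity, not a proof: build a Gram matrix from random points and confirm it is positive semi-definite (a NumPy sketch; the tolerance is an assumption to absorb floating-point noise):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))     # 50 random points in R^3
d = 3

K = (X @ X.T + 1.0) ** d         # Gram matrix of the polynomial kernel
eigvals = np.linalg.eigvalsh(K)  # eigvalsh: K is symmetric

# All eigenvalues should be >= 0, up to floating-point noise
assert eigvals.min() > -1e-8 * np.abs(eigvals).max()
```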
When to Use
- Polynomial structure suspected: image features (where products of pixel intensities encode texture), interactions between numeric variables.
- Moderate degree: $d = 2$ or $3$ is common. Higher degrees overfit aggressively in high-dimensional input.
- As a sanity-check baseline before reaching for a Gaussian kernel; a minimal sketch follows this list.
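A baseline sketch with scikit-learn's `SVC` (the dataset and hyperparameters are illustrative only; `coef0=1` matches the $(x^\top x' + 1)^d$ form used above):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Scale first: the table below notes the polynomial kernel is scale-sensitive
clf = make_pipeline(StandardScaler(),
                    SVC(kernel="poly", degree=3, coef0=1.0))
clf.fit(X_tr, y_tr)
print(f"test accuracy: {clf.score(X_te, y_te):.2f}")
```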
The trade-off versus the Gaussian kernel:
| | Polynomial kernel | Gaussian kernel |
|---|---|---|
| Embedding dim | Finite, $\binom{n+d}{d}$ | Infinite |
| Hyperparameters | Degree $d$ | Bandwidth $\sigma$ |
| Boundary shape | Polynomial in $x$ | Any smooth shape |
| Built-in regularisation | Yes (degree caps complexity) | No (need explicit regularisation) |
| Sensitive to feature scale | Yes | Yes (more so) |
Active Recall
For 100-dimensional input, the polynomial kernel of degree 5 has roughly $\binom{105}{5} \approx 10^8$ implicit feature dimensions. How is this different in cost from running an SVM in that explicit feature space?
In the primal, you’d need to materialise every $\phi(x_n)$ — $N \times 10^8$ numbers — and solve a QP with $\sim 10^8$ variables. Memory and time both blow up. With the kernel, you only ever evaluate $(x^\top x' + 1)^5$ — one 100-dim inner product plus one exponentiation. The dual QP has $N$ variables (one per training point), independent of polynomial degree. The kernel trick converts what would be a $\binom{n+d}{d}$-dimensional optimisation into an $N$-dimensional one.
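The arithmetic behind that answer, as a short sketch (`n`, `d` are the values from the question; `N` is an assumed training-set size):

```python
from math import comb
import numpy as np

n, d, N = 100, 5, 1000       # input dim, degree, assumed training-set size

print(comb(n + d, d))        # 96560646 -- primal QP variables
print(N)                     # 1000     -- dual QP variables

rng = np.random.default_rng(0)
x, xp = rng.normal(size=n), rng.normal(size=n)
k = (x @ xp + 1.0) ** d      # one 100-dim inner product + one exponentiation
```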
Related
- kernel-trick — what makes the polynomial embedding tractable
- non-linear-transformation — explicit polynomial basis expansion is the primal version
- gaussian-kernel — alternative non-linear kernel; richer but unbounded in expressiveness
- mercers-condition — validity proof relies on the composition rules