Statistical Learning Theory

Reproducing Kernel Hilbert Spaces (RKHS)

In 2001 Scholkopf and Smola showed that the RBF kernel implicitly works in an infinite-dimensional space of analytic functions. SVM with RBF kernel won the MNIST competition in 1998 with 0.8% error -- the best result before deep learning.

Gaussian Process Regression in Bayesian Optimization (Google Vizier, 2017) uses RKHS structure for optimality guarantees.
GPyTorch from Cornell (2018) scales KRR to 10^6 points via structured approximations.

Mercer's theorem and RKHS structure

In 2001 Bernhard Scholkopf and Alex Smola published 'Learning with Kernels', extending Mercer's 1909 results for machine learning. The key fact: any continuous symmetric positive-definite kernel k(x,x') is an inner product in some Hilbert feature space. The Gaussian Process in Scikit-learn uses an RBF kernel implicitly implementing RKHS. In 2024 Google Brain proved that transformers compute approximate kernel regressions.

The reproducing property of RKHS states that...

Representer theorem and kernel ridge regression

The representer theorem (Kimeldorf and Wahba 1971) states: the minimizer of a regularized functional in RKHS lies in the finite-dimensional subspace spanned by kernel functions at training points. This converts an infinite-dimensional problem into a system of n linear equations.

The representer theorem allows us to...

Key results

RKHS: function space with reproducing property f(x) = <f, k(x,.)>.
Mercer: p.d. kernel = eigenfunction expansion of integral operator.
Representer theorem: RKHS optimum lies in span{k(x_1,.), ..., k(x_n,.)}.
KRR: alpha = (K + n*lam*I)^{-1} y -- closed form in O(n^3).