Statistical Learning Theory

Kernel Methods and RKHS

NTK theory in 2018 showed that an infinitely wide neural network is a kernel machine. This cast the Zhang (2017) generalization paradox in a new light. Google uses kernel methods in production (Gaussian Processes for hyperparameter optimization). SVM is the standard for malware detection at Microsoft. Understanding kernels means understanding the foundation of both classical ML and deep learning theory.

**Gaussian Processes in Google Vizier:** Bayesian optimization for automatic hyperparameter selection in TensorFlow/JAX models. Prior = GP with RBF kernel.
**SVM for malware detection:** Kaspersky and Microsoft Defender use SVM with custom kernels on byte n-grams to classify PE files.
**NTK and scaling laws:** Chinchilla scaling laws (DeepMind 2022) are partly explained by NTK theory - loss predictions through the kernel eigenspectrum.
**Kernel PCA:** used in dimensionality reduction for non-linear data where standard PCA fails; applied in genomics and chemistry for molecular fingerprints.

Предварительные знания

lt-16-boosting
lt-04-vc-dimension

Kernel Trick: Infinite Feature Maps Without Computation

**An infinitely wide neural network trains as a kernel machine.** Neural Tangent Kernel (NTK) theory (2018) showed: at random initialization a network in the infinite-width limit is kernel regression with a fixed kernel. Understanding kernel methods means understanding the limit of neural networks.

Kernel trick: for any algorithm that uses only dot products ⟨φ(x), φ(x')⟩, you never need to compute φ(x) explicitly. It suffices to have k(x, x') = ⟨φ(x), φ(x')⟩. If φ maps to an infinite-dimensional space (all polynomial monomials) - direct computation is impossible. Kernel trick: O(n²) instead of O(n·∞).

**Popular kernels and their feature maps:** **Polynomial kernel:** k(x,x') = (xᵀx' + c)^d φ(x) = all monomials of degree ≤ d **RBF/Gaussian kernel:** k(x,x') = exp(−||x−x'||²/(2σ²)) φ(x) = infinite-dimensional vector (Taylor expansion of exp) **Neural Tangent Kernel:** k_NTK(x,x') = E[∇_θf(x,θ)ᵀ·∇_θf(x',θ)] φ(x) = gradient of the network with respect to its parameters **Mercer's condition:** k is a valid kernel ⟺ k is symmetric and positive semi-definite. ∀x₁,...,xₙ: the Gram matrix K with Kᵢⱼ = k(xᵢ,xⱼ) is PSD.

The RBF kernel corresponds to an infinite-dimensional feature map. How does learning remain computationally feasible? What determines the computational complexity?

RKHS and the Representer Theorem

A **Reproducing Kernel Hilbert Space (RKHS)** is a function space H_k with norm ||f||_k defined by the kernel k. Key property: f(x) = ⟨f, k(·,x)⟩_k for all f ∈ H_k - the 'reproducing' property. Minimizing a loss plus ||f||² in H_k has an exact finite-dimensional solution.

**Representer Theorem (Kimeldorf-Wahba 1971):** Given the problem: min_{f ∈ H_k} [Σᵢ L(yᵢ, f(xᵢ)) + λ||f||²_k] The optimal solution has the form: f*(x) = Σᵢ αᵢ·k(xᵢ, x) Only n basis functions k(xᵢ, ·) - regardless of the dimension of H_k! Substituting into the problem: min_α [L(y, Kα) + λ·αᵀKα] where K is the kernel matrix, Kᵢⱼ = k(xᵢ, xⱼ). For kernel ridge regression: α = (K + λI)⁻¹y Prediction: f(x_new) = kᵀα = Σᵢ αᵢ·k(xᵢ, x_new)

Gaussian Processes are the Bayesian interpretation of kernel regression. Prior: f ~ GP(0, k). The posterior mean equals kernel ridge regression with α = (K + σ²I)⁻¹y. The posterior variance gives confidence intervals. Used in Bayesian optimization (AutoML, hyperparameter tuning).

The representer theorem says the optimal f in RKHS is a linear combination of n functions k(xᵢ, ·). How does this connect to SVM - why is the SVM solution determined only by the support vectors?

Neural Tangent Kernel: Neural Networks as Kernel Machines

**NTK theorem (Jacot-Gabriel-Hongler, 2018):** at random initialization with infinite width, a neural network trained by gradient descent evolves so that the kernel does not change - it remains the fixed NTK K_∞(x,x'). In this regime a neural network is equivalent to kernel regression with NTK. This explains part of the double-descent and generalization paradox.

**Neural Tangent Kernel:** K_NTK(x,x') = Σ_l ⟨∇_{θ_l} f(x,θ), ∇_{θ_l} f(x',θ)⟩ where θ_l are the parameters of layer l. Under gradient descent: dθ/dt = −∇_θ L = −Kᵀ(f(X,θ) − y) At infinite width: K_NTK is fixed (does not change during training)! This makes the dynamics a linear ODE with a known closed-form solution. **Practical significance:** - Explains the lazy training regime (small lr, large width) - NTK poorly approximates hierarchical features - Real neural networks operate in the feature learning regime (the kernel changes) - This is the gap between NTK theory and practice

NTK theory predicts that an infinite-width neural network equals a kernel machine. Why do real finite-width networks outperform kernel machines in practice? What makes feature learning important?

Итоги

**Kernel trick:** k(x,x') = ⟨φ(x),φ(x')⟩ - dot product in feature space without explicit φ. RBF kernel: infinite-dimensional φ, O(n²) computation.
**Mercer's condition:** k is a valid kernel ⟺ the Gram matrix K is PSD for any set of points. Check: all eigenvalues ≥ 0.
**Representer theorem:** the optimal f in RKHS = Σᵢ αᵢ·k(xᵢ,·). Only n components! α = (K + λI)⁻¹y for kernel ridge regression.
**NTK:** K_NTK(x,x') = Σ_l ⟨∇_{θ_l} f(x), ∇_{θ_l} f(x')⟩. Fixed at infinite width. Network → kernel regression.
**Gaussian Processes:** Bayesian interpretation of kernel regression. Posterior mean = KRR. Posterior variance = uncertainty. Application: Bayesian optimization.

Связанные уроки

lt-13-deep-generalization — NTK explains deep learning through kernel methods
lt-04-vc-dimension — RKHS has infinite dimension yet finite generalization
lt-16-boosting — Both combine weak predictors into a strong one