Functional Analysis

Compact Operators and Kernels in ML

Kernel smoothing in machine learning is a compact integral operator. Mercer's theorem guarantees its SVD. Nystrom approximation reduces O(n³) in Gaussian Processes to O(nk²), opening GP to millions of points. Compact operators are the mathematics behind most kernel methods.

**Kernel SVM (sklearn.svm.SVC):** the Gram matrix K(xᵢ,xⱼ) is a discrete analogue of a compact HS operator. The dual SVM formulation works in RKHS.
**GPyTorch (Gaussian Processes):** Nystrom via Mercer decomposition. Reduces O(n³) → O(nk²). Used for surrogate models in Bayesian optimization.
**Neural Tangent Kernel:** infinitely wide neural networks are equivalent to a GP with NTK kernel = compact operator. Training = kernel regression in RKHS.

Compact Operators: Almost Finite-Dimensional

scikit-learn's SVM with RBF kernel processes 1M+ training samples via the kernel trick. Mercer's theorem guarantees the Gaussian kernel admits SVD in infinite dimensions. GPyTorch's Gaussian Processes, used by Google for hyperparameter Bayesian optimization, are built on exactly these compact operators.

An operator T: X → Y is compact if the image of every bounded set is **precompact** (closure is compact). Equivalently: from every bounded sequence {xₙ}, one can extract a subsequence such that {Txₙₖ} converges. Compact operators are 'almost finite-dimensional': they behave like matrices.

**Compact operator**: T ∈ K(X,Y) - closure of T(B_X) is compact in Y; equivalently T = lim Tₙ with rank(Tₙ) < ∞ (limit of finite-rank operators in a Hilbert space). **Space of compact operators**: K(X) ⊂ B(X) - a closed two-sided ideal; if T is compact and S is bounded, then ST and TS are compact. This is the largest proper two-sided ideal in B(X). **Identity operator is not compact**: I: H → H is not compact when dim H = ∞ - from the orthonormal sequence {eₙ} no convergent subsequence can be extracted since ‖eₙ − eₘ‖ = √2; the unit ball in infinite dimensions is non-compact.

```python import numpy as np # Diagonal compact operator: lambda_n = 1/n -> 0 # T(x_1, x_2, ...) = (x_1/1, x_2/2, x_3/3, ...) N = 300 lambda_n = 1.0 / np.arange(1, N + 1) # lambda_n = 1/n -> 0 # Bounded sequence of random vectors np.random.seed(42) X_bounded = np.random.randn(N, 30) # 30 vectors X_bounded /= np.linalg.norm(X_bounded, axis=0) # normalize ||x_k|| = 1 TX = lambda_n[:, None] * X_bounded # image of T norms_TX = np.linalg.norm(TX, axis=0) print(f"Input norms: min={np.linalg.norm(X_bounded,axis=0).min():.3f}, max={np.linalg.norm(X_bounded,axis=0).max():.3f}") print(f"Image norms TX: max={norms_TX.max():.3f} (compactness: image 'contracted')") # Integral operator: K(x,y) = min(x,y) on [0,1] # This is the Volterra operator - compact on L^2 n_grid = 100 x = np.linspace(0, 1, n_grid) dx = x[1] - x[0] K_matrix = np.minimum(x[:, None], x[None, :]) * dx svd_vals = np.linalg.svd(K_matrix, compute_uv=False) print(f"\nIntegral operator K(x,y)=min(x,y):") print(f"Singular values -> 0: sigma_1={svd_vals[0]:.4f}, sigma_10={svd_vals[9]:.6f}, sigma_50={svd_vals[49]:.8f}") ```

**Compactness in ML:** the kernel matrix K(xᵢ, xⱼ) is a discrete version of a compact integral operator. Low-rank approximation (Nystrom method) replaces a compact operator with a finite-rank operator. sklearn.kernel_approximation.Nystroem uses exactly this.

The identity operator I is not compact on l². The diagonal T(xₙ) = xₙ/n is compact. What is the fundamental difference, given that both are bounded?

Boundedness does not imply compactness. For I: {eₙ} is a bounded sequence with no convergent subsequence (‖eₙ−eₘ‖ = √2 for all n≠m) - the unit ball in l² is non-compact. For T: Teₙ = eₙ/n → 0 in norm, so every bounded sequence under T has coordinates 'squeezed' to zero in distant components, guaranteeing a convergent subsequence. The key is λₙ = 1/n → 0.

Riesz-Schauder Theorem: Spectrum of a Compact Operator

The spectrum of an arbitrary operator can be any closed subset of C. Compact operators are an exception: their spectrum looks almost like that of a matrix, just of infinite size.

```python import numpy as np # Riesz-Schauder theorem: spectrum of integral operator # K(x,y) = min(x,y) on [0,1]: eigenvalues ~= 1/(pi*(n-0.5))^2 n_grid = 200 x = np.linspace(0, 1, n_grid) dx = x[1] - x[0] K = np.minimum(x[:, None], x[None, :]) * dx # discretization evals, evecs = np.linalg.eigh(K) idx = np.argsort(np.abs(evals))[::-1] evals = evals[idx] evecs = evecs[:, idx] print("Spectrum of compact integral operator K(x,y) = min(x,y):") print(" lambda_n -> 0 (compactness!):") for n in range(1, 6): theory = 1 / (np.pi * (n - 0.5))**2 print(f" lambda_{n} = {evals[n-1]:.6f} (theory: {theory:.6f})") print(f"\n lambda_10 = {evals[9]:.8f} (decaying to zero)") print(f" lambda_50 = {evals[49]:.10f}") # Orthonormality of eigenfunctions G = evecs[:, :5].T @ evecs[:, :5] * dx print(f"\nOrthonormality of 5 eigenfunctions: ||V^TV - I|| = {np.max(np.abs(G - np.eye(5))):.2e}") ```

**Hierarchy of compact operators:** finite rank ⊂ nuclear (trace-class, Σσₙ < ∞) ⊂ Hilbert-Schmidt (Σσₙ² < ∞) ⊂ compact. Kernel SVM Gram matrix is Hilbert-Schmidt. Neural network NTK (Neural Tangent Kernel) is a compact operator.

A compact self-adjoint T has countable spectrum {λₙ} with λₙ → 0. How is T written in terms of eigenvalues and eigenfunctions? What is the analogy with matrix SVD?

For a compact self-adjoint operator T, the spectral theorem gives T = Σₙ λₙ⟨·, eₙ⟩eₙ with orthonormal eigenfunctions {eₙ} and λₙ → 0. This is the infinite-dimensional SVD: compare with M = Σᵢ σᵢ uᵢvᵢᵀ where for self-adjoint operators left and right singular vectors coincide (eₙ = uₙ = vₙ). The rank-k approximation Tₖ = Σ_{n≤k} λₙ⟨·,eₙ⟩eₙ is the analogue of truncated SVD.

Hilbert-Schmidt Operators and Mercer's Theorem

A Hilbert-Schmidt (HS) operator is a compact operator with an additional condition: the Frobenius norm (generalized to infinite dimensions) is finite. All integral operators with K ∈ L²(Ω×Ω) are HS operators. Mercer's theorem is their spectral theorem.

```python import numpy as np # Mercer's theorem: kernel decomposition via eigenfunctions # Gaussian kernel K(x,y) = exp(-|x-y|^2 / 2*sigma^2) on a discrete grid np.random.seed(42) N = 100 x = np.linspace(0, 5, N) sigma = 1.0 # Kernel matrix (Gram matrix) K_gram = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2)) # Mercer decomposition: K = V Lambda V^T evals, V = np.linalg.eigh(K_gram) # Sort descending idx = np.argsort(evals)[::-1] evals = evals[idx] V = V[:, idx] # Verify this is an SVD of the kernel print(f"Kernel matrix eigenvalues: {np.round(evals[:5], 2)}") print(f"All lambda_n >= 0 (PSD): {np.all(evals > -1e-10)}") # Nystrom approximation: replace compact operator with finite-rank for k in [5, 20, 50]: K_approx = (V[:, :k] * evals[None, :k]) @ V[:, :k].T error = np.linalg.norm(K_gram - K_approx, 'fro') / np.linalg.norm(K_gram, 'fro') print(f"Nystrom k={k}: relative error = {error:.4f}") # k=5: 0.28, k=20: 0.08, k=50: 0.01 - fast decay! ```

**Gaussian Processes in PyTorch (GPyTorch):** the covariance kernel K(x,y) is an HS operator. GP prediction = posterior conditional = kernel matrix inversion O(n³). Nystrom approximation via Mercer decomposition reduces this to O(nk²). This is how GPyTorch is implemented for large datasets.

Mercer's theorem states K(x,y) = Σλₙφₙ(x)φₙ(y). Explain the connection between this decomposition and Nystrom approximation in kernel ML.

Mercer's theorem gives K(x,y) = Σₙ λₙφₙ(x)φₙ(y) - the exact SVD of the kernel as an operator. The rank-k truncation K_k(x,y) = Σ_{n≤k} λₙφₙ(x)φₙ(y) is a finite-rank operator. Nystrom approximation implements this by selecting k landmark points {xᵢ} to estimate φₙ on a discrete grid - replacing an n×n Gram matrix (O(n²)) with n×k factors (O(nk)). The approximation error equals the tail Σ_{n>k} λₙ², which is small when the spectrum decays rapidly.

Key Ideas

**Compact T**: bounded → precompact. Equivalently: limit of finite-rank operators. I is not compact.
**Riesz-Schauder**: σ(T)\{0} = σ_p(T)\{0}, spectrum is countable, accumulates at 0. Multiplicities are finite.
**Fredholm alternative**: (I-T)u=f is solvable ↔ f ⊥ ker(I-T*). Infinite-dimensional Kronecker-Capelli.
**HS operator**: ‖T‖²_HS = Σ‖Teₙ‖² = Tr(T*T) = Σσₙ². Integral operator with K ∈ L² is HS.
**Mercer's theorem**: K(x,y) = Σλₙφₙ(x)φₙ(y) - SVD of the kernel. Foundation of Nystrom, kernel PCA, GP.

Вопросы для размышления

Nystrom approximation selects k landmark points randomly or by importance sampling. How does Mercer's theorem explain why choosing points from high-density data regions is important?
The Neural Tangent Kernel of infinitely wide networks is a compact operator. How does this explain the double descent phenomenon: why does an over-parameterized network with sufficient width generalize again?
Riesz-Schauder says the spectrum of a compact operator is countable. For an integral operator with smooth kernel K ∈ C^∞, eigenvalues decay rapidly. How does the rate of decay of λₙ relate to the smoothness of K?

Связанные уроки

la-13-eigenvectors