Measure Theory
Duality and the Riesz Representation Theorem
How does a neural network 'see' data? What is an SVM kernel geometrically? Duality theory answers: each observation is a functional in the hypothesis space, the dual optimization finds coefficients for these functionals, and RKHS makes this space concrete and computable.
- **SVMs and kernel methods:** the kernel trick works because SVMs operate in an RKHS where evaluation functionals are bounded; the dual variables αᵢ are weights on these functionals
- **Neural Tangent Kernel:** infinitely wide networks train like kernel methods in an RKHS, providing theoretical convergence and generalization guarantees
- **Gaussian processes:** a GP is a measure on a function space, and its kernel generates an RKHS in which the GP searches for the optimum
Предварительные знания
Dual Space and Bounded Linear Functionals
Every normed space X has a **dual space** X* - the collection of all bounded linear functionals on X. Understanding duality is the key to interpreting dual optimization problems and to grasping what a 'linear model' means in an infinite-dimensional feature space.
**Bounded linear functional:** a linear map φ: X → ℝ is bounded if: ‖φ‖_{X*} = sup { |φ(f)| : ‖f‖_X ≤ 1 } < ∞ **Dual space X*** = { all bounded linear functionals on X } with the operator norm ‖φ‖_{X*}. X* is always complete (Banach), even if X is not.
**Dual problem in optimization:** in SVMs the primal problem (finding a hyperplane) transforms into the dual (finding support vectors). This is duality theory in action: each support vector corresponds to a functional in the dual of the feature space, and the dual variables αᵢ are its coordinates.
The norm of a dual functional ‖φ‖_{X*} = sup_{‖f‖≤1} |φ(f)| measures the 'sensitivity' of the functional to unit perturbations. In ML this quantifies how much the model's prediction can change in response to a unit-norm perturbation of the input.
Is the functional φ(f) = f(0) bounded on L²([0,1])?
The Dual of Lp is Lq
The main result on Lp duality: the dual of Lp is Lq, where 1/p + 1/q = 1. Every bounded linear functional on Lp has an integral representation via a function in Lq.
**Riesz Representation Theorem (for Lp):** let 1 ≤ p < ∞ and 1/p + 1/q = 1. Then: (Lp(μ))* ≅ Lq(μ) More precisely: for every φ ∈ (Lp)* there exists a unique g ∈ Lq such that: φ(f) = ∫ f · g dμ for all f ∈ Lp and ‖φ‖_{(Lp)*} = ‖g‖_q. **Exception:** (L¹)* = L∞, but (L∞)* strictly contains L¹.
**Special case p=2:** L² is self-dual! Its dual is again L². This follows from L² being a Hilbert space: the inner product ⟨f,g⟩ = ∫fg dμ provides the isomorphism with the dual. This property sets L² apart from all other Lp spaces.
**Statistics application:** in ordinary least squares regression (an L²-problem), the estimator β = (X^TX)^{-1}X^Ty is the dual-space solution. The primal problem lives in the parameter space; the dual lives in the observation space. The matrix X^TX mediates between them.
The dual space of L³([0,1]) is:
Reproducing Kernel Hilbert Spaces (RKHS)
The Riesz theorem for L² has an important extension: in special Hilbert spaces called **RKHS** (Reproducing Kernel Hilbert Spaces), the point evaluation functional f ↦ f(x₀) becomes bounded. This makes RKHS the natural function space for machine learning theory.
**RKHS:** a Hilbert space H of functions f: X → ℝ is called a reproducing kernel Hilbert space if for every x ∈ X the evaluation functional φ_x: f ↦ f(x) is bounded on H. By the Riesz theorem: there exists a unique **reproducing kernel** k: X×X → ℝ such that: f(x) = ⟨f, k(x,·)⟩_H for all f ∈ H, x ∈ X The function k(x,·) is the 'dual object' corresponding to the point x in H.
**Representer Theorem:** the optimal solution in RKHS when minimizing the regularized functional Σ L(f(xᵢ), yᵢ) + λ‖f‖²_H has the form f* = Σᵢ αᵢ k(xᵢ, ·). This is the theoretical justification of the kernel trick in SVMs: instead of working in the infinite-dimensional H, it suffices to solve a finite-dimensional system over the training points.
RKHS is the bridge between measure theory and practical ML. Examples: the Gaussian kernel generates the space of infinitely differentiable functions; the polynomial kernel generates polynomial functions; the linear kernel gives ordinary Euclidean space. Choosing a kernel = choosing a hypothesis space.
Why is f ↦ f(x₀) bounded in an RKHS but not in L²?
Functional Analysis of Neural Networks
Dual space theory actively informs our understanding of neural networks. In the infinite-width limit, neural networks connect to RKHS via the Neural Tangent Kernel, and neural network regularization can be interpreted through norms in function spaces.
**Neural Tangent Kernel (NTK):** as network width → ∞ with small learning rate, the network behaves like a kernel method in an RKHS with kernel: k_NTK(x, x') = E_θ[∇_θ f(x,θ) · ∇_θ f(x',θ)] Training is equivalent to minimizing the squared loss in this RKHS, and the weights converge to the kernel regression solution.
**Barron norm:** for single-hidden-layer networks a specific norm is defined that controls the approximation capacity. Functions with small Barron norm can be approximated by a network with polynomially many neurons, independent of input dimension - a theoretical explanation for why neural networks escape the curse of dimensionality.
The dual space in ML is the space of 'data valuations'. In kernel methods each training point xᵢ contributes a functional k(xᵢ, ·), and the dual variables αᵢ are their weights. Support vectors in SVMs are precisely the training points whose αᵢ are nonzero - the most informative functionals.
The Neural Tangent Kernel connects infinitely wide neural networks to:
Key Ideas
- **Dual space X*:** bounded linear functionals on X; the dual of Lp is Lq for 1/p+1/q=1; L² is self-dual
- **Riesz theorem for Lp:** every bounded functional on Lp has the form φ(f) = ∫fg dμ for a unique g ∈ Lq
- **RKHS:** a Hilbert space where evaluation functionals are bounded; the reproducing kernel k implements f(x) = ⟨f, k(x,·)⟩
- **Applications:** SVM kernel trick, Neural Tangent Kernel, Bayesian optimization, the representer theorem
Related Topics
Duality connects Lp spaces to optimization and ML:
- Lp Spaces — The dual of Lp is Lq - a direct consequence of the Lp norm structure
- Product Measures and Fubini — The integral representation φ(f) = ∫fg dμ connects to the theory of double integrals
Вопросы для размышления
- Why is L² self-dual while L¹ is not? What does self-duality mean geometrically for the unit ball?
- In SVMs the dual variables αᵢ measure the 'importance' of each training point. How does this relate to the notion of a bounded functional in the dual space?
- The Gaussian RKHS contains infinitely differentiable functions. How does this constrain what an infinitely wide neural network can learn in the NTK regime?