Measure Theory

Duality and the Riesz Representation Theorem

How does a neural network 'see' data? What is an SVM kernel geometrically? Duality theory answers: each observation is a functional in the hypothesis space, the dual optimization finds coefficients for these functionals, and RKHS makes this space concrete and computable.

  • **SVMs and kernel methods:** the kernel trick works because SVMs operate in an RKHS where evaluation functionals are bounded; the dual variables αᵢ are weights on these functionals
  • **Neural Tangent Kernel:** infinitely wide networks train like kernel methods in an RKHS, providing theoretical convergence and generalization guarantees
  • **Gaussian processes:** a GP is a measure on a function space, and its kernel generates an RKHS in which the GP searches for the optimum

Предварительные знания

  • Lp Spaces

Dual Space and Bounded Linear Functionals

Every normed space X has a **dual space** X* - the collection of all bounded linear functionals on X. Understanding duality is the key to interpreting dual optimization problems and to grasping what a 'linear model' means in an infinite-dimensional feature space.

**Bounded linear functional:** a linear map φ: X → ℝ is bounded if: ‖φ‖_{X*} = sup { |φ(f)| : ‖f‖_X ≤ 1 } < ∞ **Dual space X*** = { all bounded linear functionals on X } with the operator norm ‖φ‖_{X*}. X* is always complete (Banach), even if X is not.

**Dual problem in optimization:** in SVMs the primal problem (finding a hyperplane) transforms into the dual (finding support vectors). This is duality theory in action: each support vector corresponds to a functional in the dual of the feature space, and the dual variables αᵢ are its coordinates.

The norm of a dual functional ‖φ‖_{X*} = sup_{‖f‖≤1} |φ(f)| measures the 'sensitivity' of the functional to unit perturbations. In ML this quantifies how much the model's prediction can change in response to a unit-norm perturbation of the input.

Is the functional φ(f) = f(0) bounded on L²([0,1])?

The Dual of Lp is Lq

The main result on Lp duality: the dual of Lp is Lq, where 1/p + 1/q = 1. Every bounded linear functional on Lp has an integral representation via a function in Lq.

**Riesz Representation Theorem (for Lp):** let 1 ≤ p < ∞ and 1/p + 1/q = 1. Then: (Lp(μ))* ≅ Lq(μ) More precisely: for every φ ∈ (Lp)* there exists a unique g ∈ Lq such that: φ(f) = ∫ f · g dμ for all f ∈ Lp and ‖φ‖_{(Lp)*} = ‖g‖_q. **Exception:** (L¹)* = L∞, but (L∞)* strictly contains L¹.

**Special case p=2:** L² is self-dual! Its dual is again L². This follows from L² being a Hilbert space: the inner product ⟨f,g⟩ = ∫fg dμ provides the isomorphism with the dual. This property sets L² apart from all other Lp spaces.

**Statistics application:** in ordinary least squares regression (an L²-problem), the estimator β = (X^TX)^{-1}X^Ty is the dual-space solution. The primal problem lives in the parameter space; the dual lives in the observation space. The matrix X^TX mediates between them.

The dual space of L³([0,1]) is:

Reproducing Kernel Hilbert Spaces (RKHS)

The Riesz theorem for L² has an important extension: in special Hilbert spaces called **RKHS** (Reproducing Kernel Hilbert Spaces), the point evaluation functional f ↦ f(x₀) becomes bounded. This makes RKHS the natural function space for machine learning theory.

**RKHS:** a Hilbert space H of functions f: X → ℝ is called a reproducing kernel Hilbert space if for every x ∈ X the evaluation functional φ_x: f ↦ f(x) is bounded on H. By the Riesz theorem: there exists a unique **reproducing kernel** k: X×X → ℝ such that: f(x) = ⟨f, k(x,·)⟩_H for all f ∈ H, x ∈ X The function k(x,·) is the 'dual object' corresponding to the point x in H.

**Representer Theorem:** the optimal solution in RKHS when minimizing the regularized functional Σ L(f(xᵢ), yᵢ) + λ‖f‖²_H has the form f* = Σᵢ αᵢ k(xᵢ, ·). This is the theoretical justification of the kernel trick in SVMs: instead of working in the infinite-dimensional H, it suffices to solve a finite-dimensional system over the training points.

RKHS is the bridge between measure theory and practical ML. Examples: the Gaussian kernel generates the space of infinitely differentiable functions; the polynomial kernel generates polynomial functions; the linear kernel gives ordinary Euclidean space. Choosing a kernel = choosing a hypothesis space.

Why is f ↦ f(x₀) bounded in an RKHS but not in L²?

Functional Analysis of Neural Networks

Dual space theory actively informs our understanding of neural networks. In the infinite-width limit, neural networks connect to RKHS via the Neural Tangent Kernel, and neural network regularization can be interpreted through norms in function spaces.

**Neural Tangent Kernel (NTK):** as network width → ∞ with small learning rate, the network behaves like a kernel method in an RKHS with kernel: k_NTK(x, x') = E_θ[∇_θ f(x,θ) · ∇_θ f(x',θ)] Training is equivalent to minimizing the squared loss in this RKHS, and the weights converge to the kernel regression solution.

**Barron norm:** for single-hidden-layer networks a specific norm is defined that controls the approximation capacity. Functions with small Barron norm can be approximated by a network with polynomially many neurons, independent of input dimension - a theoretical explanation for why neural networks escape the curse of dimensionality.

The dual space in ML is the space of 'data valuations'. In kernel methods each training point xᵢ contributes a functional k(xᵢ, ·), and the dual variables αᵢ are their weights. Support vectors in SVMs are precisely the training points whose αᵢ are nonzero - the most informative functionals.

The Neural Tangent Kernel connects infinitely wide neural networks to:

Key Ideas

  • **Dual space X*:** bounded linear functionals on X; the dual of Lp is Lq for 1/p+1/q=1; L² is self-dual
  • **Riesz theorem for Lp:** every bounded functional on Lp has the form φ(f) = ∫fg dμ for a unique g ∈ Lq
  • **RKHS:** a Hilbert space where evaluation functionals are bounded; the reproducing kernel k implements f(x) = ⟨f, k(x,·)⟩
  • **Applications:** SVM kernel trick, Neural Tangent Kernel, Bayesian optimization, the representer theorem

Related Topics

Duality connects Lp spaces to optimization and ML:

  • Lp Spaces — The dual of Lp is Lq - a direct consequence of the Lp norm structure
  • Product Measures and Fubini — The integral representation φ(f) = ∫fg dμ connects to the theory of double integrals

Вопросы для размышления

  • Why is L² self-dual while L¹ is not? What does self-duality mean geometrically for the unit ball?
  • In SVMs the dual variables αᵢ measure the 'importance' of each training point. How does this relate to the notion of a bounded functional in the dual space?
  • The Gaussian RKHS contains infinitely differentiable functions. How does this constrain what an infinitely wide neural network can learn in the NTK regime?

Связанные уроки

  • top-04
Duality and the Riesz Representation Theorem

0

1

Sign In