Measure Theory
Optimal Transport: A Measure-Theoretic View
How can two probability distributions be compared as mass distributions, and what does this have to do with neural networks?
- **Wasserstein GAN (WGAN):** replacing KL with W1 eliminated vanishing gradients, stabilized training, and produced 10,000+ derivative works
- **Biomedicine:** optimal transport compares gene expression histograms in single-cell RNA-seq, identifying cell transitions in differentiation
- **Economics:** Monge's original problem (1781) - optimal allocation of earthwork; Brenier's theorem (1987) solved it via convex potentials
- **Computational physics:** geodesics in Wasserstein space describe smooth deformations of distributions used in fluid flow simulations
Предварительные знания
- Disintegration of measures
- Weak convergence of measures
- Convex analysis and duality
The Monge Problem and Kantorovich Relaxation
In 1781, Gaspard Monge tried to minimize earthwork for military fortifications. 161 years later, Leonid Kantorovich (1942) rescued the problem from a dead end: instead of a rigid map, he introduced random transport plans. The same idea underlies the Wasserstein GAN today: instead of a deterministic function, a probabilistic coupling between distributions.
Kantorovich received the 1975 Nobel Prize in economics for linear programming. The duality between OT and LP is no coincidence: transport on finite supports is a classical transportation linear program.
Why may the Monge problem have no solution while the Kantorovich problem always does?
Wasserstein Distance and Duality
In 2017, Facebook Research released Wasserstein GAN: replacing KL divergence with W1 eliminated vanishing gradients and stabilized training by a factor of 5. By 2024, the paper had spawned over 10,000 follow-up publications. The key mechanism is Kantorovich-Rubinstein duality, which turns W1 into a 1-Lipschitz neural network training problem.
For one-dimensional measures the optimal transport map is the quantile transform T*(x) = F_nu^{-1}(F_mu(x)). Sorting both samples and matching by rank yields the monotone coupling.
Kantorovich-Rubinstein duality: W1(mu,nu) = sup_{||f||_L<=1} integral f d(mu-nu). What does ||f||_L <= 1 require?
Brenier Theorem and Wasserstein Geometry
In 1987 Yves Brenier made a surprising turn: the optimal transport map for quadratic cost on R^d is always unique and is the gradient of a convex function. This linked OT with Monge-Ampere theory and gave rise to Wasserstein space as a Riemannian manifold of measures - the foundation of McCann displacement interpolation and gradient flows for the continuity equation.
Connections to other areas
Optimal transport unifies convex analysis, measure theory, and differential geometry.
- Generative models — WGAN and WGAN-GP use W1 as the loss function - robustness to degenerate distributions eliminates mode collapse
- Differential geometry — P_2(R^d) is a metric space with geodesics described by McCann displacement interpolation
- Partial differential equations — Monge-Ampere equation for the optimal potential; continuity equation as gradient flow in P_2 (JKO formalism)
- Computational biology — Sinkhorn entropic regularization computes approximate W_p in O(n^2); applied to single-cell RNA-seq and cell atlases
Итоги
- Monge problem seeks a deterministic map T: may fail when mu is atomic
- Kantorovich relaxation admits random plans gamma on X x Y; linear program always solvable
- W_p(mu,nu) is a metric on probability measures sensitive to support geometry
- Kantorovich-Rubinstein duality: W1 = sup_{Lip_1} (E_mu[f] - E_nu[f]); the foundation of WGAN
- Brenier theorem: for W2 on R^d the optimal T* = nabla psi is unique via a convex potential
- Displacement interpolation - geodesics in P_2: the basis of gradient flows for PDE
What does Brenier's theorem state for optimal transport with quadratic cost on R^d?