Measure Theory

Optimal Transport: A Measure-Theoretic View

How can two probability distributions be compared as mass distributions, and what does this have to do with neural networks?

**Wasserstein GAN (WGAN):** replacing KL with W1 eliminated vanishing gradients, stabilized training, and produced 10,000+ derivative works
**Biomedicine:** optimal transport compares gene expression histograms in single-cell RNA-seq, identifying cell transitions in differentiation
**Economics:** Monge's original problem (1781) - optimal allocation of earthwork; Brenier's theorem (1987) solved it via convex potentials
**Computational physics:** geodesics in Wasserstein space describe smooth deformations of distributions used in fluid flow simulations

Предварительные знания

Disintegration of measures
Weak convergence of measures
Convex analysis and duality

Disintegration of Measures

The Monge Problem and Kantorovich Relaxation

In 1781, Gaspard Monge tried to minimize earthwork for military fortifications. 161 years later, Leonid Kantorovich (1942) rescued the problem from a dead end: instead of a rigid map, he introduced random transport plans. The same idea underlies the Wasserstein GAN today: instead of a deterministic function, a probabilistic coupling between distributions.

Kantorovich received the 1975 Nobel Prize in economics for linear programming. The duality between OT and LP is no coincidence: transport on finite supports is a classical transportation linear program.

Why may the Monge problem have no solution while the Kantorovich problem always does?

Wasserstein Distance and Duality

In 2017, Facebook Research released Wasserstein GAN: replacing KL divergence with W1 eliminated vanishing gradients and stabilized training by a factor of 5. By 2024, the paper had spawned over 10,000 follow-up publications. The key mechanism is Kantorovich-Rubinstein duality, which turns W1 into a 1-Lipschitz neural network training problem.

For one-dimensional measures the optimal transport map is the quantile transform T*(x) = F_nu^{-1}(F_mu(x)). Sorting both samples and matching by rank yields the monotone coupling.

Kantorovich-Rubinstein duality: W1(mu,nu) = sup_{||f||_L<=1} integral f d(mu-nu). What does ||f||_L <= 1 require?

Brenier Theorem and Wasserstein Geometry

In 1987 Yves Brenier made a surprising turn: the optimal transport map for quadratic cost on R^d is always unique and is the gradient of a convex function. This linked OT with Monge-Ampere theory and gave rise to Wasserstein space as a Riemannian manifold of measures - the foundation of McCann displacement interpolation and gradient flows for the continuity equation.

Connections to other areas

Optimal transport unifies convex analysis, measure theory, and differential geometry.

Generative models — WGAN and WGAN-GP use W1 as the loss function - robustness to degenerate distributions eliminates mode collapse
Differential geometry — P_2(R^d) is a metric space with geodesics described by McCann displacement interpolation
Partial differential equations — Monge-Ampere equation for the optimal potential; continuity equation as gradient flow in P_2 (JKO formalism)
Computational biology — Sinkhorn entropic regularization computes approximate W_p in O(n^2); applied to single-cell RNA-seq and cell atlases

Итоги

Monge problem seeks a deterministic map T: may fail when mu is atomic
Kantorovich relaxation admits random plans gamma on X x Y; linear program always solvable
W_p(mu,nu) is a metric on probability measures sensitive to support geometry
Kantorovich-Rubinstein duality: W1 = sup_{Lip_1} (E_mu[f] - E_nu[f]); the foundation of WGAN
Brenier theorem: for W2 on R^d the optimal T* = nabla psi is unique via a convex potential
Displacement interpolation - geodesics in P_2: the basis of gradient flows for PDE

What does Brenier's theorem state for optimal transport with quadratic cost on R^d?