Optimal Transport
Brenier's theorem: transport = gradient of a convex function
1991. Yann Brenier takes two distributions in $\mathbb{R}^d$ and closes a 210-year-old Monge problem with one formula: the optimal map is the gradient of a convex function. $T(x) = \nabla\varphi(x)$. Not an approximation, not a numerical scheme - an exact description of the solution. Thirty years later this formula becomes the foundation of the FID metric (used to score every GAN and diffusion model), the ICNN architecture, Flow Matching in Stable Diffusion 3 and Meta MovieGen. Optimal transport at quadratic cost and convex analysis turn out to be two names for the same object.
- **FID metric** (Heusel et al. 2017): Frechet Inception Distance - Brenier's formula for Gaussians in Inception feature space. $W_2^2 = \|m_r - m_f\|^2 + \mathrm{tr}(\Sigma_r + \Sigma_f - 2(\Sigma_r^{1/2}\Sigma_f\Sigma_r^{1/2})^{1/2})$. Standard for assessing GAN, diffusion, and flow-model quality
- **Input Convex Neural Networks** (Amos et al. 2017): architecture with guaranteed input convexity, $\nabla \text{ICNN}$ is a valid OT map per Brenier. Used in domain adaptation, 3D shape morphing, non-equilibrium thermodynamics
- **Flow Matching and Rectified Flow** (Lipman 2022, Liu 2022): training $v_t = \nabla\varphi_t$ as a neural network. Stable Diffusion 3, Meta MovieGen, Pika train 5-10x faster than classical DDPM
- **OT-Flow** (Onken et al. 2021): CNF with the Benamou-Brenier kinetic regularizer. Density estimation orders of magnitude faster than standard normalizing flow
Предварительные знания
- Kantorovich duality and c-concave functions
- Sinkhorn algorithm (for context on discrete approximation)
- Legendre transform (recommended)
Brenier's theorem: transport = gradient of a convex function
Monge problem: find $T: \mathbb{R}^d \to \mathbb{R}^d$ pushing $\mu$ to $\nu$ with minimum quadratic cost $\int \|x - T(x)\|^2 \, d\mu(x)$. Smooth densities, finite second moment - nothing else is required. Brenier 1991: the solution exists, is unique, and **looks like the gradient of a convex function**.
**Why convexity yields uniqueness**. If $T_1, T_2$ are two optimal maps with $T_i = \nabla\varphi_i$, both convex, then strict convexity of the quadratic cost together with cyclic monotonicity force $\varphi_1 = \varphi_2 + \text{const}$. The gradients agree everywhere, hence $T_1 = T_2$ a.e. with respect to $\mu$. Without convexity (for instance at $c = \|x-y\|$) uniqueness breaks: for $W_1$ the optimal map need not be a gradient and need not be unique.
Monge-Ampere equation
Mass conservation $T_\# \mu = \nu$ together with $T = \nabla\varphi$ produce a concrete PDE on $\varphi$. Through the change-of-variables formula the Jacobian of $T = \nabla\varphi$ equals $D^2\varphi$ (the Hessian), and the density condition becomes:
**Fully nonlinear elliptic PDE**. Linear elliptic equations (Laplace, Poisson) are sums of second derivatives. Monge-Ampere is a product of Hessian eigenvalues. Ellipticity holds only when $D^2\varphi \succ 0$, i.e. on the set of strictly convex $\varphi$ - precisely the Brenier condition. Caffarelli established regularity (1990s); for this and related theorems he received the Wolf Prize 2012 and the Abel Prize 2023.
From Brenier to WGAN and Flow Matching
Wasserstein-2 distance via the Brenier potential: $W_2^2(\mu,\nu) = \int \|x - \nabla\varphi(x)\|^2 d\mu$. This is primal. WGAN (Arjovsky 2017) trains the Kantorovich-Rubinstein dual of $W_1$ via a 1-Lipschitz critic - a different road. Brenier offers a direct path: parametrize $\varphi$ itself as a convex neural network and take its gradient. This is how ICNN (Amos 2017) and most modern OT-based generative models operate.
Continuous Normalizing Flows and Flow Matching push it one step further. Benamou-Brenier dynamic formulation (2000): $W_2^2(\mu,\nu) = \inf \int_0^1 \int \|v_t\|^2 \rho_t \, dx \, dt$ subject to the continuity equation $\partial_t \rho_t + \nabla\cdot(\rho_t v_t) = 0$. The minimum is attained at $v_t = \nabla\varphi_t$ - the velocity field is the gradient of a time-dependent potential. Flow Matching (Lipman 2022) trains a neural network $v_\theta(t,x)$ to predict this field. Stable Diffusion 3, Meta MovieGen, Pika are all built on this idea.
**Diffusion and OT - two sides of the same coin**. DDPM (Ho 2020) adds noise along a Markov chain and learns the reverse. In the small-step limit this becomes an SDE; dropping the stochastic term yields an ODE - a transport flow in the Brenier sense. Albergo-Vanden-Eijnden 2023 and Liu et al. 2022 (Rectified Flow) formalized the equivalence: deterministic diffusion = time-discretized OT. Any modern image generator implicitly solves Monge-Ampere.
Brenier's statement
1991. Yann Brenier takes two distributions in $\mathbb{R}^d$ (source absolutely continuous) and closes a 210-year-old Monge problem with a single formula: the gradient of a convex function. Not approximately. Exactly. In any dimension.
**Brenier's theorem** (1991, *Comm. Pure Appl. Math.*). Let $\mu, \nu$ be probability measures on $\mathbb{R}^d$ with finite second moment, $\mu$ absolutely continuous with respect to Lebesgue. Then the Monge problem with cost $c(x,y) = \tfrac{1}{2}\|x-y\|^2$ admits a unique solution $T$, and there exists a convex function $\varphi: \mathbb{R}^d \to \mathbb{R}$ such that:
This $\varphi$ is the **Brenier potential** and it is convex. Its link to Kantorovich duality goes through the dual c-concave potential $\psi(x) = \tfrac{1}{2}\|x\|^2 - \varphi(x)$: at quadratic cost, c-concavity of $\psi$ is exactly equivalent to convexity of $\varphi$. The map is a gradient. End of story.
**Why a convex function**. Cyclical monotonicity of an optimal plan: $\sum_i \langle x_i - x_{i+1}, T(x_i) \rangle \geq 0$ for any cycle. Rockafellar's theorem says this is exactly the condition for $T$ to be a subgradient of some convex function. For smooth $\mu$ the subgradient coincides with the gradient almost everywhere. The whole geometry follows from the form $\|x-y\|^2 = \|x\|^2 - 2\langle x,y \rangle + \|y\|^2$ - the linear-in-$y$ term is exactly the Legendre pairing.
What this means intuitively. Optimal transport at quadratic cost never crosses trajectories: if $T(x_1)=y_1, T(x_2)=y_2$, then segments $[x_1,y_1]$ and $[x_2,y_2]$ either disjoint or coincide. The gradient of a convex function is monotone in the multivariate sense: $\langle \nabla\varphi(x_1) - \nabla\varphi(x_2), x_1 - x_2 \rangle \geq 0$. Monotonicity rules out any crossing.
**Connection to polar factorization**. Matrix analog of Brenier: any matrix $A = OS$ with $O$ orthogonal, $S$ symmetric positive. Brenier is the infinite-dimensional generalization: any map $S: \mathbb{R}^d \to \mathbb{R}^d$ factors into a measure-preserving map and the gradient of a convex function. Polar factorization is a special case.
Brenier's theorem asserts the existence of a convex $\varphi$ such that $T = \nabla \varphi$ is the optimal map at $c = \|x-y\|^2$. Which condition on $\mu$ is essential for uniqueness?
Monge-Ampere equation
Brenier says $T = \nabla\varphi$. Mass conservation $T_\# \mu = \nu$ translates into a differential equation on the potential. Through the change-of-variables formula: if $\mu$ has density $\rho_0$, $\nu$ has density $\rho_1$, the Jacobian of $T = \nabla\varphi$ must absorb the density ratio.
This is the **Monge-Ampere equation** - a fully nonlinear second-order PDE. Its existence theory was studied since Monge; regular solutions for smooth data were established by Caffarelli (1990s), for which he received the Wolf Prize 2012 and the Abel Prize 2023.
**Why the equation is hard**. Linear elliptic PDEs (Laplace, Poisson) yield symmetric operators and spectral methods. Monge-Ampere contains $\det(D^2\varphi)$ - a product of Hessian eigenvalues. A fundamentally nonlinear object: a perturbation at one point alters the geometry globally through the product. Ellipticity holds only when $\varphi$ is convex - guaranteed by Brenier's theorem itself. Caffarelli proved: with smooth densities bounded below, $\varphi \in C^{2,\alpha}$ - so the map $T$ is continuously differentiable.
The Gaussian case is the only one with a closed form. For $\mu = \mathcal{N}(m_0, \Sigma_0)$, $\nu = \mathcal{N}(m_1, \Sigma_1)$, the Brenier potential is affine-quadratic:
From this formula one obtains a closed form for $W_2^2$ between Gaussians - used in the FID metric (Frechet Inception Distance) for assessing GAN and diffusion model quality.
**FID rests on Brenier**. Inception embeddings of real and fake images are approximated as Gaussians, and FID = $\|m_r - m_f\|^2 + \mathrm{tr}(\Sigma_r + \Sigma_f - 2(\Sigma_r^{1/2}\Sigma_f\Sigma_r^{1/2})^{1/2})$ - exactly $W_2^2$ between Gaussians via Brenier's formula. Every paper on diffusion or GAN reporting an FID number is an implicit user of the 1991 theorem.
The Monge-Ampere equation $\det(D^2\varphi) = \rho_0 / \rho_1 \circ \nabla\varphi$ is a second-order PDE. What makes it fundamentally harder than the Laplace equation?
From Brenier to WGAN and Flow Matching
The Wasserstein-2 distance is expressed via the Brenier potential directly: $W_2^2(\mu, \nu) = \int \|x - \nabla\varphi(x)\|^2 \, d\mu(x)$. This is the primal formulation - in terms of the map itself. WGAN (Arjovsky 2017) trains the Kantorovich-Rubinstein dual for $W_1$ via a 1-Lipschitz critic. Brenier offers another route: parametrize $\varphi$ directly as a convex neural network.
**Input Convex Neural Networks** (Amos et al. 2017) - architecture that constructively guarantees convexity in the input. Non-negative weights, monotone non-decreasing convex activations (ReLU, Softplus). The gradient of such a network is an optimal map by Brenier's theorem. Applications: domain adaptation in computer vision, non-equilibrium thermodynamics, 3D shape morphing.
**Continuous Normalizing Flows** (CNF, Chen et al. 2018) and **Flow Matching** (Lipman et al. 2022) go further. Instead of a single map $T$, a whole family $T_t$, $t \in [0,1]$, interpolating between $\mu$ and $\nu$. By the dynamic Benamou-Brenier formulation (2000):
subject to $\partial_t \rho_t + \nabla \cdot (\rho_t v_t) = 0$. The minimum is attained at $v_t = \nabla \varphi_t$ - and $\nabla \varphi_t$ is a straight geodesic in Wasserstein space. Flow Matching trains a neural net $v_\theta(t,x)$ to predict the optimal velocity field. Stable Diffusion 3, Meta MovieGen, Pika - all built on this idea, training 5-10x faster than classical DDPM.
**Diffusion as time-discretized OT**. DDPM (Ho et al. 2020) takes steps $x_t \to x_{t+1}$ via stochastic perturbation. In the small-step limit this is an SDE; dropping the noise term yields an ODE - a transport flow in the Brenier sense. The connection was formalized by Albergo and Vanden-Eijnden 2023, Liu et al. 2022 (Rectified Flow). Modern diffusion and flow matching are two sides of one OT coin.
**OT-Flow** (Onken et al. 2021) merges CNF and Brenier explicitly. The velocity field is parametrized as $v_t(x) = \nabla \Phi(t, x)$ for a scalar $\Phi$, trained with a kinetic-energy regularizer $\|\nabla\Phi\|^2$ - literally the Benamou-Brenier functional. The result: density estimation training tens of times faster than standard CNF, and the flow tends toward straight-line OT geodesics.
Brenier's theorem is pure 1990s mathematics with no relevance to neural networks
Brenier is the technical foundation of Wasserstein-2 geometry in ML: FID metric, ICNN, Flow Matching, OT-Flow, Rectified Flow - all rest on $T = \nabla\varphi$
The FID metric for GAN/diffusion evaluation is Brenier's formula for Gaussians. Flow Matching and Stable Diffusion 3 train $v_t = \nabla\varphi_t$ as a neural net - the parametrized Benamou-Brenier functional. Amos's ICNN delivers primal OT through a convex network. Without the 1991 theorem none of these tools would exist in their current form.
Input Convex Neural Networks (ICNN) parametrize a convex function $\varphi: \mathbb{R}^d \to \mathbb{R}$. Why is this useful for optimal transport?
Key ideas
- **Brenier's theorem 1991**: at $c = \|x-y\|^2$ the optimal Monge map exists, is unique, and equals $T = \nabla\varphi$ for convex $\varphi$. Required condition on $\mu$: absolute continuity
- **Monge-Ampere equation**: $\det(D^2\varphi) = \rho_0 / \rho_1\circ\nabla\varphi$ - a fully nonlinear elliptic PDE. Regularity by Caffarelli (1990s)
- **Gaussian case**: closed form $T(x) = m_1 + A(x-m_0)$, $A = \Sigma_0^{-1/2}(\Sigma_0^{1/2}\Sigma_1\Sigma_0^{1/2})^{1/2}\Sigma_0^{-1/2}$. Foundation of the FID metric
- **ICNN** (Amos 2017): convex neural network $\varphi$, $\nabla\varphi$ is a valid Brenier map by construction
- **Benamou-Brenier 2000**: dynamic formulation of $W_2$ via velocity field $v_t = \nabla\varphi_t$. Foundation of Flow Matching and SD3
- **Diffusion = time-discretized OT**: classical DDPM, Rectified Flow, and Flow Matching are three names for one object - a Brenier-style transport flow
What comes next
Brenier's theorem is the technical bridge between OT and modern ML:
- Wasserstein GAN — WGAN solves the dual $W_1$, Brenier handles primal $W_2$ via gradient of potential
- Flow Matching — Dynamic Brenier: $v_t = \nabla\varphi_t$ trained as a neural net
- Lagrange duality — Same primal-dual pair via convex conjugation
- Amari's dually flat structure — Information geometry: e/m-duality as a generalization of Legendre
Вопросы для размышления
- Brenier requires $\mu$ to be absolutely continuous. What happens to the theorem for discrete $\mu$ (a batch of $n$ delta functions)? How does Sinkhorn sidestep this restriction?
- The Monge-Ampere equation $\det(D^2\varphi) = \rho_0/\rho_1\circ\nabla\varphi$ is a nonlinear PDE. Why do classical spectral methods (Fourier transform, finite element) struggle here, and which numerical approaches actually work in practice?
- ICNN parametrizes a convex $\varphi$, $\nabla\text{ICNN}$ delivers an OT map. What is the cost of this restriction - what is lost compared to an arbitrary neural network, and why is it still worth paying?
- Flow Matching trains $v_\theta(t,x)$ to approximate $\nabla\varphi_t$. What conditions on $v_\theta$ are needed for the resulting flow to be an OT geodesic rather than an arbitrary interpolation between $\mu$ and $\nu$?
Связанные уроки
- ot-05-dual — Brenier is a special case of Kantorovich duality at quadratic cost
- ot-04-sinkhorn — Sinkhorn potentials approximate the Brenier potential as $\varepsilon \to 0$
- ot-07-wgan — WGAN solves the dual; Brenier gives a direct primal description of the map
- ot-11-flow-matching — Flow Matching trains $\nabla \varphi_t$ as a neural network on OT maps
- cvx-04 — Same primal-dual construction via Legendre conjugation
- ig-05-dual-flat — Amari's dually flat structure: e/m-duality through convex conjugation
- calc-01-sequences