Optimal Transport

Multi-Marginal Optimal Transport

Training a GAN requires computing a Wasserstein distance between the generated distribution and the real data distribution - a 2-marginal optimal transport problem. But in multi-agent reinforcement learning, fairness constraints in finance, or density functional theory in quantum chemistry, you need to transport between $k > 2$ distributions simultaneously. That is multi-marginal optimal transport.

Wasserstein barycenters in federated learning: averaging model weights across $k$ clients preserves geometry by computing the Frechet mean in Wasserstein space - a $k$-marginal OT problem. Used in FedAvg variants where simple averaging distorts learned representations.
Density functional theory (DFT): the Hohenberg-Kohn functional for $N$ electrons can be reformulated as an $N$-marginal OT problem with cost $c(x_1,\ldots,x_N) = \sum_{i<j} 1/|x_i-x_j|$. This insight (Cotar-Friesecke-Kluppelberg, 2013) connects quantum chemistry to convex optimization.
Color transfer between multiple images: transferring the color palette from $k$ source images to a target simultaneously requires a Wasserstein barycenter - the multi-marginal problem with quadratic cost.

Цели урока

Formulate the multi-marginal optimal transport problem and identify its special cases
Describe the Wasserstein barycenter problem as a $k$-marginal OT problem
Apply Sinkhorn's algorithm extended to multi-marginal problems for numerical computation

Предварительные знания

2-marginal optimal transport: Kantorovich and Monge formulations
Wasserstein distances and their properties
Entropy regularization and Sinkhorn's algorithm

Multi-marginal OT: formulation

Given probability measures $\mu_1, \ldots, \mu_k$ on spaces $X_1, \ldots, X_k$ and a cost function $c: X_1 \times \ldots \times X_k \to \mathbb{R}$, the multi-marginal OT problem is $\inf_{\gamma \in \Pi(\mu_1,\ldots,\mu_k)} \int c \, d\gamma$, where $\Pi(\mu_1,\ldots,\mu_k)$ is the set of couplings with all prescribed marginals. At $k=2$ this reduces to classical OT. The dual: $\sup_{u_1,\ldots,u_k} \sum_i \int u_i \, d\mu_i$ subject to $\sum_i u_i(x_i) \leq c(x_1,\ldots,x_k)$.

The DFT connection (Cotar-Friesecke-Kluppelberg): the Levy-Lieb functional $F[\rho] = \min_{\Psi \mapsto \rho} \langle\Psi|T+V_{ee}|\Psi\rangle$ in the strongly correlated limit ($\lambda \to \infty$) converges to an $N$-marginal OT problem with Coulomb cost $c(x_1,\ldots,x_N) = \sum_{i<j} 1/|x_i - x_j|$. This reformulation enables convex optimization methods in quantum chemistry.

Computational methods

Entropy-regularized multi-marginal OT: add $\varepsilon H(\gamma) = -\varepsilon \int \log \gamma \, d\gamma$. The solution has the form $\gamma^* = \exp\left(\frac{-c + \sum_i u_i(x_i)}{\varepsilon}\right) \prod_i \mu_i$. Sinkhorn's algorithm generalizes: alternately project onto each marginal constraint. Convergence is guaranteed and each projection step has closed form. Complexity: $O(n^k)$ for discrete distributions with $n$ support points - exponential in $k$, making large $k$ challenging.

Multi-marginal OT with $k$ marginals and $n$ support points each requires an $n^k$ coupling tensor. For $k=3$ and $n=1000$, this is $10^9$ entries - already borderline in memory. Practical algorithms exploit structure in the cost function (e.g., separability, tree structure) to avoid explicitly constructing the full tensor.

Classical 2-marginal OT was formulated by Gaspard Monge in 1781 and relaxed by K

Classical 2-marginal OT was formulated by Gaspard Monge in 1781 and relaxed by Kantorovich in 1942. The multi-marginal extension emerged independently in economics (Carlier 2003, optimal risk sharing) and quantum chemistry (Cotar-Friesecke-Kluppelberg 2013). The Wasserstein barycenter was introduced by Agueh and Carlier in 2011, becoming a central tool in computational statistics within five years.

Multi-Marginal OT: Formulation and Complexity

Classical OT couples two point clouds and seeks an optimal matching. Multi-marginal OT (MOT) couples k clouds simultaneously. The plan becomes a k-dimensional tensor, so memory and runtime scale as n^k. This monster is unavoidable for DFT of N-electron systems, Wasserstein barycenters, and time series alignment where two marginals are fundamentally insufficient.

Why is MOT with k>2 marginals exponentially harder than two-marginal OT?

DFT as N-marginal OT: Quantum Chemistry

Density Functional Theory (DFT) underpins modern quantum chemistry. The exchange-correlation functional E_xc[rho] is not known in closed form. In 2013 Cotar, Friesecke, and Klueppelberg proved that the strongly correlated limit of E_xc coincides with an N-marginal optimal transport problem with Coulomb cost. By 2024 this connection had reached AlphaFold 3, where MOT-inspired constraints regularise neural density functionals.

Why is N-marginal MOT needed for the exchange-correlation functional, given that the Coulomb interaction is pairwise?

Wasserstein Barycenters as MOT

Averaging two probability measures component-wise produces a blurry mixture that resembles neither input. The Wasserstein barycenter averages via optimal transport and preserves modes and geometry. The barycenter problem is a special case of MOT with k source marginals and a free central marginal.

Why is the Wasserstein barycenter preferable to component-wise averaging of measures?

Wasserstein barycenter as multi-marginal OT

The Wasserstein barycenter of $\mu_1, \ldots, \mu_k$ with weights $\lambda_i > 0$, $\sum \lambda_i = 1$ minimizes $\sum_i \lambda_i W_2^2(\mu, \mu_i)$ over all probability measures $\mu$. Agueh-Carlier (2011) showed this equals a $(k+1)$-marginal OT problem: jointly transport $\mu_1, \ldots, \mu_k$ and their barycenter $\mu$, with cost $c(x, x_1,\ldots,x_k) = \sum_i \lambda_i |x - x_i|^2$. The barycenter is the marginal of the optimal coupling on the first variable.

Итоги

Multi-marginal OT generalizes classical transport to $k$ distributions simultaneously.
Wasserstein barycenters, DFT, and multi-agent fairness constraints are natural instances.
Entropy-regularized MMOT extends Sinkhorn's algorithm but suffers exponential complexity in $k$.
Structural assumptions on the cost (separability, tree structure) are essential for scalability.

Connections to other topics

2-marginal OT and Sinkhorn (ot-20) are the foundation; MMOT is the multi-variable generalization. Free probability theory provides another angle: the free Wasserstein barycenter connects to non-commutative probability. In machine learning, GAN training (2-marginal) is the special case, while federated learning averaging (k-marginal) is the application that motivates MMOT.

Ot 20 — related

Вопросы для размышления

For the quadratic cost $c(x_1,\ldots,x_k) = \sum_{i<j} |x_i - x_j|^2$, show that the multi-marginal OT problem decouples: the optimal coupling concentrates on maps $T_i$ satisfying $\sum_i T_i = k\bar{x}$ where $\bar{x}$ is the barycenter location. What does this tell you about the structure of the Wasserstein barycenter in the Gaussian case?