Optimal Transport
Entropic Regularization and Schrodinger Bridge
Цели урока
- Understand the connection between Schrodinger's 1931 problem and regularized OT and the Sinkhorn algorithm
- Explain the convergence rate and numerical challenges at small epsilon
- Apply epsilon-scaling and log-domain computation for numerically stable results
Предварительные знания
- Sinkhorn algorithm and Kantorovich potentials
- KL divergence and its variational properties
- Stochastic differential equations: Ito formula, Fokker-Planck equation
- Score matching: definition and training via denoising score matching
Denoising diffusion models generate images by running Brownian motion backwards - but they only handle Gaussian noise. What if the source is a blurry or corrupted image? Schrodinger bridges replace the Gaussian with any source distribution, making diffusion a general-purpose transport tool.
- I2SB (NeurIPS 2023) uses Schrodinger bridges for state-of-the-art image super-resolution and JPEG artifact removal
- Drug discovery: Schrodinger bridge models interpolate between ligand conformations to find low-energy transition paths
- Climate science: Schrodinger bridges model stochastic weather transitions between observed atmospheric states
- Single-cell biology: the scRNA-seq temporal interpolation problem is exactly a Schrodinger bridge between snapshot distributions
From Erwin Schrodinger's Thought Experiment to Generative AI
In 1931, Erwin Schrodinger posed a thought experiment: given a large number of Brownian particles observed at two time points with different empirical distributions, what is the most likely joint distribution of their trajectories? This is the Schrodinger bridge problem. It lay dormant for 50 years until Follmer (1988) and Jamison (1975) developed its mathematical theory. The connection to optimal transport was established by Leonard (2012) and Gentil et al. (2017). The explosion came in 2020-2023 when the ML community realized that diffusion models are Schrodinger bridges, and that the bridge framework unlocks image-to-image translation, molecular dynamics, and time series interpolation.
The Schrodinger Problem: Physics Behind the Algorithm
In 1931, Erwin Schrodinger asked: if Brownian particles start from mu and arrive at nu -- what path is most probable? Not a standard optimization problem. But in 2013 it turned out that the Schrodinger problem and regularized OT are equivalent. The Sinkhorn algorithm solves the 1931 problem.
The Schrodinger bridge is not merely an interpolation. It is the most probable evolution of a particle ensemble linking mu and nu. In generative models: direct, physically realistic trajectories from noise to data.
Schrodinger Bridge in Cell Biology
Tracking cellular differentiation from single-cell RNA-seq data (Waddington OT, Schiebinger et al. Science 2019): cells measured at different time points, trajectories recovered via Schrodinger bridge as the most probable paths compatible with observed distributions. 65,000 cells, 20,000 genes.
What is the physical meaning of the parameter epsilon in the Schrodinger problem?
In statistical physics epsilon ~ kT: at T->0 the system takes the minimum energy configuration (OT plan), at T->inf a fully random one. In OT, epsilon->0 gives deterministic transport, epsilon->inf gives uniform mixing.
Convergence Rate and Numerical Stability of Sinkhorn
Sinkhorn converges. But how fast? And how stable at small epsilon? Answer: linear convergence with rate exp(-lambda(C)/epsilon), where lambda(C) depends on the cost matrix. As epsilon->0, convergence slows exponentially. This is exactly where log-domain Sinkhorn and epsilon-scaling rescue practical computation.
Why use epsilon-scaling at small epsilon instead of running Sinkhorn directly with the target epsilon?
The convergence rate rho = exp(-lambda/epsilon) -> 1 as epsilon->0: exponentially many iterations are required. A warm start from larger epsilon provides a good initial point and drastically reduces the step count.
Entropic OT Applications: Scaling to Millions
Sinkhorn works for problems of size n=100. But what about n=100,000 points? Or batches of images during generative model training? Two approaches: mini-batch OT (approximation with bias) and stochastic Sinkhorn (online version). Both trade accuracy for scale.
In Flow Matching (Stable Diffusion 3), coupling is done via mini-batch OT with batches of 512-1024 images. Exact OT at that scale is infeasible, but mini-batch approximation is sufficient for straight trajectories.
Mini-batch OT is biased: E[W_epsilon(mu_batch, nu_batch)] != W_epsilon(mu, nu). For model quality evaluation, use full W_epsilon on the test set. For training, mini-batch is sufficient -- gradients are unbiased in expectation.
Why does mini-batch Sinkhorn give a biased estimate of W_epsilon yet is still used for training?
SGD needs unbiased gradient estimates, not an unbiased loss. E[nabla_theta W_epsilon(P_theta_batch, P_data_batch)] = nabla_theta W_epsilon(P_theta, P_data) -- holds under random sampling.
Entropy as a Bridge Between Transport and Probability
Entropic regularization reveals that optimal transport and probability theory speak the same language. Adding entropy to OT does not change the problem - it changes the perspective: from geometric transport to maximum-likelihood stochastic processes. The Schrodinger bridge is the probabilistic soul of optimal transport, and every modern generative model from diffusion to flow matching is a descendant of Schrodinger's 1931 thought experiment.
- Optimal Transport — Related topic
Итоги
- The Schrodinger problem = KL projection onto plans with marginals mu, nu = regularized OT
- Sinkhorn convergence is linear with rate exp(-lambda/epsilon): at small epsilon thousands of iterations are needed
- Log-domain Sinkhorn and epsilon-scaling are mandatory optimizations for small epsilon
- Mini-batch OT is biased as an estimate of W_epsilon but unbiased as a gradient estimate -- suitable for training
Вопросы для размышления
- As epsilon -> 0, the entropic OT plan converges to the exact OT plan. But the convergence rate depends on the regularity of the OT map. For what source/target distributions does the convergence slow down, and why?
- The Schrodinger bridge requires the reference measure (Brownian motion) to have full support. What happens if the bridge endpoints mu and nu are not absolutely continuous? Does the bridge still exist?
- I2SB conditions the score network on the degraded image x_degraded. Could one instead condition on the noise level only (like standard DDPM)? What information would be lost, and how would generation quality change?
Связанные уроки
- ot-04-sinkhorn — Sinkhorn algorithm is the computational engine for entropic OT
- ot-13-schrodinger — Schrodinger bridge is the stochastic-process interpretation of entropic OT
- ot-21 — This lesson deepens the Sinkhorn analysis with Schrodinger bridge theory
- ot-25-flow-matching — Flow matching and diffusion models use the Schrodinger bridge as a generative process
- ot-15-unbalanced