Optimal Transport

Entropic Regularization and Schrodinger Bridge

Цели урока

Understand the connection between Schrodinger's 1931 problem and regularized OT and the Sinkhorn algorithm
Explain the convergence rate and numerical challenges at small epsilon
Apply epsilon-scaling and log-domain computation for numerically stable results

Предварительные знания

Sinkhorn algorithm and Kantorovich potentials
KL divergence and its variational properties
Stochastic differential equations: Ito formula, Fokker-Planck equation
Score matching: definition and training via denoising score matching

Denoising diffusion models generate images by running Brownian motion backwards - but they only handle Gaussian noise. What if the source is a blurry or corrupted image? Schrodinger bridges replace the Gaussian with any source distribution, making diffusion a general-purpose transport tool.

I2SB (NeurIPS 2023) uses Schrodinger bridges for state-of-the-art image super-resolution and JPEG artifact removal
Drug discovery: Schrodinger bridge models interpolate between ligand conformations to find low-energy transition paths
Climate science: Schrodinger bridges model stochastic weather transitions between observed atmospheric states
Single-cell biology: the scRNA-seq temporal interpolation problem is exactly a Schrodinger bridge between snapshot distributions

From Erwin Schrodinger's Thought Experiment to Generative AI

In 1931, Erwin Schrodinger posed a thought experiment: given a large number of Brownian particles observed at two time points with different empirical distributions, what is the most likely joint distribution of their trajectories? This is the Schrodinger bridge problem. It lay dormant for 50 years until Follmer (1988) and Jamison (1975) developed its mathematical theory. The connection to optimal transport was established by Leonard (2012) and Gentil et al. (2017). The explosion came in 2020-2023 when the ML community realized that diffusion models are Schrodinger bridges, and that the bridge framework unlocks image-to-image translation, molecular dynamics, and time series interpolation.

The Schrodinger Problem: Physics Behind the Algorithm

In 1931, Erwin Schrodinger asked: if Brownian particles start from mu and arrive at nu -- what path is most probable? Not a standard optimization problem. But in 2013 it turned out that the Schrodinger problem and regularized OT are equivalent. The Sinkhorn algorithm solves the 1931 problem.

The Schrodinger bridge is not merely an interpolation. It is the most probable evolution of a particle ensemble linking mu and nu. In generative models: direct, physically realistic trajectories from noise to data.

Schrodinger Bridge in Cell Biology

Tracking cellular differentiation from single-cell RNA-seq data (Waddington OT, Schiebinger et al. Science 2019): cells measured at different time points, trajectories recovered via Schrodinger bridge as the most probable paths compatible with observed distributions. 65,000 cells, 20,000 genes.

What is the physical meaning of the parameter epsilon in the Schrodinger problem?

In statistical physics epsilon ~ kT: at T->0 the system takes the minimum energy configuration (OT plan), at T->inf a fully random one. In OT, epsilon->0 gives deterministic transport, epsilon->inf gives uniform mixing.

Convergence Rate and Numerical Stability of Sinkhorn

Sinkhorn converges. But how fast? And how stable at small epsilon? Answer: linear convergence with rate exp(-lambda(C)/epsilon), where lambda(C) depends on the cost matrix. As epsilon->0, convergence slows exponentially. This is exactly where log-domain Sinkhorn and epsilon-scaling rescue practical computation.

Why use epsilon-scaling at small epsilon instead of running Sinkhorn directly with the target epsilon?

The convergence rate rho = exp(-lambda/epsilon) -> 1 as epsilon->0: exponentially many iterations are required. A warm start from larger epsilon provides a good initial point and drastically reduces the step count.

Entropic OT Applications: Scaling to Millions

Sinkhorn works for problems of size n=100. But what about n=100,000 points? Or batches of images during generative model training? Two approaches: mini-batch OT (approximation with bias) and stochastic Sinkhorn (online version). Both trade accuracy for scale.

In Flow Matching (Stable Diffusion 3), coupling is done via mini-batch OT with batches of 512-1024 images. Exact OT at that scale is infeasible, but mini-batch approximation is sufficient for straight trajectories.

Mini-batch OT is biased: E[W_epsilon(mu_batch, nu_batch)] != W_epsilon(mu, nu). For model quality evaluation, use full W_epsilon on the test set. For training, mini-batch is sufficient -- gradients are unbiased in expectation.

Why does mini-batch Sinkhorn give a biased estimate of W_epsilon yet is still used for training?

SGD needs unbiased gradient estimates, not an unbiased loss. E[nabla_theta W_epsilon(P_theta_batch, P_data_batch)] = nabla_theta W_epsilon(P_theta, P_data) -- holds under random sampling.

Entropy as a Bridge Between Transport and Probability

Entropic regularization reveals that optimal transport and probability theory speak the same language. Adding entropy to OT does not change the problem - it changes the perspective: from geometric transport to maximum-likelihood stochastic processes. The Schrodinger bridge is the probabilistic soul of optimal transport, and every modern generative model from diffusion to flow matching is a descendant of Schrodinger's 1931 thought experiment.

Optimal Transport — Related topic

Итоги

The Schrodinger problem = KL projection onto plans with marginals mu, nu = regularized OT
Sinkhorn convergence is linear with rate exp(-lambda/epsilon): at small epsilon thousands of iterations are needed
Log-domain Sinkhorn and epsilon-scaling are mandatory optimizations for small epsilon
Mini-batch OT is biased as an estimate of W_epsilon but unbiased as a gradient estimate -- suitable for training

Вопросы для размышления

As epsilon -> 0, the entropic OT plan converges to the exact OT plan. But the convergence rate depends on the regularity of the OT map. For what source/target distributions does the convergence slow down, and why?
The Schrodinger bridge requires the reference measure (Brownian motion) to have full support. What happens if the bridge endpoints mu and nu are not absolutely continuous? Does the bridge still exist?
I2SB conditions the score network on the degraded image x_degraded. Could one instead condition on the noise level only (like standard DDPM)? What information would be lost, and how would generation quality change?

Связанные уроки

ot-04-sinkhorn — Sinkhorn algorithm is the computational engine for entropic OT
ot-13-schrodinger — Schrodinger bridge is the stochastic-process interpretation of entropic OT
ot-21 — This lesson deepens the Sinkhorn analysis with Schrodinger bridge theory
ot-25-flow-matching — Flow matching and diffusion models use the Schrodinger bridge as a generative process
ot-15-unbalanced