Optimal Transport
Gromov-Wasserstein
How can a protein shape be compared to a galaxy shape? The protein is encoded as an amino acid contact matrix; the galaxy as a stellar density map. They have no shared space, so Wasserstein cannot be applied. It needs a common metric to price transport. Facundo Memoli (2011) changed the question: not 'where to send each point', but 'which transport plan preserves pairwise distances within each object'. That is Gromov-Wasserstein.
- **Brain connectomes:** aligning neural connectivity graphs across subjects via FGW - graph topology + regional activation profiles simultaneously matched
- **Single-cell multi-omics:** integrating RNA-seq and ATAC-seq data via GW - chromatin accessibility and gene expression live in different spaces, GW finds cell correspondences
- **Shape analysis:** comparing 3D molecular shapes for drug discovery - GW is invariant to rotations and reflections, Wasserstein is not
Предварительные знания
Gromov-Wasserstein formulation
How can the shape of a protein be compared to the shape of a galaxy when they live in completely different metric spaces? The protein is described by an amino acid contact map; the galaxy by a stellar density map. There is no shared ambient space, so classical Wasserstein distance cannot be applied, because it needs a common metric to measure transport cost. Facundo Memoli (2011) reframed the question: instead of asking 'where to move each point', ask 'which transport plan best preserves pairwise distances within each object'. That is Gromov-Wasserstein.
Formally: given metric measure spaces $(X, d_X, \mu)$ and $(Y, d_Y, \nu)$. A transport plan $T$ is a probability measure on $X \times Y$ with marginals $\mu$ and $\nu$. The Gromov-Wasserstein problem:
**Gromov-Wasserstein problem (Memoli 2011):** $GW(\mu, \nu) = \min_{T \in \Pi(\mu, \nu)} \iint_{X \times Y} \iint_{X \times Y} |d_X(x, x') - d_Y(y, y')|^2 \, dT(x,y) \, dT(x',y')$ Cost of transport: how much does the distance between $x$ and $x'$ in space $X$ differ from the distance between their images $y$ and $y'$ in space $Y$. **Discrete case** (distance matrices $C^X \in \mathbb{R}^{m \times m}$, $C^Y \in \mathbb{R}^{n \times n}$): $GW = \min_{T \in \Pi(\mu, \nu)} \sum_{i,j,k,l} |C^X_{ik} - C^Y_{jl}|^2 T_{ij} T_{kl}$ where $T_{ij} \geq 0$, $\sum_j T_{ij} = \mu_i$, $\sum_i T_{ij} = \nu_j$.
Why can classical Wasserstein distance not be used to compare a protein shape to a galaxy shape?
Quadratic nature of GW and Fused GW
The GW functional is quadratic in $T$: it contains $T_{ij} T_{kl}$, not just $T_{ij}$. This is a fundamental difference from Wasserstein, which is linear in $T$ and therefore solvable as a linear program. GW is a quadratic program over the transportation polytope, making it NP-hard in general to find the global optimum.
**Fused Gromov-Wasserstein (Titouan et al. 2019):** When points have both structure (graph/metric) and features, Fused GW balances both aspects: $FGW_{\alpha}(\mu, \nu) = \min_{T \in \Pi} \alpha \cdot GW(T) + (1-\alpha) \cdot W(T)$ where $\alpha \in [0,1]$ controls the trade-off between structural (GW) and feature (Wasserstein) terms. **FGW applications:** - **Brain connectomes:** graph of neural connections + region activations. GW aligns topology, W aligns functional profiles - **Single-cell RNA-seq:** cells from two samples live in different expression spaces. FGW finds correspondences preserving both transcriptomic profiles (W) and pairwise cell distances (GW)
Why is the GW problem harder to solve than the Wasserstein problem?
Computing GW: Frank-Wolfe and Sliced GW
The standard approach to GW is the Frank-Wolfe method (alternating projections): linearize the quadratic functional around the current $T$, solve the linearized problem as Wasserstein (LP), step toward that solution. Each iteration costs $O(m^2 n^2)$ (tensor operations on the cost tensor) plus solving a Wasserstein problem. This is prohibitive for large graphs.
**Frank-Wolfe algorithm for GW:** 1. Initialize: $T^{(0)} = \mu \nu^\top$ (independent transport plan) 2. Iteration $t$: - Compute gradient: $\nabla_T \mathcal{L}(T^{(t)}) = $ linear function of $T^{(t)}$ - Solve linearized problem: $S = \arg\min_{T'\in\Pi} \langle \nabla_T \mathcal{L}, T' \rangle$ (a W problem) - Update: $T^{(t+1)} = (1-\gamma_t) T^{(t)} + \gamma_t S$ 3. Converges to a local optimum **Sliced Gromov-Wasserstein (Vayer et al. 2019):** Project both distributions onto random one-dimensional subspaces, compute 1D GW (solved analytically), average. Cost: $O(mn \log(mn))$ instead of $O(m^2 n^2)$. Biased estimator, but scales to large graphs.
GW distance is just Wasserstein applied to distance matrices (apply W to rows of C^X and C^Y)
GW is a joint optimization over transport plan T that simultaneously aligns pairwise distances of both spaces. Applying W to distance matrices gives a different functional without the geometric meaning of GW
In GW the variable T appears twice: T_{ij} accounts for pair (x_i, y_j) AND for pair (x_k, y_l) through the product T_{ij} T_{kl}. This quadratic coupling is what makes the problem structurally different from the linear Wasserstein transport
What does the Frank-Wolfe method guarantee when applied to GW?
Key ideas
- **GW = transport of pairwise distances:** $\min_T \iint |d_X(x,x') - d_Y(y,y')|^2 dT dT$. Applicable when no shared space exists
- **Quadratic = NP-hard:** unlike W (linear in T, solved as LP), GW is quadratic. Frank-Wolfe converges to a local optimum
- **Fused GW:** $\alpha \cdot GW + (1-\alpha) \cdot W$ balances structure and features. Used in brain connectomes and single-cell biology
- **POT library:** `ot.gromov.gromov_wasserstein`, `ot.gromov.fused_gromov_wasserstein`. Sliced GW for scalability
Related topics
GW extends optimal transport theory beyond a single ambient space:
- Wasserstein Distance — GW generalizes W to the case of different metric spaces
- Sinkhorn Algorithm — Sinkhorn solves W with entropic regularization; analogous entropic regularization applies to GW
Вопросы для размышления
- GW is invariant to isometries (rotations, reflections) of each space. Why is this property critical for comparing molecular shapes?
- Fused GW with alpha=0 reduces to Wasserstein, with alpha=1 to pure GW. What value of alpha makes sense when graph structure matters more than vertex features?
- Frank-Wolfe converges to a local GW optimum. How does the choice of initial plan T^(0) affect the final result?
Связанные уроки
- ot-03-wasserstein — GW generalizes Wasserstein to different metric spaces
- ot-04-sinkhorn — Sinkhorn solves W with entropic regularization; analogous iteration exists for GW
- calc-01-sequences