Topology

Bottleneck and Wasserstein Distances on Persistence Diagrams

Цели урока

Understand the structure of persistence diagrams as a metric space
Compute bottleneck distance via bipartite matching
Distinguish Wasserstein from its sliced approximation
Vectorize diagrams to fit any ML pipeline

Предварительные знания

Stability theorem (top-32)
Persistent homology and barcodes
L_p norms and metric spaces
Basics of optimal transport

Without metrics on diagrams TDA does not integrate into ML pipelines.

**Protein structure comparison**: Wasserstein on persistence diagrams classifies structural classes
**Anomaly detection**: bottleneck distance flags outliers in production monitoring
**Neural-network layers**: PersLay and TopNet use sliced Wasserstein as a differentiable loss
**Medical imaging**: persistence images feed CNNs alongside raw scans

Metrics on diagrams

In 2009 Edelsbrunner and Harer formalised persistence diagrams as metric spaces in Computational Topology, the key step from a combinatorial object to a statistical one. In 2011 Mileyko, Mukherjee, and Harer introduced Wasserstein metrics and proved their properties - completeness, separability, Frechet barycentre existence. In 2017 Hofer et al. demonstrated the first differentiable neural-network training with Wasserstein on persistence diagrams. Carriere and colleagues in 2017 also introduced sliced Wasserstein, a fast approximation that opened the door to TDA integration in large-scale ML systems.

Persistence Diagrams as a Metric Space

Two chemotherapy drugs are tested on a tumor-cell culture. Both produce persistence diagrams with the same count of long features in H_0 (colony components) and H_1 (intercellular communication cycles). But a physician must know: are these drug effects genuinely similar, or does one act in a fundamentally different way? The answer demands a quantitative metric on the space of diagrams - a number with distance properties, not a visual impression.

A persistence diagram D is a multiset of points (b_i, d_i) in R^2 with b_i smaller than d_i. Each point is a feature with birth b and death d. To turn the space of all such diagrams into a metric space, one augments each D by the full diagonal Delta = (x,x) with infinite multiplicity. This yields a space of bijections between augmented diagrams and a formal apparatus for distance definitions.

Metric axioms hold for both bottleneck and Wasserstein: nonnegativity follows from inf definitions, symmetry follows from bijection invertibility, the triangle inequality follows from composing optimal matchings.

Visual intuition

Two diagrams on one picture

On the birth-death plane, points from two diagrams are plotted in different colours. An optimal matching is drawn as segments between matched points. Unmatched points are connected to their diagonal projections by short vertical segments. The longest segment in the picture is the bottleneck distance.

Each class in a dataset has a mean persistence diagram (Frechet or Karcher barycentre). The distance between class barycentres is a topological measure of separability. If bottleneck distance between barycentres is smaller than within-class spread, topology gives no classification signal.

The space of diagrams with bottleneck distance is not Euclidean: the barycentre is not unique, the mean may fail to exist. This complicates statistical methods and demands special algorithms (for example, Frechet mean via gradient descent).

Why does matching an off-diagonal point (b,d) to the diagonal in L_infinity cost (d-b)/2?

L_infinity distance from (b,d) to (m,m) is max(|b-m|, |d-m|). Minimisation over m yields m = (b+d)/2 and value (d-b)/2. The L_infinity projection differs from the Euclidean one but gives the same formula for the diagonal-perpendicular.

Bottleneck Distance: Formal Definition and Algorithm

Bottleneck distance is the infimum over bijections of the supremum of pairwise distances. The clever inf-sup combination makes it robust to bijection choice but extremely sensitive to one bad match: a single high-distance pair lifts the entire metric.

Hopcroft-Karp on a bipartite graph with n vertices and m edges runs in O(m * sqrt(n)). Binary search over epsilon requires log(N) iterations where N is the number of distinct distances. The worst-case complexity for n points is O(n^{2.5} log n), faster in practice thanks to spatial data structures.

Industrial anomaly detection

Casting quality control

On an aluminium casting line, computed tomography is performed on every hundredth part. From the 3D volume an H_1 persistence diagram is extracted (cavities in the metal). A reference diagram of defect-free parts is stored. Bottleneck distance between new and reference above the threshold 0.5 mm signals a large cavity and automatically rejects the part. The metric reacts only to the largest defect - this is its strength in anomaly detection.

k-Nearest Neighbors on persistence diagrams with bottleneck distance gives a robust baseline classifier. No vectorization required, no differentiability needed. Used as a benchmark for more complex TDA models.

When persistence is close to zero, bottleneck distance becomes sensitive to numerical noise: trivial points may spuriously inflate the distance. Practical fix - prefilter by minimum persistence (e.g. drop features with persistence under 0.01).

In the Gudhi library bottleneck is implemented via the linear algorithm of Kerber et al. (2017) with practical complexity O(n * log^2 n) on typical inputs. This makes bottleneck distance practical for diagrams with tens of thousands of points - critical for large-scale bioinformatics.

What is the computational complexity of exact bottleneck distance for two diagrams with n points each?

Bottleneck distance reduces to bottleneck matching: binary search over a threshold plus existence check via O(m * sqrt(n)) bipartite matching. With dense graphs (n^2 edges) and log(n) iterations, this gives O(n^{2.5} log n). Modern algorithms (Kerber) achieve near-linear time in practice.

p-Wasserstein Distance and Sliced Variants

Bottleneck distance answers how bad is the worst match. Wasserstein metrics answer a deeper question: how different are two diagrams overall. Wasserstein distance is the minimum cost of transporting mass from one diagram to the other, where cost is a sum (or L_p sum) of individual displacements.

Mileyko, Mukherjee, Harer (2011) formally introduced Wasserstein metrics on persistence diagrams and proved their properties. Carriere et al. (2017) introduced sliced Wasserstein - a fast, differentiable approximation.

PersLay in neural networks

Topological layer

The PersLay architecture takes a persistence diagram as a layer input. Each diagram point passes through a learned embedding. Sliced Wasserstein loss between a batch of diagrams and a learned template provides gradients on embedding parameters. The network thus learns to represent topological features in a format optimal for subsequent fully-connected layers.

Empirical observations: p=1 (EMD) works best on texture tasks, p=2 on shapes and medical imaging, p=infinity (bottleneck) on anomaly detection. p=2 is most popular in modern architectures thanks to favourable gradient behaviour.

Sliced Wasserstein is an approximation, not the full W_p. On simple distributions the error is small; on complex ones it can be substantial. In critical tasks (medical diagnosis) prefer exact Wasserstein even when slower.

This limit explains why bottleneck distance equals W_infinity: as p grows, the sum of p-th powers is dominated by the largest term. In practice W_p for large p (e.g. p=50) gives a result close to bottleneck but with better gradient stability for optimisation.

Why is sliced Wasserstein preferred in neural networks instead of full Wasserstein?

Training neural networks needs fast gradients. Sliced Wasserstein is implemented via projections plus 1D sorting, all differentiable and parallelisable on GPU. Full Wasserstein requires solving an optimal transport problem at every forward pass.

Diagram Vectorization and Computational Aspects

Direct computation of bottleneck or Wasserstein between large diagrams is expensive. Production TDA systems rarely work with pairwise distances directly. Instead diagrams are vectorized - converted to fixed-size representations compatible with any ML library. This yields huge speed-ups and GPU compatibility.

Three popular vectorization techniques: (1) persistence image - a grid over the birth-death plane, weighted by persistence, smoothed with Gaussian, (2) Betti curve - the function beta_k(epsilon) over the filtration level, (3) persistence entropy - a single scalar feature capturing lifetime distribution dispersion.

Pipeline for 10000 proteins

Topological structure classification

10000 PDB protein structures, each converted into a cloud of 1000-5000 atoms. Vietoris-Rips persistence is computed for each structure with the GPU-accelerated library Ripser++. Each diagram is vectorized into a 50x50 = 2500-feature persistence image. XGBoost is trained to classify structural classes (alpha-helix, beta-sheet, mixed). Test accuracy reaches about 92 percent. Without vectorization, pairwise Wasserstein would require 10000^2 / 2 = 50 million computations.

A persistence image is a fixed-size greyscale picture. A convolutional neural network accepts it as input out of the box. ResNet18-style architectures on persistence images achieve competitive results on medical image classification with minimal engineering.

Every vectorization loses structure. Persistence image smears nearby points - fine features may merge. Betti curve ignores joint information of points. Persistence entropy collapses everything to one number. Choosing a vectorization is an engineering decision driven by the task.

For sparse diagrams (a few dozen points) direct metrics like bottleneck or Wasserstein are more practical. For dense diagrams (thousands of points) - vectorization plus classical ML algorithms. For differentiable pipelines - sliced Wasserstein. The choice depends on size, speed requirement, and GPU availability.

Why are persistence images more convenient than direct metrics for classification with 10000+ samples?

Pairwise bottleneck on N samples requires N^2 / 2 computations, each O(n^{2.5}). This is unfeasible above a few thousand samples. Vectorization makes features independent per sample: compute N vectors, then plug into any sklearn/pytorch model.

Where the metrics line continues

Metrics on diagrams unlock TDA in every ML pipeline.

Mapper advanced — Related topic
TDA for time series — Related topic
TDA in neural networks — Related topic
Witness complexes — Related topic

Key ideas

Persistence diagrams form a metric space when augmented with the diagonal
Bottleneck distance tracks the worst match, Wasserstein tracks the total
Sliced Wasserstein is differentiable and embeddable in neural networks
Vectorization (persistence image, Betti curve) makes diagrams ML-friendly
Choice of metric and representation depends on data size and task

Вопросы для размышления

When are direct metrics preferable, and when is vectorization?
Why is sliced Wasserstein differentiability vital for modern architectures?
How would TDA status differ without Wasserstein metrics?

Связанные уроки

top-32 — stability theorem introduces both metrics
top-28 — persistence modules for interleaving
top-31 — shape of data as comparison object
top-37 — TDA losses in neural networks via Wasserstein
mt-05