Topological Data Analysis (TDA)

Цели урока

Construct persistent homology from a Vietoris-Rips filtration
Read persistence diagrams and barcodes as topological fingerprints of data
Apply the stability theorem to understand TDA's robustness to noise
Understand how persistence diagrams are integrated into machine learning pipelines

Предварительные знания

Homology groups
Simplicial complexes
Metric spaces

Homology Groups

How can a 'hole' in a million-point cloud be detected algorithmically? Persistent homology tracks holes across scales, and the stability theorem guarantees the result is meaningful even with noise.

Bioinformatics: Ayasdi discovered a new cancer subtype c5 through persistent diagrams of genomic data
Neuroscience: topological structure of neural firing patterns analyzed via persistent homology
Materials science: porosity and defects in metallic glasses described by H0 and H1 of atomic positions
Drug discovery: persistent homology of molecular graphs identifies binding site topology

From Homology to Data Analysis

Herbert Edelsbrunner and John Harer coined the term persistent homology in 2002. Afra Zomorodian and Gunnar Carlsson proved the decomposition theorem for persistence modules in 2005. Cohen-Steiner, Edelsbrunner and Harer proved the stability theorem in 2007. From 2010 onward, TDA became an industrial tool: Ripser, Gudhi, Giotto-TDA provide production-ready implementations.

TDA and Persistent Homology

Ayasdi used persistent homology in 2008 to analyze genomic data from 271 breast cancer patients, discovering a previously unknown subtype c5 with improved prognosis. The Ripser library (2016) computes persistent homology for 10,000 points in seconds. Betti numbers now extract topology from data at scale.

The Wassertein distance between persistence diagrams is stable: small data perturbations give close diagrams (Cohen-Steiner stability theorem, 2007). This makes TDA applicable to noisy real-world data - the topology is robust where the signal is genuine.

What does a long-lived point in the H1 persistence diagram represent?

Correct. A point (b,d) in PD_1 with large persistence d-b means a loop exists from scale b to d - a significant topological feature stable under noise.

Barcodes and Stability

A barcode is the 'DNA fingerprint' of the topological structure of data. Long bars are real holes and loops. Short bars are noise. Structural bioinformatics uses barcodes to compare protein structures: the Wasserstein distance between barcodes is a similarity metric robust to conformational changes.

Protein structure comparison via persistence

Application in structural bioinformatics

Two proteins with similar function but different conformations have similar 3D atomic structures. Computing PD_1 (loops) from atomic coordinates and measuring the Wasserstein distance between diagrams gives a topology-based similarity score. This is stable under small conformational changes (small perturbation = small diagram change).

What does the stability theorem guarantee for persistent homology?

Correct: d_B(PD(f), PD(g)) <= ||f-g||_inf - the key inequality for applicability to noisy real data.

Persistent Topology in Neural Networks

In 2019, a Google Brain team integrated persistent homology into graph neural networks for molecular recognition. Persistence diagrams have variable sizes - the main obstacle to using them as fixed-size neural network inputs. PersLay and persistence images solve this by mapping diagrams to fixed-length vectors.

The main obstacle to using persistence diagrams in neural networks: the diagram is not a fixed-size object. The number of points changes when data changes. PersLay, persistence images, and DeepSets architectures all address this by summing or integrating over points.

Why is it non-trivial to use persistence diagrams directly as neural network inputs?

Correct. PersLay, persistence images, and other methods solve this problem by mapping variable-size diagrams to fixed-length vectors.

Connections to other topics

TDA bridges algebraic topology, computational geometry, and machine learning.

Machine learning — Related topic
Bioinformatics — Related topic
Neuroscience — Related topic
Materials science — Related topic

Итоги

Rips filtration: K_eps = simplicial complex connecting points within distance eps
Persistence diagram PD_k: multiset of (birth, death) pairs for k-dimensional homological classes
Barcode B_k = {[eps_b, eps_d)}: equivalent interval representation of the diagram
Stability theorem: d_B(PD(f), PD(g)) <= ||f-g||_inf - robustness to noise
PersLay: differentiable layer f(D) = sum w(b,d)*phi(b,d) for embedding diagrams in neural nets
Persistence image: Gaussian-smoothed diagram in a fixed-size 2D grid

Вопросы для размышления

Why are persistence diagrams stable to small noise but not to large deformations of the data?
How does the choice of filtration (Rips vs Cech vs Alpha) affect the persistence diagram?
Which topological features do H0, H1, and H2 each capture best, and when should all three be examined?

Связанные уроки

top-23 — Homology groups are the building blocks of persistence diagrams
top-28 — Advanced persistent homology builds on TDA fundamentals
top-29 — The Mapper algorithm uses filtrations and clustering from TDA
top-26-k-theory
calc-25-green-theorem