Topological Data Analysis (TDA)
Цели урока
- Construct persistent homology from a Vietoris-Rips filtration
- Read persistence diagrams and barcodes as topological fingerprints of data
- Apply the stability theorem to understand TDA's robustness to noise
- Understand how persistence diagrams are integrated into machine learning pipelines
Предварительные знания
- Homology groups
- Simplicial complexes
- Metric spaces
How can a 'hole' in a million-point cloud be detected algorithmically? Persistent homology tracks holes across scales, and the stability theorem guarantees the result is meaningful even with noise.
- Bioinformatics: Ayasdi discovered a new cancer subtype c5 through persistent diagrams of genomic data
- Neuroscience: topological structure of neural firing patterns analyzed via persistent homology
- Materials science: porosity and defects in metallic glasses described by H0 and H1 of atomic positions
- Drug discovery: persistent homology of molecular graphs identifies binding site topology
From Homology to Data Analysis
Herbert Edelsbrunner and John Harer coined the term persistent homology in 2002. Afra Zomorodian and Gunnar Carlsson proved the decomposition theorem for persistence modules in 2005. Cohen-Steiner, Edelsbrunner and Harer proved the stability theorem in 2007. From 2010 onward, TDA became an industrial tool: Ripser, Gudhi, Giotto-TDA provide production-ready implementations.
TDA and Persistent Homology
Ayasdi used persistent homology in 2008 to analyze genomic data from 271 breast cancer patients, discovering a previously unknown subtype c5 with improved prognosis. The Ripser library (2016) computes persistent homology for 10,000 points in seconds. Betti numbers now extract topology from data at scale.
The Wassertein distance between persistence diagrams is stable: small data perturbations give close diagrams (Cohen-Steiner stability theorem, 2007). This makes TDA applicable to noisy real-world data - the topology is robust where the signal is genuine.
What does a long-lived point in the H1 persistence diagram represent?
Correct. A point (b,d) in PD_1 with large persistence d-b means a loop exists from scale b to d - a significant topological feature stable under noise.
Barcodes and Stability
A barcode is the 'DNA fingerprint' of the topological structure of data. Long bars are real holes and loops. Short bars are noise. Structural bioinformatics uses barcodes to compare protein structures: the Wasserstein distance between barcodes is a similarity metric robust to conformational changes.
Protein structure comparison via persistence
Application in structural bioinformatics
Two proteins with similar function but different conformations have similar 3D atomic structures. Computing PD_1 (loops) from atomic coordinates and measuring the Wasserstein distance between diagrams gives a topology-based similarity score. This is stable under small conformational changes (small perturbation = small diagram change).
What does the stability theorem guarantee for persistent homology?
Correct: d_B(PD(f), PD(g)) <= ||f-g||_inf - the key inequality for applicability to noisy real data.
Persistent Topology in Neural Networks
In 2019, a Google Brain team integrated persistent homology into graph neural networks for molecular recognition. Persistence diagrams have variable sizes - the main obstacle to using them as fixed-size neural network inputs. PersLay and persistence images solve this by mapping diagrams to fixed-length vectors.
The main obstacle to using persistence diagrams in neural networks: the diagram is not a fixed-size object. The number of points changes when data changes. PersLay, persistence images, and DeepSets architectures all address this by summing or integrating over points.
Why is it non-trivial to use persistence diagrams directly as neural network inputs?
Correct. PersLay, persistence images, and other methods solve this problem by mapping variable-size diagrams to fixed-length vectors.
Connections to other topics
TDA bridges algebraic topology, computational geometry, and machine learning.
- Machine learning — Related topic
- Bioinformatics — Related topic
- Neuroscience — Related topic
- Materials science — Related topic
Итоги
- Rips filtration: K_eps = simplicial complex connecting points within distance eps
- Persistence diagram PD_k: multiset of (birth, death) pairs for k-dimensional homological classes
- Barcode B_k = {[eps_b, eps_d)}: equivalent interval representation of the diagram
- Stability theorem: d_B(PD(f), PD(g)) <= ||f-g||_inf - robustness to noise
- PersLay: differentiable layer f(D) = sum w(b,d)*phi(b,d) for embedding diagrams in neural nets
- Persistence image: Gaussian-smoothed diagram in a fixed-size 2D grid
Вопросы для размышления
- Why are persistence diagrams stable to small noise but not to large deformations of the data?
- How does the choice of filtration (Rips vs Cech vs Alpha) affect the persistence diagram?
- Which topological features do H0, H1, and H2 each capture best, and when should all three be examined?
Связанные уроки
- top-23 — Homology groups are the building blocks of persistence diagrams
- top-28 — Advanced persistent homology builds on TDA fundamentals
- top-29 — The Mapper algorithm uses filtrations and clustering from TDA
- top-26-k-theory
- calc-25-green-theorem