Topology

The shape of data: why topology sees what statistics misses

**Thirteen datasets. All share the same mean, variance, and correlation down to the second decimal place.** One is shaped like a dinosaur. Another is a star. A third is a circle. Summary statistics cannot tell them apart. Sklearn produces the same numbers. Pandas shows the same tables. The eye separates them in half a second. This example, the Datasaurus dozen published by Justin Matejka and George Fitzmaurice in 2017, is not a visual trick. It is a diagnosis: classical statistics is blind when the signal lives in the **shape**.

Topological data analysis (TDA) is the discipline that counts shape. Not as a bag of pixels, but as a set of invariants: how many pieces, how many holes, how many voids. The same invariants work in R^2, R^3, and R^512. The same invariants find cancer subtypes in genomic data, cyclic patterns in neural network activations, and anomalies in IoT telemetry. One formalism, many domains.

Lesson objectives

  • Understand why statistics and topology look at different sides of data
  • Formalize 'shape' through Betti numbers $b_0, b_1, b_2$
  • Build a Vietoris-Rips complex from a point cloud
  • See where TDA is already in production: genomics, neural networks, time series
  • Feel the difference between 'cluster' (statistics) and 'connected component' (topology)

Prerequisites

  • Basic set theory: $A \cup B$, $A \cap B$, subsets
  • Metric: Euclidean distance $\|x - y\|$ in $\mathbb{R}^n$
  • Minimum linear algebra: a point as a vector, graphs as pairs of nodes

A clustering algorithm sees a cloud of points and asks: 'how many groups should I split it into?' Topology asks something different: 'is there a hole in this cloud?' The difference is fundamental. K-means on ring-shaped data slices the ring into wedges and never notices that the data is one whole with a void in the middle. Persistent homology says in one pass: $b_0 = 1$ (one component), $b_1 = 1$ (one hole). A ring. No false splits.

  • **Cancer subtypes (Nicolau, Levine, and Carlsson 2011, PNAS)**: TDA on 295 breast cancer expression profiles uncovered the c-MYB+ subtype, which hierarchical clustering had missed. That subtype has 100% 10-year survival, a therapeutically critical finding.
  • **Neural network activations**: Naitzat et al. 2020 showed that deep networks 'untangle' the topology of data layer by layer. The Betti numbers of the input cloud drop monotonically toward the output, a formal measure of 'representation simplification'.
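  • **Time series in finance**: Gidea and Katz 2018 reported that norms of persistence landscapes built from stock index returns rose markedly ahead of the 2000 and 2008 crashes, an early-warning signal of a kind that ordinary indicators do not provide.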
  • **Drug discovery**: Ayasdi (Carlsson's spinoff, raised USD 106M VC) used TDA to classify protein conformations. The product was acquired by Symphony AI.
  • **Sensor networks**: TDA detects 'coverage holes' in sensor fields without knowing coordinates, using only the signal correlation matrix.
  • **IoT anomalies**: topological descriptors stay stable under noisy sensor data where z-scores throw false positives on every other sample.

From pure math to production in 30 years

Topology as the science of shape took form in Poincare's 1895 papers. For over a century it lived in pure mathematics, describing manifolds and invariants. The turning point was 2009, when Carlsson, a Stanford professor, published 'Topology and data' in the Bulletin of the AMS, a manifesto stating that persistent homology algorithms were ready for real data. Edelsbrunner and Harer's textbook 'Computational Topology', the first systematic reference, followed in 2010. Carlsson had already co-founded Ayasdi in 2008, a Stanford spinoff that raised USD 106M in venture funding and was sold to Symphony AI in 2019. Python libraries followed: gudhi, ripser, and giotto-tda. Classical mathematics became pip-installable in seconds.

Concept 1: statistics is blind, topology sees

Anscombe's quartet, 1973. Four datasets of 11 points each. Mean of $x$, mean of $y$, variance of $x$, variance of $y$, correlation $\rho$, regression line: **all match to two or three decimal places**. Yet the plots look radically different: a cloud, a parabola, a line with one outlier, a vertical stack with one outlier in $x$. Anscombe constructed them to make a point to students: numbers can mislead, so look at the plot. Forty-four years later Justin Matejka and George Fitzmaurice pushed the idea to the extreme: the Datasaurus dozen, 13 datasets with identical statistics, one of which is shaped like a dinosaur.
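The claim is easy to check by hand. The sketch below hard-codes the quartet's published values and computes the summary statistics with the standard library only. The helper names (`sample_variance`, `pearson`, `summary`) are illustrative and not taken from any library.

```python
from math import sqrt

# Anscombe's quartet (Anscombe 1973): four datasets of 11 points each.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def mean(v):
    return sum(v) / len(v)

def sample_variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / (len(v) - 1)

def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def summary(x, y):
    # Rounded the way a summary table would show them.
    return (round(mean(x), 2), round(mean(y), 2),
            round(sample_variance(x), 1), round(sample_variance(y), 1),
            round(pearson(x, y), 2))

SUMMARIES = {name: summary(x, y) for name, (x, y) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, stats)
```

All four rows print the same tuple, although the underlying plots have nothing in common.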

**A topological invariant** is a number (or structure) that does not change under continuous deformation. Stretching, bending, and crumpling preserve the invariant. Tearing or gluing changes it. Betti numbers $b_0, b_1, b_2$ are exactly such invariants: components, cycles, voids. They are the ranks of the **homology** groups, which are computed with linear algebra over chains of simplices.

Concept 2: what is 'shape', Betti numbers

Shape is a fuzzy word. Topology pins it down through **holes of different dimensions**. The surface of a sphere and the surface of a torus are both 2-dimensional and compact. The difference: the torus has a cycle that cannot be contracted to a point. The sphere has no such cycle. That single fact separates the sphere from the doughnut.

**Betti numbers: a catalog of holes.** $b_k$ counts the independent $k$-dimensional holes in a space $X$:

  • $b_0$ counts connected components
  • $b_1$ counts one-dimensional cycles (loops)
  • $b_2$ counts two-dimensional voids (cavities)
  • $b_k$, $k \geq 3$ counts higher-dimensional holes

**Canonical examples:**

  • point: $b_0 = 1$, the rest are 0
  • circle $S^1$: $b_0 = 1, b_1 = 1$
  • sphere $S^2$: $b_0 = 1, b_1 = 0, b_2 = 1$
  • torus $T^2$: $b_0 = 1, b_1 = 2, b_2 = 1$
  • letter B: $b_0 = 1, b_1 = 2$

The Euler characteristic is the alternating sum:

$$\chi(X) = b_0 - b_1 + b_2 - \dots = \sum_{k \geq 0} (-1)^k b_k$$

For the sphere $\chi = 1 - 0 + 1 = 2$, for the torus $\chi = 1 - 2 + 1 = 0$.

**Homology** is the tool that turns 'holes' into linear algebra. Simplices (points, edges, triangles, tetrahedra) form chains. The boundary operator $\partial$ takes a chain to its boundary. A cycle is something with zero boundary. A boundary is something that is the boundary of a higher-dimensional thing. Homology $H_k = \ker \partial_k / \operatorname{im} \partial_{k+1}$, the quotient 'cycles by boundaries'. The dimension of $H_k$ equals $b_k$. Details are in the `tda-03-homology` lesson.
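A worked example may make the quotient concrete. Take the hollow triangle on vertices $a, b, c$ with edges $ab, bc, ac$ and no 2-simplex. The orientation convention $\partial_1(xy) = y - x$ is an assumption of this sketch. With rows $a, b, c$ and columns $ab, bc, ac$:

$$\partial_1 = \begin{pmatrix} -1 & 0 & -1 \\ 1 & -1 & 0 \\ 0 & 1 & 1 \end{pmatrix}, \qquad \operatorname{rank} \partial_1 = 2$$

$$\dim \ker \partial_1 = 3 - 2 = 1, \qquad \ker \partial_1 = \operatorname{span}\{ab + bc - ac\}, \qquad \operatorname{im} \partial_2 = 0$$

$$b_1 = \dim \ker \partial_1 - \dim \operatorname{im} \partial_2 = 1 - 0 = 1, \qquad b_0 = 3 - \operatorname{rank} \partial_1 = 1$$

So the hollow triangle has the Betti numbers of a circle. Adding the 2-simplex $abc$ gives $\partial_2(abc) = ab + bc - ac$, the cycle becomes a boundary, and $b_1$ drops to $0$.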

Concept 3: Vietoris-Rips, from cloud to complex

A point cloud is not a topological object. Points are discrete and carry no connectivity or cycles in the proper sense. The magic of TDA is to build a **simplicial complex**, a discrete model of a space, and to compute its homology. The construction is the Vietoris-Rips complex. The idea is brutally simple: connect points if they are close enough.

**Vietoris-Rips complex $\mathrm{VR}_\varepsilon(X)$:** Given a finite point set $X$ and a threshold $\varepsilon > 0$, a $k$-simplex on points $\{x_0, \dots, x_k\}$ is included if and only if every **pairwise** distance is at most $\varepsilon$:

$$\|x_i - x_j\| \leq \varepsilon \quad \forall i, j$$

In other words: edges between every pair of points at distance $\leq \varepsilon$, triangles when all three pairwise edges already exist, tetrahedra when all six do, and so on.
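The definition translates almost literally into code. The following is a minimal brute-force sketch, not a production algorithm: it enumerates every subset of points, so it is only suitable for tiny inputs. The function name `vietoris_rips` and the `max_dim` parameter are illustrative choices.

```python
from itertools import combinations
from math import dist

def vietoris_rips(points, eps, max_dim=2):
    """Return {k: list of k-simplices}, each simplex a tuple of point indices."""
    n = len(points)
    # The 1-skeleton: all pairs at distance <= eps.
    close = {(i, j) for i, j in combinations(range(n), 2)
             if dist(points[i], points[j]) <= eps}
    simplices = {0: [(i,) for i in range(n)]}
    for k in range(1, max_dim + 1):
        # A k-simplex is kept only if every pair of its vertices is close.
        simplices[k] = [s for s in combinations(range(n), k + 1)
                        if all(pair in close for pair in combinations(s, 2))]
    return simplices

# Right triangle with sides 1, 1, sqrt(2).
TRIANGLE = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
at_1 = vietoris_rips(TRIANGLE, eps=1.0)
at_15 = vietoris_rips(TRIANGLE, eps=1.5)
print(at_1[1], at_1[2])    # two short edges, no triangle yet
print(at_15[1], at_15[2])  # all three edges, so the triangle is filled in
```

At `eps=1.0` the hypotenuse is still missing, so no 2-simplex exists. Once `eps` passes $\sqrt{2}$ the triangle appears automatically.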

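**Vietoris-Rips versus Cech.** VR uses only pairwise distances: cheap to compute, easy to encode. The Cech complex $\mathrm{C}_\varepsilon$ includes a simplex only when the balls $B(x_i, \varepsilon / 2)$ around all of its vertices share a common point. It is topologically 'more correct' (by the Nerve theorem it has the homotopy type of the union of balls) but more expensive. In practice the choice is almost always VR. Note: $\mathrm{VR}_\varepsilon \supseteq \mathrm{C}_{\varepsilon}$, since balls that share a common point also intersect pairwise. VR is greedier.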
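The gap between the two complexes shows up already on three points. The sketch below assumes planar points and uses the fact that closed balls of radius $\varepsilon / 2$ share a common point exactly when the points fit inside one disk of that radius. The helpers `min_enclosing_radius`, `in_rips`, and `in_cech` are illustrative names.

```python
from itertools import combinations
from math import dist, sqrt

def min_enclosing_radius(a, b, c):
    """Radius of the smallest disk containing three planar points."""
    x, y, z = sorted((dist(a, b), dist(b, c), dist(a, c)))  # z is the longest side
    if z * z >= x * x + y * y:
        # Right, obtuse, or degenerate triangle: the longest side is a diameter.
        return z / 2
    s = (x + y + z) / 2                       # Heron's formula for the area
    area = sqrt(s * (s - x) * (s - y) * (s - z))
    return x * y * z / (4 * area)             # circumradius of an acute triangle

def in_rips(tri, eps):
    # VR condition: every pairwise distance is at most eps.
    return all(dist(p, q) <= eps for p, q in combinations(tri, 2))

def in_cech(tri, eps):
    # Cech condition: the three balls of radius eps / 2 share a common point.
    return min_enclosing_radius(*tri) <= eps / 2

# Equilateral triangle with side 1; its circumradius is 1 / sqrt(3), about 0.577.
EQUILATERAL = [(0.0, 0.0), (1.0, 0.0), (0.5, sqrt(3) / 2)]
print(in_rips(EQUILATERAL, 1.1), in_cech(EQUILATERAL, 1.1))
print(in_rips(EQUILATERAL, 1.2), in_cech(EQUILATERAL, 1.2))
```

At `eps = 1.1` the triangle is in VR but not in Cech, because balls of radius 0.55 meet pairwise without a common point. At `eps = 1.2` both complexes contain it.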

Where TDA is already in production

TDA left the academic niche when three factors lined up: (1) fast persistent homology algorithms via matrix reduction, (2) Python libraries with decent DX, (3) the realization that topological descriptors can flow into any ML pipeline as ordinary feature vectors.

**TDA is not a silver bullet.** On very noisy or very small data (n < 50) persistence diagrams become unstable. The size of the point cloud drives cost: $\mathrm{VR}_\varepsilon$ on 10k points in R^100 is still computable, while 100k points already call for witness complexes. And the main caveat: TDA tells you WHAT is in the data, not always WHY. Interpretation stays with the domain expert.
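A back-of-the-envelope count shows why size matters. A $k$-simplex has $k + 1$ vertices, so a cloud of $n$ points has at most $\binom{n}{k+1}$ of them. This is the worst case, reached once $\varepsilon$ exceeds the diameter of the cloud. The helper name `max_simplices` is illustrative.

```python
from math import comb

def max_simplices(n_points, k):
    """Upper bound on the number of k-simplices: choose k + 1 vertices out of n."""
    return comb(n_points, k + 1)

for k in (1, 2, 3):
    print(f"k = {k}: at most {max_simplices(10_000, k):.2e} simplices on 10k points")
```

Edges are in the tens of millions, triangles in the hundreds of billions. Practical implementations therefore cap $\varepsilon$, cap the homology dimension, or subsample the cloud.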

Statistics is blind, topology sees

Datasaurus dozen: 13 datasets with identical mean, variance, and correlation, but radically different shapes. Classical statistics cannot tell them apart. Persistent homology says in one pass: the circle has b_1 = 1 (a hole), the dino has b_1 = 0.

Two datasets share the same mean, variance, and correlation. What does that tell us?

Moments are aggregate statistics, so the shape is lost. Betti numbers b_0 (components), b_1 (loops), b_2 (voids) are invariant under continuous deformations, and they tell circle from dino from bullseye even when the statistical moments line up.

Betti numbers: a catalog of holes

b_0 counts connected components, b_1 counts one-dimensional cycles (loops), b_2 counts two-dimensional voids. Torus: b_0 = 1, b_1 = 2, b_2 = 1. Circle S^1: b_0 = 1, b_1 = 1. Sphere S^2: b_0 = 1, b_1 = 0, b_2 = 1.

The digit 8 (figure-eight) as a one-dimensional figure. What are its Betti numbers?

One connected component (b_0 = 1), two independent loops (b_1 = 2). Same as the letter B. A circle has b_1 = 1, a figure-eight has b_1 = 2.

Vietoris-Rips: from cloud to complex

VR_eps(X): a k-simplex on points {x_0, ..., x_k} is included if every pairwise distance is <= eps. As eps grows, topological features are born and die. Long-lived ones (high persistence) are signal, short-lived ones are noise.

Points {(0, 0), (1, 0), (0, 1)} form a right triangle with sides 1, 1, sqrt(2). At what minimum eps does the VR complex become a 2-simplex?

The 2-simplex condition: every pairwise distance <= eps. The largest pairwise distance is the hypotenuse = sqrt(2). Only at eps >= sqrt(2) are all three edges present and the triangle is filled in.

Datasaurus: same numbers, different shapes

An experiment that makes 'Pearson correlation' sound like wishful thinking

All 13 datasets have:

  • mean of $x$ ~ 54.26
  • mean of $y$ ~ 47.83
  • standard deviation of $x$ ~ 16.76
  • standard deviation of $y$ ~ 26.93
  • Pearson correlation ~ -0.06

Shapes:

  • dino: a literal dinosaur
  • circle: a ring with a void in the center
  • bullseye: two concentric rings
  • star: a star
  • x-shape: a cross
  • ...

KMeans with k = 3 on 'circle' and 'star' gives similar centroids. UMAP/t-SNE without specific tuning collapses the circle into an oval. And persistent homology says:

  • circle: b_0 = 1, b_1 = 1 (one component, one hole)
  • bullseye: b_0 = 2, b_1 = 2 (two separate rings, one hole each)
  • dino: b_0 = 1, b_1 = 0 (one component, no holes)

Two Betti numbers and the three shapes are unambiguously separated.

Two datasets have identical mean, variance, and correlation. Which statement is correct?

Exactly. Moments are aggregated statistics, shape is lost. TDA recovers information about connectivity and cycles.

'TDA is just fancy visualization'

TDA produces quantitative invariants that work in 1000-dimensional data where visualization is impossible.

Persistent homology returns a barcode, a set of birth and death intervals for topological features. That object supports ML: bottleneck distance gives a metric, persistence images give fixed-length vectors for a CNN. TDA does not replace the eyes, it extends them into dimensions the eyes cannot reach.
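For dimension 0 the barcode can be computed without any TDA library. Every point is born at $\varepsilon = 0$, and a component dies at the length of the edge that merges it into another one, which is a Kruskal-style union-find pass over the sorted edges. The following is a sketch under that assumption. The name `h0_barcode` and the toy cloud are illustrative.

```python
from itertools import combinations
from math import dist, inf

def h0_barcode(points):
    """0-dimensional Vietoris-Rips barcode: one (birth, death) bar per point."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    edges = sorted((dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    bars = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            # This edge merges two components, so one of them dies here.
            parent[ri] = rj
            bars.append((0.0, length))
    bars.append((0.0, inf))                 # the last component never dies
    return bars

# Two tight groups far apart: three short bars, one long bar, one infinite bar.
CLOUD = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0)]
BARS = h0_barcode(CLOUD)
persistence = sorted(death - birth for birth, death in BARS if death != inf)
print(persistence)
```

The three bars of length about 0.1 are within-group noise. The single bar of length about 7 is the signal that the cloud has two components over a wide range of scales.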

The shape of data in real systems

Where b_0, b_1, b_2 stop being abstractions

  • **Genomics, breast cancer (Carlsson 2011):** Patients as points in R^28000 (gene expression). Goal: identify subtypes (connected components): $b_0 \to$ clusters. Unexpectedly $b_1 = 1$ for the 'normal-like' region, a ring pattern. Reading: a continuous spectrum of intermediate states. One topological number, a whole biological hypothesis.
  • **Deep network activations (Naitzat 2020):** Betti numbers of each class's point cloud are measured after every layer. A class that is topologically complicated at the input (several components, loops) is reduced toward a single blob with $b_0 = 1$ and no higher holes by the last layers, which is what makes the classes separable. Decaying Betti numbers give a formal picture of 'training simplifies topology'.
  • **EEG in epilepsy (Petri 2014):** Channel correlation matrix as a metric. Persistent cycles $b_1$ spike sharply 30 seconds before a seizure. A clinical predictor without feature engineering.
  • **Sensor coverage:** A sensor network covers a field. A coverage gap is $b_1 \geq 1$ on the radius-intersection graph. TDA detects the vulnerability without GPS.

The digit 8 (figure-eight) as a one-dimensional figure. What are its Betti numbers?

Exactly. One component, two independent loops. Same for the letter 'B'. A circle has $b_1 = 1$ and a figure-eight has $b_1 = 2$. A trefoil knot, being an embedded circle, still has $b_1 = 1$: Betti numbers do not see knotting.

VR on a circle of 8 points

The cleanest example of a topological hole being born

8 points evenly placed on a circle of radius 1. Pairwise distances: neighbors about 0.77, next-to-neighbors about 1.41, three steps apart about 1.85, diametrically opposite 2.0.

  • eps = 0.5: 8 isolated points. $b_0 = 8, b_1 = 0$.
  • eps = 0.8: edges between neighbors. A ring! $b_0 = 1, b_1 = 1$. The hole appears.
  • eps = 1.0: nothing new, because next-to-neighbors are 1.41 apart. $b_0 = 1, b_1 = 1$.
  • eps = 1.5: edges between next-to-neighbors. Thin triangles on three consecutive points form a band around the ring, but the center stays empty. $b_0 = 1, b_1 = 1$.
  • eps = 1.9: edges between points three steps apart. Triangles now span the center and the hole fills. $b_0 = 1, b_1 = 0$.
  • eps = 2.0: the complete complex on 8 points. Everything contracts. $b_0 = 1, b_1 = 0$.

The hole $b_1 = 1$ lived on the interval eps in [0.77, 1.85). The length of the interval = 'persistence' ~ 1.08, a strong signal that the ring structure is real, not noise. Short intervals (length 0.05, say) are almost always noise.
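These Betti numbers can be checked by brute force. The sketch below builds the edges and triangles of the VR complex on the 8 points and gets $b_0$ and $b_1$ from the ranks of the boundary maps over GF(2). Working over GF(2) and storing each boundary as an integer bitmask are choices made for this sketch, and the names `rank_gf2` and `betti_0_1` are illustrative. For reference, next-to-neighbor chords have length about 1.41 and chords three steps apart about 1.85.

```python
from itertools import combinations
from math import cos, sin, pi, dist

def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    pivots = {}                              # highest set bit -> reduced row
    for row in rows:
        while row:
            high = row.bit_length() - 1
            if high not in pivots:
                pivots[high] = row
                break
            row ^= pivots[high]              # eliminate the leading bit
    return len(pivots)

def betti_0_1(points, eps):
    """(b_0, b_1) of the Vietoris-Rips complex at scale eps, by brute force."""
    n = len(points)
    edges = [e for e in combinations(range(n), 2)
             if dist(points[e[0]], points[e[1]]) <= eps]
    edge_index = {e: k for k, e in enumerate(edges)}
    triangles = [t for t in combinations(range(n), 3)
                 if all(p in edge_index for p in combinations(t, 2))]
    # Boundary of an edge = its two endpoints; of a triangle = its three edges.
    d1 = [(1 << i) | (1 << j) for i, j in edges]
    d2 = [sum(1 << edge_index[e] for e in combinations(t, 2)) for t in triangles]
    rank_d1, rank_d2 = rank_gf2(d1), rank_gf2(d2)
    b0 = n - rank_d1                         # vertices minus merges
    b1 = (len(edges) - rank_d1) - rank_d2    # cycles minus boundaries
    return b0, b1

CIRCLE = [(cos(2 * pi * k / 8), sin(2 * pi * k / 8)) for k in range(8)]
RESULTS = {eps: betti_0_1(CIRCLE, eps) for eps in (0.5, 0.8, 1.0, 1.5, 1.9, 2.1)}
for eps, (b0, b1) in RESULTS.items():
    print(f"eps = {eps}: b_0 = {b0}, b_1 = {b1}")
```

The loop is present at 0.8, 1.0, and 1.5 and is gone at 1.9, once the chords of length about 1.85 enter the complex. The last scale is set to 2.1 rather than 2.0 only to stay clear of floating-point rounding on the diameter.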

Points $\{(0, 0), (1, 0), (0, 1)\}$ form a right triangle with sides 1, 1, $\sqrt{2}$. At what minimum $\varepsilon$ does the VR complex become a 2-simplex (a filled triangle)?

Exactly. The 2-simplex condition is that every pairwise distance is $\leq \varepsilon$. The maximum is $\sqrt{2}$. Only at $\varepsilon \geq \sqrt{2}$ are all three edges present and the triangle is filled in.

  • **ripser** (Bauer 2021): C++ core, reported to be tens of times faster and far more memory-efficient than its predecessors. Handles VR persistence on clouds of thousands of points on a laptop.
  • **giotto-tda** (EPFL): sklearn-compatible library. `from gtda.homology import VietorisRipsPersistence`. Drop-in for feature extraction.
  • **gudhi** (INRIA): scientific reference, supports alpha complexes and cubical homology for images.
  • **TDA in PyTorch**: differentiable versions (Carriere et al. 2021), training neural networks with a persistence loss.
  • **Ayasdi -> Symphony AI**: the first commercial TDA product. B2B analytics for healthcare and enterprise.

Topology of neural network activations, not theory but a tool

How persistence is used to debug classifiers

Given: a ResNet trained on CIFAR-10. Accuracy is 94%, but something is off in the confusion matrix: 'cat' and 'dog' are confused twice as often as expected. Diagnostic pipeline:

  1. Push the validation set through the network and pull the activations after avgpool.
  2. Treat each class as a point cloud in R^512.
  3. Compute VR persistent homology of each cloud.
  4. Compare: b_1 = 2 for class 'cat', b_1 = 2 for 'dog', b_1 = 0 for 'airplane'.

Reading: 'cat' and 'dog' have two stable loops in latent space. A plausible interpretation is that the network encodes two modes (e.g. sitting/standing for cat), which then get confused through the geometry of the latent space. Practical takeaway: add data augmentation that separates the modes, or switch to a multi-prototype loss. Without TDA this is very hard to spot. A similar analysis appears in Carlsson and Gabrielsson 2018 'Topological Approaches to Deep Learning' and Naitzat et al. 2020. In practice at Ayasdi and at several MLOps teams it is part of the standard debug checklist.

Takeaways

  • **Datasaurus dozen**: identical moments, different shapes. A concrete demonstration that low-order summary statistics lose geometry.
  • **Betti numbers $b_0, b_1, b_2$**: a catalog of holes of dimensions 0, 1, 2. Invariants under continuous deformation.
  • **Vietoris-Rips $\mathrm{VR}_\varepsilon$**: a simplicial complex where a $k$-simplex exists when every pairwise distance is $\leq \varepsilon$.
  • **Persistence**: instead of one $\varepsilon$, the whole family at once. Track feature birth and death. Long intervals are signal, short ones are noise.
  • **TDA in production**: ripser, giotto-tda, gudhi. pip install and a working pipeline. Genomics, neural network interpretability, time series.
  • **Limitations**: small $n$, very high dimension, hard interpretation. TDA augments the ML stack, it does not replace it.

Questions for reflection

  • Why does k-means on a ring give an absurd result while persistence gets it right? Is it the loss function or the problem statement?
  • Datasaurus shows that first and second order moments fail. Which higher-order moments could tell dino from circle? And how ML-friendly are they?
  • If a neural network's activations have $b_1 = 2$ for class 'cat', what does that say about the data augmentation you should apply?
  • VR needs $O(n^{k+1})$ memory for $k$-simplices, since each one has $k + 1$ vertices. On 10k points with $k = 2$ (triangles) that is already on the order of $10^{11}$. Which strategy (witness, sparse) looks right for production?
  • How does a 'cluster' in DBSCAN differ from a 'connected component' in VR at a well-chosen $\varepsilon$? Are they equivalent or not?

What this lesson unlocks

The foundation of TDA, and the entry point into the next topics.

  • Simplicial complexes — Going deeper: formal definitions, Cech versus Rips, alpha complexes.
  • Homology and Betti — Linear algebra of holes: chains, boundaries, quotient spaces.
  • Persistent homology — The core object of the course: persistence as a function of $\varepsilon$, barcodes, diagrams.
  • Mapper algorithm — A topological skeleton of data through a glued cover.
  • TDA in neural networks — Topology of activations as an interpretability tool.

Related lessons

  • calc-01-sequences