TDA in Neural Networks: Activation Topology and Interpretability
Lesson objectives
- Understand how neural network activations form a point cloud for TDA
- Link Betti numbers of activations with class count and representation geometry
- Study loss landscape topology and its link to generalization
- Master decision boundary analysis via persistence
- Apply TDA for interpretability and pruning
Prerequisites
- Vietoris-Rips complex and persistent homology
- Basic understanding of neural networks (layers, activations, loss)
- Linear algebra and gradient descent
- Mapper algorithm for topology visualization
A network on MNIST hit 99.7%. What does it represent inside? TDA saw it: every class is a loop in activations. The black box turned gray.
- **Computer vision**: CNN interpretation via convolutional activation topology (Rieck et al. 2019)
- **NLP**: understanding BERT and GPT through Mapper on attention patterns (Rathore et al. 2019)
- **Model compression**: topological pruning of neural networks via filter persistence
- **Adversarial defense**: topological characterization of robustness regions of the input space
TDA meets Deep Learning
An early rigorous application of TDA to understanding neural networks was Rieck et al. (2019), "Neural Persistence: A Complexity Measure for Deep Neural Networks Using Algebraic Topology". The paper introduced neural persistence, a complexity measure computed from the network's weights. Gabrielsson and Carlsson (2019) introduced topological regularization for neural networks via differentiable persistence. The connection between loss landscape topology and generalization was explored in a series of papers in 2018-2020. Neural Collapse (Papyan et al., 2020) revealed a universal, highly symmetric structure of last-layer features. Today topological machine learning is a recognized subfield with dedicated workshops at NeurIPS and ICML.
Topology of Activation Spaces
A convolutional neural network for digit classification achieves 99.7% accuracy on MNIST. But what exactly does its internal representation look like? In 2017 clique community persistence was applied to network activation patterns. The result: the network learned to separate digit classes by creating topological holes, with each class corresponding to a loop in the activation manifold. When the network fails on adversarial examples, this topology breaks down. Activations have measurable, interpretable topology.
The activation of layer l on input x is a vector a_l(x) in R^{n_l}. For a dataset X = {x_1, ..., x_N}, the activations {a_l(x_1), ..., a_l(x_N)} form a point cloud in R^{n_l}. The standard TDA tools then apply to this cloud: Vietoris-Rips persistence, Mapper, and Betti numbers.
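As a minimal sketch of what "TDA applies" means for H_0, the code below computes the finite H_0 death times of a Vietoris-Rips filtration with a union-find pass over sorted pairwise distances (they coincide with minimum-spanning-tree edge lengths). The `activations` array is a hypothetical stand-in for real layer outputs: three synthetic clusters in R^4. The convention that an edge enters the filtration when the scale equals the pairwise distance is also an assumption.

```python
import math
import random

def h0_death_times(points):
    """Death times of the finite H_0 bars of a Vietoris-Rips filtration.

    Every point is born at scale 0. Two components merge when the scale
    reaches the shortest edge joining them, so the finite death times are
    the edge lengths of a minimum spanning tree (Kruskal's algorithm).
    """
    n = len(points)
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:              # this edge merges two components
            parent[ri] = rj
            deaths.append(length)
    return deaths                 # n - 1 values; one component never dies

def betti0_at_scale(points, scale):
    """Number of connected components of the Rips complex at `scale`."""
    return 1 + sum(d > scale for d in h0_death_times(points))

# Hypothetical stand-in for activations a_l(x): three tight clusters in R^4.
rng = random.Random(0)
centers = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 0, 0)]
activations = [
    tuple(c + rng.gauss(0, 0.3) for c in center)
    for center in centers for _ in range(30)
]

print(betti0_at_scale(activations, scale=3.0))  # 3
```

The quadratic number of edges makes this suitable only for small samples. For real activations one would normally subsample the dataset or use a dedicated persistent homology library.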
MNIST through TDA
What persistence diagrams of a deep network reveal
Consider a 5-layer network trained on MNIST. At the input layer the H_0 diagram shows one large component, because the pixel images are densely packed. At the final layer before softmax there are 10 long-lived H_0 components, one per digit. The intermediate layers show a gradual transition: the network learns to separate classes through a sequence of topological transformations. H_1 bars correspond to continuous within-class variations (digit slant, stroke thickness).
Comparing the persistence diagrams (PDs) of two layers via the bottleneck distance measures how different their topology is. If adjacent layers have nearly identical PDs, this suggests they are redundant, and one of them is a candidate for removal. This is a form of topological model compression.
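To make the comparison concrete, here is a brute-force sketch of the bottleneck distance for very small diagrams. Each point may be matched either to a point of the other diagram or to the diagonal, and the distance is the smallest achievable worst-case matching cost. The three diagrams (`layer3`, `layer4`, `layer5`) are invented illustrative values, infinite bars are assumed to have been dropped beforehand, and the factorial cost of trying all matchings means a real analysis would use an optimized library routine instead.

```python
from itertools import permutations

def bottleneck_distance(diagram_a, diagram_b):
    """Bottleneck distance between two small persistence diagrams.

    Each diagram is a list of (birth, death) pairs with finite death.
    All matchings are enumerated, so keep the diagrams to a few points.
    """
    n, m = len(diagram_a), len(diagram_b)
    # Pad both sides with diagonal slots (None) so any point can go to the diagonal.
    left = list(diagram_a) + [None] * m
    right = list(diagram_b) + [None] * n

    def cost(p, q):
        if p is None and q is None:
            return 0.0                   # diagonal matched to diagonal is free
        if p is None:
            return (q[1] - q[0]) / 2     # L-infinity distance to the diagonal
        if q is None:
            return (p[1] - p[0]) / 2
        return max(abs(p[0] - q[0]), abs(p[1] - q[1]))

    best = float("inf")
    for perm in permutations(range(n + m)):
        worst = max((cost(left[i], right[j]) for i, j in enumerate(perm)),
                    default=0.0)
        best = min(best, worst)
    return best

# Invented H_0 diagrams for three layers (illustrative values only).
layer3 = [(0.0, 1.0), (0.0, 0.2)]
layer4 = [(0.0, 1.1), (0.0, 0.25)]
layer5 = [(0.0, 3.0)]

print(round(bottleneck_distance(layer3, layer4), 3))  # 0.1
print(round(bottleneck_distance(layer3, layer5), 3))  # 1.5
```

In this toy case layers 3 and 4 are close, so one of them would be a pruning candidate under the redundancy argument, while layer 5 is topologically distinct.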
ML applications: representation diagnostics - comparing persistence diagrams shows whether different layers really learn different things; topological model compression - removing layers with similar topologies; detecting memorization vs generalization through PD complexity (models that memorize tend to have more complex PDs).
What does Betti number beta_0 of the final layer activation cloud reveal?
In a well-trained classification network, beta_0 at later layers approaches the class count: each class is a separate component.
Loss Landscape Topology
The loss function L: R^p -> R (where p is the parameter count) has topology. For neural networks, the loss landscape contains many local minima and saddle points. The sublevel sets {theta : L(theta) <= c} change topology as the level c varies: components appear at local minima and merge at saddle points. This topology is linked to network trainability and generalization.
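The way sublevel sets change with c can be traced explicitly in one dimension. The sketch below computes H_0 persistence pairs of the sublevel-set filtration for a function sampled on a regular grid: a component is born at each local minimum and dies when it merges into an older one (the elder rule). The sample values in `loss_slice` are invented and stand in for a 1D slice of a real loss.

```python
def sublevel_h0_persistence(values):
    """H_0 persistence pairs of the sublevel sets of a 1D sampled function.

    values[i] is the function on a regular grid; neighbours i and i + 1 are
    connected. Returns (birth, death) pairs; the global minimum never dies.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    parent = [None] * n      # None means "not yet in the sublevel set"
    birth = {}               # component root -> level at which it was born
    pairs = []

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in order:          # raise the level c through the sample values
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] is not None:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the younger component (larger birth) dies here.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if values[i] > birth[young]:   # skip zero-length bars
                    pairs.append((birth[young], values[i]))
                parent[young] = old
    pairs.append((birth[find(order[0])], float("inf")))
    return pairs

# Invented 1D loss slice with two minima (levels 1 and 0) separated by a bump at 2.
loss_slice = [3.0, 1.0, 2.0, 0.0, 4.0]
print(sublevel_h0_persistence(loss_slice))  # [(1.0, 2.0), (0.0, inf)]
```

The bar (1.0, 2.0) says the shallower minimum exists as a separate basin only for levels between 1 and 2. Its length is the height of the barrier one must climb to leave it.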
Neural Collapse (Papyan et al., 2020): at convergence, the last-layer class means collapse to a simplex equiangular tight frame (ETF), a symmetric configuration of K equal-norm vectors in which every pair has the same cosine, -1/(K-1). This is a very specific structure: maximum symmetry with minimum dimensionality, since K class means span only K-1 dimensions.
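The simplex ETF can be checked numerically. The sketch below builds it with the standard construction sqrt(K/(K-1)) * (I - (1/K) * 1 1^T) and verifies that all vectors have unit norm and all pairwise cosines equal -1/(K-1). The function name `simplex_etf` is just an illustrative label, and K = 10 is chosen to match the MNIST class count.

```python
import math

def simplex_etf(num_classes):
    """Rows are the K vectors of a simplex ETF, embedded in R^K.

    Construction: sqrt(K / (K - 1)) * (I - (1 / K) * ones), whose rows
    sum to zero and therefore span only K - 1 dimensions.
    """
    k = num_classes
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

K = 10
means = simplex_etf(K)
norms = [math.sqrt(dot(m, m)) for m in means]
cosines = [dot(means[i], means[j]) for i in range(K) for j in range(i + 1, K)]

print(all(abs(r - 1.0) < 1e-9 for r in norms))              # True
print(all(abs(c + 1.0 / (K - 1)) < 1e-9 for c in cosines))  # True
```

To test a trained network against this structure, one would replace `means` with the centred, normalized class means of its last-layer features and look at how far the cosines are from -1/(K-1).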
Flat vs sharp minima
Topological interpretation of generalization
Hochreiter and Schmidhuber (1997) and later Keskar et al. (2017): flat minima generalize better than sharp ones. Topologically, a flat minimum has a sublevel set with simple topology (a topological ball) near the minimum. A sharp minimum tends to sit in more complex topology (a narrow valley with many saddle points). A random slice through the loss landscape makes this visible: evaluate the loss on a 2D slice and compute its persistence.
Full persistence computation of the loss landscape for a billion-parameter network is impossible. Random slices (2D or 3D projections) or special parameterizations (linear interpolation between minima of different runs) are used instead.
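The linear-interpolation parameterization can be sketched on a toy problem. Below, `toy_loss` is a hypothetical two-parameter loss with minima at (-1, 0) and (1, 0), standing in for the minima found by two training runs. The code evaluates the loss along the straight segment between them and reports the barrier height, which is the level at which the two basins merge along this path.

```python
def toy_loss(theta):
    """Hypothetical two-parameter loss with minima at (-1, 0) and (1, 0)."""
    x, y = theta
    return (x * x - 1.0) ** 2 + y * y

def interpolation_profile(loss, theta_a, theta_b, steps=21):
    """Loss along the straight segment (1 - t) * theta_a + t * theta_b."""
    profile = []
    for k in range(steps):
        t = k / (steps - 1)
        theta = [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]
        profile.append((t, loss(theta)))
    return profile

def barrier_height(profile):
    """Highest loss on the path minus the larger of the two endpoint losses."""
    losses = [value for _, value in profile]
    return max(losses) - max(losses[0], losses[-1])

profile = interpolation_profile(toy_loss, [-1.0, 0.0], [1.0, 0.0])
print(barrier_height(profile))  # 1.0
```

A nonzero barrier on the straight segment does not rule out a curved low-loss path between the same two minima, so this number is an upper bound on the true merge level, not the merge level itself.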
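Simple topology of a minimum's neighborhood relates to better PAC-Bayes generalization bounds: a flat minimum is surrounded by a large region of low loss, so a broad posterior over parameters can be placed around it at a small KL cost relative to the prior. That lowers the complexity term in the PAC-Bayes bound, hence the better generalization guarantee.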
ML applications: architecture comparison through loss landscape topology - flatter (topologically simpler) minima tend to yield better generalization; mode connectivity (Garipov et al., 2018) - neural network minima are often connected by paths of nearly constant low loss, which is reflected in sublevel set topology; topology-aware optimizers penalizing complex Hessian configurations.
What is the link between loss minimum topology and generalization?
Flat minima have simple local topology and link to better PAC-Bayes bounds, consistent with empirical observations.
Betti Numbers of Decision Boundaries
For a trained binary classifier with decision boundary B = {x : f(x) = 0}, the topology of B reflects the complexity of the learned function. beta_0(B), the rank of H_0(B), counts the connected components of the boundary; it grows with the number of separate islands the classifier carves out. beta_1(B), the rank of H_1(B), counts independent loops in the boundary (cyclic decision patterns). High Betti numbers mean a complex, potentially overfit boundary.
A linear classifier has a hyperplane boundary - contractible, all Betti numbers are 0 except beta_0 = 1. A neural network on XOR has a more complex boundary with several components. Deep networks can create boundaries with thousands of components and loops.
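A rough way to see the difference between the linear and the XOR case in 2D is to sample the classifier on a grid and count connected regions of each predicted class. This is a proxy for boundary complexity, not a computation of the Betti numbers of B itself. The two classifiers below (`linear`, `xor_like`) are hypothetical closed-form rules rather than trained networks, and 4-connectivity on cell centres is an assumed discretization.

```python
def count_regions(predict, grid_size=40, low=-1.0, high=1.0):
    """Count 4-connected components of the predicted-positive and
    predicted-negative regions on a regular 2D grid."""
    step = (high - low) / grid_size
    # Cell centres, so that no sample lies exactly on x = 0 or y = 0.
    coords = [low + (k + 0.5) * step for k in range(grid_size)]
    labels = [[predict(x, y) > 0 for x in coords] for y in coords]
    seen = [[False] * grid_size for _ in range(grid_size)]
    counts = {True: 0, False: 0}

    for r in range(grid_size):
        for c in range(grid_size):
            if seen[r][c]:
                continue
            label = labels[r][c]
            counts[label] += 1            # found a new region of this class
            stack = [(r, c)]
            seen[r][c] = True
            while stack:                  # iterative flood fill
                i, j = stack.pop()
                for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                    if (0 <= ni < grid_size and 0 <= nj < grid_size
                            and not seen[ni][nj] and labels[ni][nj] == label):
                        seen[ni][nj] = True
                        stack.append((ni, nj))
    return counts[True], counts[False]

def linear(x, y):
    return x + y - 0.33      # hypothetical linear classifier

def xor_like(x, y):
    return x * y             # hypothetical XOR-style classifier

print(count_regions(linear))    # (1, 1)
print(count_regions(xor_like))  # (2, 2)
```

The linear rule splits the square into one region per class, while the XOR-style rule produces two regions per class. An overfit model would show many small islands under the same count.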
Decision boundary complexity
How Betti numbers reflect overfitting
Consider an MLP trained on 2D data with Gaussian noise. At train_acc 95% (good generalization) the boundary has beta_0 = 2 (two separate pieces) and beta_1 = 1 (one closed loop). At train_acc 100% (overfitting) the boundary becomes highly fragmented: beta_0 = 47 (47 tiny class islands) and beta_1 = 89. Boundary topology gives a quantitative indication of how much the model has overfit.
Gabrielsson and Carlsson (2019) proposed TopoReg: add a penalty for high decision boundary Betti numbers to the loss function. In practice this requires a differentiable approximation of persistence via soft assignment. The reported result is that topologically regularized networks overfit less on small datasets.
Exact computation of decision boundary Betti numbers in high-dimensional input space is expensive. Standard trick: projection or sampling - select a 2D or 3D subspace (via PCA or important features) and compute boundary persistence there. This is an approximate but computationally tractable metric.
ML applications: topological complexity regularization (TopoReg) - penalize high beta_0 and beta_1 of the boundary during training; early stopping by topological complexity growth - if boundary Betti numbers start growing, the network may be starting to overfit; architecture comparison through decision boundary complexity on identical datasets.
What are the Betti numbers of a linear classifier decision boundary?
A hyperplane is contractible and connected, so it has exactly one component and no loops.
TDA for Interpretability: What Networks Learn
Mapper applied to the activation space of later layers reveals how the network organizes its representations. Rathore et al. (2019) studied transformers via TDA: the Mapper graph showed activation clusters corresponding to syntactic roles of tokens. This makes Mapper a tool for understanding what a network has encoded in its internal states.
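For readers who have not implemented Mapper, the following is a deliberately tiny sketch of the construction: cover the range of a 1D filter with overlapping intervals, cluster the points in each preimage (single linkage at a fixed threshold here), and connect clusters that share points. All parameter names and values (`n_intervals`, `overlap`, `eps`) are illustrative assumptions, and the points on a circle are a stand-in for activation vectors with one cyclic factor of variation.

```python
import math

def mapper_graph(points, filter_values, n_intervals=4, overlap=0.1, eps=0.3):
    """Tiny Mapper: interval cover of a 1D filter, single-linkage clustering
    at threshold eps in each preimage, edges between clusters sharing points."""
    lo, hi = min(filter_values), max(filter_values)
    width = (hi - lo) / n_intervals
    nodes = []
    for k in range(n_intervals):
        a = lo + k * width - overlap
        b = lo + (k + 1) * width + overlap
        members = [i for i, f in enumerate(filter_values) if a <= f <= b]
        parent = {i: i for i in members}

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Single linkage: join every pair of points closer than eps.
        for p in members:
            for q in members:
                if p < q and math.dist(points[p], points[q]) <= eps:
                    parent[find(p)] = find(q)
        clusters = {}
        for i in members:
            clusters.setdefault(find(i), set()).add(i)
        nodes.extend(clusters.values())
    edges = [
        (u, v)
        for u in range(len(nodes)) for v in range(u + 1, len(nodes))
        if nodes[u] & nodes[v]        # clusters share at least one point
    ]
    return nodes, edges

# Stand-in data: 60 points on a unit circle, filtered by their x-coordinate.
points = [(math.cos(2 * math.pi * k / 60), math.sin(2 * math.pi * k / 60))
          for k in range(60)]
nodes, edges = mapper_graph(points, [p[0] for p in points])
print(len(nodes), len(edges))  # 6 6
```

Six nodes joined by six edges form a single cycle, so the Mapper graph recovers the loop in the data. On real activations the filter is usually a projection such as a principal component or a norm, and the result depends noticeably on the cover and clustering parameters.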
Interpretability applications: persistent homology indicates which features the network finds informative (long H_0 bars correspond to stable clusters in activations); the Mapper graph shows hierarchical representation structure (classes, subclasses, anomalies); H_1 loops often correspond to continuous variation factors (orientation in vision, tonality in audio).
Transformer attention via TDA
Topological analysis of attention patterns
Take BERT, run 1000 sentences through it, and collect the attention matrices of each layer. Apply Mapper to the attention vectors. Result: clusters correspond to syntactic roles (subject, verb, object), and H_1 loops correspond to cyclic constructions (relative clauses, parenthetical asides). This gives an interpretable visualization of models that are usually treated as black boxes.
Persistent homology for neural network pruning: filters with short persistence in activation space are removal candidates, since they do not contribute topologically significant information to the representations. Compression of 30-50% with no accuracy loss on standard benchmarks has been reported for this approach.
An adversarial example is an input that has been moved across the model's decision boundary without crossing the real separator between classes: the prediction changes while the true class does not. TDA for adversarial robustness characterizes the density of topologically stable regions of the input space, a form of input-space-aware certification.
ML applications: TDA-based explanation of failure modes - identify the topological region of input space where the model errs; topological data augmentation - generate new training points preserving topological structure (useful for rare classes); neural architecture search guided by TDA - select architecture whose prediction function has correct topological complexity for the task.
What does Mapper applied to transformer attention vectors reveal?
Mapper creates a graph whose nodes are clusters of similar activation patterns and whose edges join clusters that share data points. Applied to attention vectors, this yields an interpretable picture of syntactic structure.
Where this leads
TDA in neural networks bridges classical topology with modern ML. It opens the path to interpretable models, theoretically grounded regularization and principled neural architecture search.
- Witness complexes
- TDA for time series
- Persistent homology
- Vietoris-Rips
Key ideas
- Neural network activations form a point cloud amenable to TDA
- Betti number beta_0 of later layers approximately equals class count
- H_1 loops in activations correspond to continuous variation factors
- Loss landscape has sublevel set topology changing with level c
- Flat (topologically simple) minima generalize better than sharp ones
- Decision boundary with high Betti numbers indicates overfitting
- Topological regularization (TopoReg) penalizes boundary complexity during training
- Mapper on activation patterns gives interpretable visualization of what the network learns
Questions for reflection
- Why do Betti numbers of final-layer activations approach class count only in a well-trained network?
- What topological properties of loss landscape make an architecture trainable?
- How to avoid combinatorial explosion when computing the decision boundary in high dimensions?
- What structure does Neural Collapse impose and why is it optimal?
- Can differentiable persistence replace cross-entropy for classification?