Deep Learning
Graph Neural Networks
Most real-world data is not a grid (images) or a sequence (text) - it is a graph. Molecules are graphs of atoms and bonds. Social networks are graphs of users and connections. Protein interactions are graphs. Knowledge bases are graphs. Citation networks are graphs. For decades, the only way to apply deep learning to these structures was to hand-engineer graph features. GNNs changed this: they learn directly on graph structure, and today they power drug discovery at Pfizer, fraud detection at PayPal, recommendation at Pinterest, and protein structure prediction at DeepMind.
- **DeepMind AlphaFold 2** uses an Evoformer network with attention over both sequence and structure representations - a GNN-inspired architecture that solved protein structure prediction (CASP14: 92.4 GDT accuracy), earning the 2024 Nobel Prize in Chemistry.
- **Pinterest Recommendations** uses PinSage (graph convolution over a 3-billion-node pin/board graph) to generate recommendations, improving engagement by 30% over previous matrix-factorization approaches and serving 450 million users.
- **PayPal Fraud Detection** applies GNN-based models to transaction graphs, detecting coordinated fraud rings that individual transaction classifiers miss - identifying multi-account fraud patterns worth hundreds of millions of dollars annually.
Marco Gori, Franco Scarselli, and the original Graph Neural Network
The term Graph Neural Network was introduced by Marco Gori and colleagues in 2005, then formalized by Franco Scarselli, Marco Gori, and co-authors in their 2009 paper 'The Graph Neural Network Model'. Their model propagated information across nodes until reaching a fixed point, which made training slow and unstable. The field stayed niche until 2016, when Thomas Kipf and Max Welling proposed the Graph Convolutional Network (GCN) for semi-supervised classification, simplifying spectral convolution into a cheap layerwise update. GCN made GNNs practical and triggered the explosion of graph deep learning.
Предварительные знания
- Backpropagation and training neural networks
- Attention mechanism (GAT and Graph Transformers reuse it)
- Basic linear algebra: adjacency and degree matrices, matrix multiplication
Message Passing Framework
Graph Neural Networks (GNNs) learn representations of nodes and edges in graphs by iteratively aggregating neighborhood information. The message passing framework (Gilmer et al. 2017) formalizes all major GNN variants: at each layer, node v aggregates 'messages' from its neighbors, combines them with its own representation, and updates its embedding. After K layers, each node's embedding encodes information from its K-hop neighborhood.
Aggregation function choice matters significantly: sum aggregation is more powerful than mean (can distinguish neighborhoods with different node counts); max aggregation extracts the most prominent feature. GraphSAGE found mean aggregation generalizes better to unseen graphs; GIN uses sum aggregation to achieve maximal expressiveness (as powerful as the Weisfeiler-Lehman graph isomorphism test).
After K message passing layers, what information does a node's embedding capture?
Graph Convolutional Networks
GCN (Kipf & Welling, 2017) simplifies spectral graph convolution to a computationally efficient message passing rule: each node aggregates features from its neighbors with degree-normalized weights. The propagation rule is H' = D^(-1/2) * A_hat * D^(-1/2) * H * W, where A_hat = A + I (adjacency with self-loops) and D is the degree matrix. This symmetric normalization prevents high-degree nodes from dominating aggregation.
GCN's weakness: it assigns equal weight to all neighbors, regardless of their relevance. A node connected to 1000 low-quality neighbors and 1 highly relevant neighbor gets a diluted representation. This motivates attention-based aggregation (GAT).
Why does GCN add self-loops (A_hat = A + I) in its propagation rule?
Graph Attention Networks
GAT (Velickovic et al., 2018) replaces fixed degree normalization with learned attention coefficients: each node learns which neighbors are most important for its update. The attention coefficient alpha_ij between nodes i and j is computed by a learnable linear transformation of their concatenated features, then normalized via softmax over all neighbors of i.
Multi-head attention (GAT uses 8 heads) improves stability and expressiveness. Each head learns different attention patterns; outputs are concatenated (for hidden layers) or averaged (for the final layer). GAT achieves 83.0% on Cora (vs. GCN's 81.5%) and 72.5% on Citeseer (vs. 70.3%), with larger gains on heterophilic graphs where neighbors have different labels.
GATv2 (Brody et al. 2021) fixed a subtle flaw in the original GAT: the static attention mechanism computes the same attention coefficient regardless of the query node's features. GATv2 adds a nonlinearity before the attention weight computation, making it 'dynamic' - the relevant neighbors depend on which node is asking. GATv2 improves over GAT by 2-5% on several benchmarks.
What is the key advantage of GAT over GCN?
Graph Transformers
Graph Transformers extend the Transformer architecture to graph-structured data. Unlike GNNs that aggregate only adjacent neighbors, Graph Transformers allow each node to attend to all other nodes in the graph, making them more expressive but O(n^2) in node count. Graphormer (Microsoft, 2021) won the OGB-LSC quantum chemistry prediction challenge by encoding graph structure (shortest path distances, degree centrality) as positional and structural encodings.
GPS (General, Powerful, Scalable) framework (Rampavsek et al. 2022) combines local message passing (GNN) with global attention (Transformer) at each layer, maintaining O(n) local complexity while adding O(n^2) global attention as an optional component. This hybrid achieves SOTA on molecular benchmarks (ZINC, PCQM4Mv2) while remaining practical for medium-scale graphs.
Positional encoding in Graph Transformers is a fundamental challenge: unlike sequences (sinusoidal PE) or images (grid PE), graphs have no canonical node ordering. Solutions: Laplacian eigenvectors (spectral PE), random walk statistics (RWPE), or shortest-path distances - each capturing different structural properties.
GNNs can model any graph property given enough layers
Standard message-passing GNNs are bounded in expressive power by the Weisfeiler-Lehman (1-WL) graph isomorphism test - they cannot distinguish certain non-isomorphic graphs, limiting what structural properties they can learn
Two nodes with identical K-hop neighborhood structures receive identical embeddings regardless of the global graph structure - 1-WL-indistinguishable graphs are also GNN-indistinguishable
Why do Graph Transformers with full attention outperform message-passing GNNs on some tasks?
Key Ideas
- **Message passing** is the unifying framework: each layer aggregates neighborhood features (sum/mean/max/attention-weighted), expanding the receptive field by one hop per layer.
- **GCN** uses fixed degree-normalized aggregation; **GAT** learns attention weights per neighbor pair - GAT wins on heterophilic graphs where GCN's equal-neighbor assumption fails.
- **Graph Transformers** (Graphormer, GPS) add full or sparse global attention on top of local message passing, capturing long-range dependencies critical for molecular and knowledge graph tasks.
Related Topics
GNNs connect to Transformers and specialized application domains:
- Transformers — Graph Attention Networks (GAT) are architecturally similar to Transformers; Graph Transformers directly extend Transformer self-attention to graph-structured inputs
- Self-Supervised Learning — Graph contrastive learning (GraphCL, GRACE) applies SSL objectives to GNNs, enabling pretraining on large unlabeled molecular and social graphs
Вопросы для размышления
- A drug discovery team wants to predict molecular toxicity for 10 million compounds, with labels available for only 50,000. They have the full molecular graph (atoms and bonds). Design the ML pipeline including architecture, pretraining strategy, and evaluation.
- Social network fraud detection using GNNs faces adversarial attackers who modify their account graph structure to avoid detection. What are the most effective defenses?
- When does a GNN with 5 layers fail to learn useful representations, and what is the specific failure mode? How does it manifest in node classification performance?
Связанные уроки
- dl-06 — GAT applies scaled dot-product attention to graph edges
- dl-07 — ViT and Graph Transformers share positional encoding ideas
- dl-17 — GraphMAE and DGI pretrain GNNs without node labels
- ds-16-graphs-intro — Graph as a data structure of nodes and edges
- alg-12-bfs — Message passing spreads like breadth-first traversal
- la-35-spectral-graph — Spectral GNNs use the graph Laplacian eigenbasis
- la-01-vectors-intro