Knowledge Graphs
GNN on Knowledge Graphs
KG embeddings from the previous lesson train vectors independently for each triple. But in real knowledge graphs patterns are distributed across the structure: the fact 'Pushkin is a poet' becomes evident from knowing he wrote poems and studied at the Lyceum. GNNs learn from exactly this kind of structural evidence.
- **Microsoft GraphRAG:** GNN over KG for precise question answering without hallucination
- **Biomedicine:** drug-protein interaction prediction via R-GCN on biological knowledge graphs
- **Recommendation systems:** PinSage (Pinterest) - GNN on the user-pin graph at billion-edge scale
R-GCN: separate weights per relation type
Standard GCN aggregates neighbors through a single weight matrix W. In a knowledge graph edges have types: (Pushkin, wrote, Onegin), (Pushkin, bornIn, Moscow) are different relations. **R-GCN** (Schlichtkrull et al., 2018) introduces a separate matrix W_r per relation type.
**Parameter explosion:** with 500 relation types and d=200, that is 500 x 200 x 200 = 20 million parameters per layer. R-GCN solves this via **basis decomposition**: W_r = sum_b a_{rb} * V_b, where V_b are shared basis matrices and a_{rb} are scalar coefficients per relation.
**R-GCN on FB15k-237:** entity classification MRR 0.248. Advantage over TransE: R-GCN uses n-hop graph structure - direct h to t links and paths through intermediate entities.
Why does R-GCN use basis decomposition for its W_r matrices?
CompGCN: composition operators for entity+relation
R-GCN aggregates neighbors via W_r but does not update the relation vector - it does not learn in the context of specific entities. **CompGCN** (Vashishth et al., 2020) explicitly includes the relation vector in the message: before aggregation, entity h_u and relation r_k are combined via a composition operator.
**Why three matrices W_O, W_I, W_S?** Knowledge graphs are directed: (Pushkin, wrote, Onegin) and (Onegin, writtenBy, Pushkin) are different triples. CompGCN handles incoming and outgoing edges with different weights, and reverse edges via a special r_inv vector.
How does CompGCN fundamentally differ from R-GCN in message formation?
Message Passing Neural Networks: the general framework
R-GCN, CompGCN, GraphSAGE, GAT - all are special cases of the **MPNN (Message Passing Neural Network)** framework (Gilmer et al., 2017). It defines a common scheme: for each node v at each layer - compute messages from neighbors, aggregate them, update the node.
The number of layers L in MPNN determines the radius of information each node sees: after L layers, node v aggregates information from all nodes within distance L edges. For typical KG completion tasks, 2-3 layers is sufficient.
- **Downstream tasks:** entity classification, link prediction (predict r in (h,?,t)), KG completion
- **L=1:** direct neighbors only. L=2:** neighbors of neighbors - typical for KG
- **Over-smoothing:** at large L all embeddings converge to a similar vector - nodes become indistinguishable
After 3 MPNN layers, node v has aggregated information from nodes at distance:
Aggregation strategies: mean, sum, attention
The AGG function is a critical design choice: it determines how messages from multiple neighbors are combined into a single vector. Different strategies produce different inductive biases and perform differently across tasks.
| Aggregation | Formula | Property | Best for |
|---|---|---|---|
| Mean | sum(m_u) / |N(v)| | Invariant to degree | Homogeneous graphs |
| Sum | sum(m_u) | Sensitive to degree | Structural tasks |
| Max | max(m_u) | Key feature detection | Anomalies, outliers |
| Attention | sum(alpha_{uv} * m_u) | Weighted neighbors | Heterogeneous KG |
**Theorem (Xu et al., 2019):** GNNs with sum aggregation are as expressive as the Weisfeiler-Leman (WL) graph isomorphism test - the theoretical maximum for message-passing GNNs. Mean and max aggregations are strictly less expressive (they cannot distinguish certain structures).
Node v has 10 neighbors with identical embeddings [1, 0, 0]. Sum returns [10, 0, 0], Mean returns [1, 0, 0]. When is Sum preferable?
GNNs for knowledge graphs
- **R-GCN:** separate W_r per relation type + basis decomposition for parameter efficiency
- **CompGCN:** message = phi(entity, relation), both updated; 3 operators (subtract/multiply/ccorr)
- **MPNN:** general framework: Message → Aggregate → Update; L-hop coverage after L layers
- **Aggregation:** sum is maximally expressive by WL test; attention is flexible for heterogeneous graphs
Related topics
GNNs on KGs connect embedding methods with graph-structured learning.
- KG Embeddings: TransE, RotatE, ComplEx — Composition operators phi in CompGCN directly inherit scoring functions from the previous lesson
- Graph Neural Networks: core concepts — R-GCN and CompGCN are specializations of GCN for multi-relational graphs
Вопросы для размышления
- At what number of relation types does basis decomposition in R-GCN stop improving model quality, and why?
- How does the over-smoothing problem in deep GNNs manifest specifically for knowledge graphs with billions of entities?
- Why is attention aggregation useful for heterogeneous KGs but can underperform mean on regular homogeneous graphs?