Knowledge Graphs

GNN on Knowledge Graphs

KG embeddings from the previous lesson train vectors independently for each triple. But in real knowledge graphs patterns are distributed across the structure: the fact 'Pushkin is a poet' becomes evident from knowing he wrote poems and studied at the Lyceum. GNNs learn from exactly this kind of structural evidence.

**Microsoft GraphRAG:** GNN over KG for precise question answering without hallucination
**Biomedicine:** drug-protein interaction prediction via R-GCN on biological knowledge graphs
**Recommendation systems:** PinSage (Pinterest) - GNN on the user-pin graph at billion-edge scale

R-GCN: separate weights per relation type

Standard GCN aggregates neighbors through a single weight matrix W. In a knowledge graph edges have types: (Pushkin, wrote, Onegin), (Pushkin, bornIn, Moscow) are different relations. **R-GCN** (Schlichtkrull et al., 2018) introduces a separate matrix W_r per relation type.

**Parameter explosion:** with 500 relation types and d=200, that is 500 x 200 x 200 = 20 million parameters per layer. R-GCN solves this via **basis decomposition**: W_r = sum_b a_{rb} * V_b, where V_b are shared basis matrices and a_{rb} are scalar coefficients per relation.

**R-GCN on FB15k-237:** entity classification MRR 0.248. Advantage over TransE: R-GCN uses n-hop graph structure - direct h to t links and paths through intermediate entities.

Why does R-GCN use basis decomposition for its W_r matrices?

CompGCN: composition operators for entity+relation

R-GCN aggregates neighbors via W_r but does not update the relation vector - it does not learn in the context of specific entities. **CompGCN** (Vashishth et al., 2020) explicitly includes the relation vector in the message: before aggregation, entity h_u and relation r_k are combined via a composition operator.

**Why three matrices W_O, W_I, W_S?** Knowledge graphs are directed: (Pushkin, wrote, Onegin) and (Onegin, writtenBy, Pushkin) are different triples. CompGCN handles incoming and outgoing edges with different weights, and reverse edges via a special r_inv vector.

How does CompGCN fundamentally differ from R-GCN in message formation?

Message Passing Neural Networks: the general framework

R-GCN, CompGCN, GraphSAGE, GAT - all are special cases of the **MPNN (Message Passing Neural Network)** framework (Gilmer et al., 2017). It defines a common scheme: for each node v at each layer - compute messages from neighbors, aggregate them, update the node.

The number of layers L in MPNN determines the radius of information each node sees: after L layers, node v aggregates information from all nodes within distance L edges. For typical KG completion tasks, 2-3 layers is sufficient.

**Downstream tasks:** entity classification, link prediction (predict r in (h,?,t)), KG completion
**L=1:** direct neighbors only. L=2:** neighbors of neighbors - typical for KG
**Over-smoothing:** at large L all embeddings converge to a similar vector - nodes become indistinguishable

After 3 MPNN layers, node v has aggregated information from nodes at distance:

Aggregation strategies: mean, sum, attention

The AGG function is a critical design choice: it determines how messages from multiple neighbors are combined into a single vector. Different strategies produce different inductive biases and perform differently across tasks.

Aggregation	Formula	Property	Best for
Mean	sum(m_u) / \|N(v)\|	Invariant to degree	Homogeneous graphs
Sum	sum(m_u)	Sensitive to degree	Structural tasks
Max	max(m_u)	Key feature detection	Anomalies, outliers
Attention	sum(alpha_{uv} * m_u)	Weighted neighbors	Heterogeneous KG

**Theorem (Xu et al., 2019):** GNNs with sum aggregation are as expressive as the Weisfeiler-Leman (WL) graph isomorphism test - the theoretical maximum for message-passing GNNs. Mean and max aggregations are strictly less expressive (they cannot distinguish certain structures).

Node v has 10 neighbors with identical embeddings [1, 0, 0]. Sum returns [10, 0, 0], Mean returns [1, 0, 0]. When is Sum preferable?

GNNs for knowledge graphs

**R-GCN:** separate W_r per relation type + basis decomposition for parameter efficiency
**CompGCN:** message = phi(entity, relation), both updated; 3 operators (subtract/multiply/ccorr)
**MPNN:** general framework: Message → Aggregate → Update; L-hop coverage after L layers
**Aggregation:** sum is maximally expressive by WL test; attention is flexible for heterogeneous graphs

Вопросы для размышления

At what number of relation types does basis decomposition in R-GCN stop improving model quality, and why?
How does the over-smoothing problem in deep GNNs manifest specifically for knowledge graphs with billions of entities?
Why is attention aggregation useful for heterogeneous KGs but can underperform mean on regular homogeneous graphs?

Связанные уроки

ml-01