Knowledge Graphs

Knowledge Graph Completion

Freebase - the largest knowledge graph in history before Google - shut down in 2016 with 3 billion facts. And 95% of possible edges were empty. Not because the facts were unknown. Because humans physically could not keep up. KGC is the attempt to teach a graph to complete itself.

**Wikidata Link Suggestion** (2023): a BERT-based model predicts missing property-value pairs for articles, reducing load on 11K active editors
**Google Product KG**: CompGCN-like methods link products to categories and attributes that suppliers never explicitly declared
**OpenBioLink**: RotatE on a biomedical KG predicts drug-target interactions - potential therapeutic targets without clinical trials
**Microsoft Academic Graph**: automatic author-institution link discovery via rule mining - without scanning 250M publications manually

The Link Prediction Problem

Wikidata in 2023: 11,000 active editors, 1.8 billion statements. Yet 85% of potential links between entities are missing. Freebase before its shutdown in 2016 held 3 billion facts - and 95% of possible edges were empty. Not because the facts were unknown. Simply because no one got around to entering them.

This is not a data collection problem. It is mathematical reality: a knowledge graph with N entities and R relation types has up to $N^2 \times R$ potential edges. For Wikidata that is roughly 100 quintillion pairs. Human hands cannot reach that far.

**Knowledge Graph Completion (KGC)** is the automatic prediction of missing edges based on the existing graph structure. Formally: given a graph $\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{T})$ of triples $(h, r, t)$, answer the query $(h, r, ?)$ or $(?, r, t)$ by ranking all candidate entities.

Google Knowledge Graph uses KGC to fill information panels in Search: if the graph knows Elon Musk is CEO of Tesla and Tesla is headquartered in Austin, KGC can infer Musk's connection to Austin. Microsoft Academic Graph applied it to auto-complete author-institution links. Amazon Product Graph predicts "also fits" relationships between items.

Approaches fall into three classes. **Translational models** (TransE, RotatE) - geometry in embedding space. **Rule mining** (AMIE) - automatic extraction of logical rules. **GNN methods** (CompGCN, KGCN) - neural networks operating on graph structure. Each class captures different patterns.

The standard metric is **Mean Reciprocal Rank (MRR)**: for each query the model ranks all candidate entities; the reciprocal rank of the correct answer is averaged over all queries. MRR = 1 means the correct answer is always ranked first. **Hits@K** measures how often the correct answer appears in the top-K. On FB15k-237 the best models reach MRR 0.35-0.48 and Hits@10 around 0.53-0.63.

Wikidata has 100M entities and 10K relation types. How many potential edges exist in the graph?

TransE and RotatE: The Geometry of Relations

2013. Bordes et al. publish TransE at NIPS with a disarmingly simple idea. If the triple $(h, r, t)$ is true, then in embedding space the following should hold approximately: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$. The relation is a translation vector.

The same insight as word2vec: "king - man + woman = queen" - but now for entities. "Einstein + bornIn = Ulm" in embedding space should yield a vector close to Ulm. The model is trained contrastively: true triples receive high scores, randomly corrupted triples receive low scores.

TransE is elegant but has a fundamental flaw: a relation is one translation vector, shared across all entity pairs. This works for 1-to-1 relations. But "bornIn" is 1-to-many: thousands of people were born in Moscow. TransE tries to collapse all their embeddings toward one point. Moscow = Pushkin = Tolstoy in embedding space? No.

RotatE (Sun et al., 2019) solves this more elegantly. Instead of translation in real space - **rotation in complex space**. Each relation is a rotation $e^{i\theta_r}$ per coordinate:

The Hadamard product $\circ$ (element-wise multiplication in $\mathbb{C}^d$) encodes symmetry, anti-symmetry, and inversion. "Spouse" is symmetric: rotation by $\pi$ twice = identity. "parentOf" is anti-symmetric: rotation by angle without inverse. KGBERT (2022) achieves MRR 0.479 on FB15k-237 - against 0.279 for TransE.

**Relation patterns and their encoding:** - **Symmetry** (A married_to B => B married_to A): RotatE encodes as rotation by $\pi$ - **Anti-symmetry** (A parentOf B => B NOT parentOf A): rotation without inverse - **Inversion** (A bornIn City => City birthplaceOf A): $\theta_{r_1} = -\theta_{r_2}$ - **Composition** (A parentOf B, B parentOf C => A grandparentOf C): angle addition TransE cannot encode symmetry (h + r = t and t + r = h implies r = 0). RotatE encodes all four patterns.

In production, embedding approaches power the Wikidata Query Service link suggestion tool, knowledge enrichment in Alexa, and Google's KELM (Knowledge-Enhanced Language Model). Typical embedding dimensions range from 200 to 1000; standard benchmarks are FB15k-237 (14,541 entities, 237 relations) and WN18RR (40,943 entities, 11 relations).

The relation 'spouse' is symmetric: if A is married to B, then B is married to A. Why does TransE model symmetric relations poorly?

AMIE and GNN: Rules and Neural Networks on the Graph

Embedding models predict links but do not explain why. TransE says there is a relation R between X and Y - so what? One cannot audit it. One cannot trust it in medicine or law. AMIE (Association Rule Mining under Incomplete Evidence) takes a different route: it mines logical rules directly from the graph structure.

The idea: if the pattern $\text{livesIn}(X, Y) \land \text{capitalOf}(Y, Z) \Rightarrow \text{nationality}(X, Z)$ appears frequently, it is a rule. AMIE computes **support** (how many times the rule head is true) and **confidence** (fraction of times the body is true and the head holds too).

AMIE is interpretable: every prediction comes with an explanatory rule. The limitation: rules are linear (triple chains) and miss complex structural patterns. They also only fire if the pattern already appears somewhere in the graph.

CompGCN (Vashishth et al., 2020) merges the best of both worlds. It is a Graph Neural Network where message passing accounts for relation types. Each entity updates its representation from neighbors - weighted by which edge type connects them.

Here $\phi$ is a composition operator combining a neighbor embedding with an edge embedding (addition, multiplication, or concatenation). $W_\lambda$ is a different weight matrix for forward, inverse, and self-loop edges. After L such aggregation layers, an entity representation captures its L-hop neighborhood context.

CompGCN reaches MRR 0.355 on FB15k-237 - higher than TransE (0.279) and comparable to RotatE (0.338). The key advantage: GNN weights are shared across entities. A new entity added to the graph is handled by the existing weights through a few aggregation steps. Embedding lookup tables are not.

**KGC approach comparison:** | Method | MRR (FB15k-237) | Interpretable | Incremental | |---|---|---|---| | TransE | 0.279 | no | no | | RotatE | 0.338 | no | no | | AMIE (rules) | ~0.20 | yes | partial | | CompGCN | 0.355 | no | yes | | KGBERT | 0.479 | partial | yes | KGBERT is fine-tuned BERT on textual descriptions of entities and relations. The highest MRR scores belong to models combining structural embeddings with language context.

In practice, approaches are combined. Wikidata Mismatch Finder uses rule-based methods (AMIE style) to detect contradictions. Google's Product Knowledge Graph applies GNN methods for item-category links. OpenBioLink (biomedical KG) uses RotatE because biological relations tend to be symmetric and have clear semantics.

KGC is just path search: if there is a path from A to B in the graph, the link exists

KGC predicts links absent from the graph by generalizing patterns across all embeddings or rules

Path search is graph traversal, not completion. TransE and RotatE predict links even between entities with no shared neighbors, by generalizing the structure of the entire graph through embedding space. That is precisely why one can predict 'nationality' for a person never explicitly linked to a country in the graph.

Key Ideas

**The KGC problem**: the largest knowledge graphs are 5-15% complete - KGC predicts missing edges automatically
**TransE**: $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ - relation as translation; elegant but breaks on symmetric and 1-to-many relations
**RotatE**: relation as rotation in $\mathbb{C}^d$; encodes symmetry, anti-symmetry, inversion, and composition - MRR 0.338 vs 0.279 for TransE
**AMIE**: rule mining with confidence and support - produces interpretable rules like livesIn + capitalOf => nationality
**CompGCN**: GNN with relation-typed aggregation - handles new entities incrementally; static embedding tables do not

Вопросы для размышления

TransE breaks on symmetric relations, RotatE fixes it via rotation. What other relation patterns might break geometric approaches entirely?
AMIE produces interpretable rules while GNN methods do not. In which domains (medicine, law, finance) does this matter most and why?
If 10 million new entities are added to the graph with no edges - what will embedding models predict and what will GNN-based methods do?

Связанные уроки

aie-41-knowledge-graphs — KG completion is the core of production KG systems in AI Engineering
aie-63-graphrag — GraphRAG uses link prediction to enrich retrieval context
ml-35-word-embeddings — TransE is word2vec for graph edges: the same translation idea in embedding space
ml-31-transformers — CompGCN and GNN methods use the same message-passing aggregation as transformers - but on graphs
ds-16-graphs-intro — Basic graph theory is a prerequisite for understanding KG embeddings
ml-01