Knowledge Graphs
KG at Scale: Google, Meta
On May 16, 2012, Amit Singhal published a Google blog post titled 'Introducing the Knowledge Graph: things, not strings'. For the query 'Marie Curie', Singhal shows a panel to the right of the classic web results page (SERP) with her photo, life dates, area of research, husband, daughter, and Nobel prizes. It is the first time web search answers not with 'here are ten links' but with 'here is the fact itself'. Behind the scenes: 18 billion facts about 570 million entities. By 2024 the Knowledge Graph has grown roughly tenfold in entities: ~5 billion entities, ~500 billion facts. It is not one big Wikidata - it is automatic extraction from billions of web pages combined with an editorial process for the top 10,000 entities. Between Freebase in 2010 (12M entities) and the Google KG in 2024 (5B) lies a gap of about 400x, crossed by engineering decisions that are now the industrial KG standard.
- **Google Knowledge Graph** - ~5B entities in 2024, drives knowledge panels in Search, Google Assistant answers, context for Gemini; an integral part of Google products.
- **Meta (Facebook) Knowledge Graph** - optimized for social discovery: people, pages, events, interests; used in News Feed ranking and Marketplace.
- **Wikidata** - open KG with 100M+ entities, foundational for Wikipedia across 350+ languages; used as seed for commercial KGs and in academic databases.
Freebase: Collaborative KG Before Google
In July 2010 Google bought Metaweb for $40-50 million and inherited Freebase, a collaborative knowledge graph with 12 million entities and 125 million triple statements. Freebase launched in 2007 as 'Wikipedia for structured data': users edited facts by hand, and the data was queried through MQL (Metaweb Query Language). Under the hood was a graph database on a proprietary engine, with Compound Value Types (CVT) used to model relations that carry metadata (an actor playing a role in a film links three entities through one CVT node). Google used Freebase as the foundation for the Knowledge Graph launched in May 2012 with the promise 'things, not strings'. Between 2014 and 2016 Google shut Freebase down, migrating data into Wikidata and its own private KG. The Freebase story is a lesson: even the best crowdsourcing hits coverage limits (~12M entities), and Google needed billions.
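The CVT idea can be sketched in a few lines. This is a minimal illustration, not Freebase's actual schema or storage format: the `Triple` class, the predicate names, and the `cvt:1` identifier are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def performance_cvt(cvt_id, film, actor, character):
    """Expand one n-ary 'performance' fact into triples around a CVT node.

    The CVT node has no meaning on its own; it only ties the film,
    the actor, and the character together.
    """
    return [
        Triple(film, "film/starring", cvt_id),
        Triple(cvt_id, "performance/actor", actor),
        Triple(cvt_id, "performance/character", character),
    ]


def characters_played(triples, actor):
    """Follow actor -> CVT node -> character."""
    cvts = {t.subject for t in triples
            if t.predicate == "performance/actor" and t.obj == actor}
    return sorted(t.obj for t in triples
                  if t.subject in cvts and t.predicate == "performance/character")


triples = performance_cvt("cvt:1", "The Matrix", "Keanu Reeves", "Neo")
print(characters_played(triples, "Keanu Reeves"))
```

A plain binary triple (actor, playedIn, film) would lose the character; the intermediate node is what keeps the three-way relation intact.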
MID (Machine ID) is the unique entity identifier in Freebase, in the format /m/0abc123 (base32). As of 2024 it survives in Wikidata as an external identifier property (Freebase ID, P646) and is used for cross-referencing datasets. Wikidata items that correspond to a former Freebase entity can still carry this MID.
What was the main reason Google shut down Freebase and switched to an internal KG?
Wikidata: An Open Structured World
Wikidata, launched in October 2012 under the Wikimedia Foundation, aims to be the structured backend of Wikipedia across all languages. Every Wikipedia article on a given subject, in any language (English, Russian, Chinese), points to a single Wikidata entity with a unique Q-number (Berlin = Q64, Albert Einstein = Q937). Properties carry P-numbers (date of birth = P569, height = P2048). By 2024: 100M+ entities, 1.5B triples, ~25,000 active editors per month. Access is through a SPARQL endpoint (query.wikidata.org) and regular RDF dumps of about 200 GB. Wikidata is a public good of the modern web: it is used in Google Knowledge Graph, Apple Siri, Amazon Alexa, OpenStreetMap, and academic databases. A unique trait is language-neutral identity: the entity 'Moscow' is one Q-node with labels in 350+ languages, which is critical for multilingual systems.
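To make the language-neutral identity concrete, here is a minimal sketch that builds a SPARQL request for the labels of one Q-node in several languages. It only constructs the URL and does not send it; the helper names `label_query` and `request_url` are illustrative, and the endpoint path and `format=json` parameter are assumed to follow the usual query service convention.

```python
from urllib.parse import urlencode

# Assumed endpoint path of the Wikidata query service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def label_query(qid, languages):
    """SPARQL text asking for the labels of one entity in the given languages."""
    langs = ", ".join(f'"{lang}"' for lang in languages)
    return (
        "PREFIX wd: <http://www.wikidata.org/entity/>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?label WHERE {\n"
        f"  wd:{qid} rdfs:label ?label .\n"
        f"  FILTER(LANG(?label) IN ({langs}))\n"
        "}"
    )


def request_url(qid, languages):
    """URL-encode the query; the caller decides how to fetch it."""
    params = {"query": label_query(qid, languages), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


# Berlin (Q64): one identifier, many labels.
url = request_url("Q64", ["en", "ru", "zh"])
print(label_query("Q64", ["en", "ru", "zh"]))
```

The identifier in the query never changes with the language; only the label filter does.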
Provenance in Wikidata: each statement can carry multiple sources (references), temporal qualifiers ('start time', 'end time'), and rank (preferred / normal / deprecated). This supports conflicting statements and temporal dynamics - the population of Berlin in 1990 vs 2024.
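A minimal sketch of how rank and qualifiers interact when reading such statements. The `Statement` class is a hypothetical simplification, the population figures are illustrative round numbers rather than actual Wikidata values, and the 'point in time' qualifier is used here in place of a start/end pair.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    value: int
    rank: str = "normal"  # "preferred" | "normal" | "deprecated"
    qualifiers: dict = field(default_factory=dict)
    references: list = field(default_factory=list)


def best_statements(statements):
    """Preferred-rank statements if any exist, otherwise normal ones.

    Deprecated statements stay in the data but are never returned here.
    """
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


def value_at(statements, year):
    """Historical lookup: any non-deprecated statement qualified with that year."""
    for s in statements:
        if s.rank != "deprecated" and s.qualifiers.get("point in time") == year:
            return s.value
    return None


# Illustrative placeholder numbers, not real census data.
berlin_population = [
    Statement(3_400_000, "normal", {"point in time": 1990}, ["source A"]),
    Statement(3_700_000, "preferred", {"point in time": 2024}, ["source B"]),
    Statement(9_999_999, "deprecated", {"point in time": 2024}, ["source C"]),
]

print([s.value for s in best_statements(berlin_population)])
print(value_at(berlin_population, 1990))
```

Both the 1990 and the 2024 values coexist; rank decides which one a default query sees, and qualifiers let a historical query reach the older one.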
Which architectural decision makes Wikidata convenient for multilingual systems?
Knowledge Vault: Automated Knowledge Extraction
In 2014 Google published 'Knowledge Vault: A Web-Scale Approach to Probabilistic Knowledge Fusion' (Dong et al., KDD 2014), an approach to automatically extracting billions of facts from the open web. The idea: rather than relying on crowdsourcing, exploit the structured signals that web pages already carry (HTML tables, microdata, infoboxes) and parse free text with NLP. Knowledge Vault combines four sources: TXT (relations extracted from free text), DOM (structured HTML), TBL (HTML tables), MDA (schema.org microdata; the paper labels this human-annotated source ANO). Each fact gets a confidence score, and fusion retains the high-confidence ones. The result in 2014: 1.6 billion facts about ~50M entities - 30x larger than Freebase. The modern Google Knowledge Graph descends from this approach: automatic extraction, plus an editorial team for top entities, plus seeding from Wikidata.
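One way to picture fusion is to combine per-extractor confidences in log-odds space under an independence assumption. This is a simplified sketch and not the paper's actual fusion model; the function `fuse`, the 0.5 prior, the example facts, and the 0.9 threshold are all assumptions for illustration.

```python
import math


def logit(p, eps=1e-9):
    """Log-odds of a probability, clamped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fuse(confidences, prior=0.5):
    """Naive-Bayes style fusion of calibrated confidences.

    Each extractor shifts the log-odds away from the prior; the shifts
    are summed, which is only valid if the extractors are independent.
    """
    evidence = sum(logit(p) - logit(prior) for p in confidences)
    return sigmoid(logit(prior) + evidence)


# Hypothetical candidate facts with per-extractor confidences.
candidates = {
    "fact supported by TXT and TBL": [0.7, 0.7],
    "fact supported by TXT only": [0.7],
    "fact with conflicting signals": [0.7, 0.3],
}
kept = {fact for fact, scores in candidates.items() if fuse(scores) >= 0.9}

for fact, scores in candidates.items():
    print(fact, round(fuse(scores), 3))
```

Two moderately confident, agreeing extractors end up more convincing than either alone, while conflicting ones cancel out. If both extractors read pages that copy the same origin, the summed evidence is double-counted.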
Confidence calibration: extraction classifiers emit raw scores that must be calibrated. Knowledge Vault uses Platt scaling - a logistic regression fitted over the raw scores. Without calibration, a 0.9 threshold can mean very different precisions across extractors; after calibration, a score of 0.9 means roughly 90% precision for every extractor.
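A minimal sketch of Platt scaling: fit `sigmoid(a * score + b)` to held-out labels with plain gradient descent. The scores and labels below are made-up validation data for one hypothetical extractor, and the sketch omits the target smoothing that Platt's original method applies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fit_platt(scores, labels, lr=0.5, epochs=5000):
    """Fit a and b in sigmoid(a * score + b) by minimizing log loss."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = sigmoid(a * s + b)
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw extractor score to a calibrated probability."""
    return sigmoid(a * score + b)


# Hypothetical held-out data: raw scores and whether the fact was correct.
raw_scores = [0.1, 0.3, 0.5, 0.6, 0.7, 0.8, 0.9, 0.9, 0.95, 0.95]
is_correct = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

a, b = fit_platt(raw_scores, is_correct)
print(round(calibrate(0.9, a, b), 3))
```

Each extractor gets its own `(a, b)` pair, so that one shared threshold can be applied to the calibrated outputs.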
Why does Knowledge Vault use Bayesian fusion (combining multiple sources) rather than just trusting the most confident extractor?
Scaling to Billions of Entities
The modern Google Knowledge Graph holds ~5 billion entities and ~500 billion facts. Facebook (Meta) built an analogous graph for social and interest entities. At this scale the simple techniques that work in Wikidata (a single SPARQL endpoint, a monolithic triple store) collapse. Industrial KG solutions: 1) sharding by entity domain (people/places/companies) or by hash(entity_id); 2) specialized query layers - graph traversal via Pregel/Giraph for multi-hop, low-latency lookup via KV-store for single-hop, full-text via Lucene for description search; 3) eventual consistency between shards - a fact may appear in one shard before others; 4) hot/cold tiering - the 1% most-queried entities in RAM, the rest on SSD.
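Points 1 and 4 can be sketched together. This is a toy, single-process illustration, not how any production KG is implemented: `shard_for`, `TieredStore`, the shard count, and the use of LRU eviction as a stand-in for 'most-queried' are all assumptions.

```python
import hashlib
from collections import OrderedDict

NUM_SHARDS = 8  # hypothetical shard count


def shard_for(entity_id, num_shards=NUM_SHARDS):
    """Stable shard assignment by hash(entity_id).

    Python's built-in hash() is salted per process for strings, so a
    cryptographic digest is used to get the same shard on every machine.
    """
    digest = hashlib.sha256(entity_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class TieredStore:
    """Hot tier (small, in RAM) in front of a cold tier (stand-in for SSD)."""

    def __init__(self, hot_capacity):
        self.hot = OrderedDict()  # least recently used entry comes first
        self.cold = {}
        self.hot_capacity = hot_capacity

    def put(self, entity_id, facts):
        self.cold[entity_id] = facts

    def get(self, entity_id):
        if entity_id in self.hot:
            self.hot.move_to_end(entity_id)  # mark as recently used
            return self.hot[entity_id]
        facts = self.cold[entity_id]  # slow path
        self.hot[entity_id] = facts
        if len(self.hot) > self.hot_capacity:
            self.hot.popitem(last=False)  # evict the least recently used
        return facts


store = TieredStore(hot_capacity=2)
for eid in ("a", "b", "c"):
    store.put(eid, {"id": eid})
for eid in ("a", "b", "c"):
    store.get(eid)
print(list(store.hot), shard_for("/m/0abc123"))
```

Hash sharding spreads load evenly but scatters related entities; domain sharding keeps neighbors together at the cost of uneven shard sizes.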
Triple-store query plan: even a simple SPARQL query with 3 JOINs over 500B triples can produce millions of intermediate rows (~5M in a typical case). Modern engines (Virtuoso, Stardog) use join reordering driven by statistics, parallel scans over partitions, and materialized views for the top-100 query patterns. Without these optimizations, latency grows from seconds to minutes.
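Why join order matters can be shown on a toy store. The sketch below uses a naive nested-loop join and a greedy 'fewest matches first' heuristic as a stand-in for statistics-driven reordering; it is not how Virtuoso or Stardog plan queries, and all data and function names are made up.

```python
def is_var(term):
    return term.startswith("?")


def match(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if is_var(term):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def evaluate(patterns, triples):
    """Nested-loop join; returns (solutions, total intermediate rows)."""
    bindings, intermediate = [{}], 0
    for pattern in patterns:
        extended = []
        for b in bindings:
            for t in triples:
                b2 = match(pattern, t, b)
                if b2 is not None:
                    extended.append(b2)
        bindings = extended
        intermediate += len(bindings)
    return bindings, intermediate


def reorder(patterns, triples):
    """Greedy plan: evaluate the pattern with the fewest matches first."""
    def cardinality(p):
        return sum(match(p, t, {}) is not None for t in triples)
    return sorted(patterns, key=cardinality)


people = ["alice", "bob", "carol", "dave", "erin"]
triples = [(p, "type", "Person") for p in people] + [
    ("alice", "bornIn", "Paris"), ("bob", "bornIn", "Berlin"),
    ("carol", "bornIn", "Rome"), ("dave", "bornIn", "Berlin"),
    ("Berlin", "capitalOf", "Germany"), ("Paris", "capitalOf", "France"),
    ("Rome", "capitalOf", "Italy"),
]
query = [("?p", "type", "Person"), ("?p", "bornIn", "?c"),
         ("?c", "capitalOf", "Germany")]

naive_solutions, naive_rows = evaluate(query, triples)
plan_solutions, plan_rows = evaluate(reorder(query, triples), triples)
print(naive_rows, plan_rows)
```

Both orders return the same answers, but starting from the most selective pattern keeps the intermediate result small. A real optimizer also has to avoid orders that join unconnected patterns, which this greedy heuristic does not check.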
Misconception: Google Knowledge Graph is public Wikidata with a nicer UI.
Reality: Google KG is a private graph of ~5B entities that uses Wikidata as one seed, but is primarily built from automatic web extraction (Knowledge Vault), internal signals, and an editorial team.
Wikidata holds ~100M entities; Google KG is roughly 50x larger. Wikipedia infoboxes + Wikidata cover celebrities and notable places, but not local businesses, products, or long-tail video games. Google invested in automatic extraction precisely for that long tail.
Why do industrial KGs use several query engines (KV, Pregel, Lucene, SPARQL) rather than a single unified engine?
Key ideas
- **Freebase** was an early collaborative KG (2007-2016) with 12M entities and MQL; Google acquired it with Metaweb in 2010 and shut it down after migrating the data into Wikidata.
- **Wikidata** is an open language-neutral KG, 100M+ entities, SPARQL endpoint, foundation for multilingual systems; provenance via references and qualifiers.
- **Knowledge Vault** performs automatic fact extraction from the web (TXT + DOM + TBL + MDA) with Bayesian fusion confidence scores; the basis of industrial KGs at billion-scale.
- **Scaling** uses domain sharding, specialized query engines (KV, Pregel, Lucene, SPARQL), and hot/cold tiering; a universal engine loses to specialists by 10-100x.
Related topics
Industrial KGs weave the previously studied concepts into production infrastructure:
- KG + LLM: RAG and Grounding - Google KG is the grounding source for Gemini; the 5B-entity scale makes GraphRAG viable for all knowledge domains, not only narrow ones
- Entity Linking and Resolution - Knowledge Vault over HTML and text requires reliable entity linking; with bad alignments the aggregated confidence loses meaning
Questions for reflection
- Back to the hook: between Freebase (12M entities) and the modern Google KG (5B) lies a 400x gap. Which fraction came from automatic extraction versus acquisitions/imports of other datasets?
- Knowledge Vault uses Bayesian fusion with an independence-of-sources assumption. When is that assumption violated (e.g., two pages quoting the same Wikipedia article)?
- When designing a KG for a startup with a $1M budget and 100M facts, which Google-scale decisions remain reasonable, and which become overkill?
Related lessons
- kg-12 - KG+LLM is the technology behind Google Knowledge Graph 2.0
- kg-04 - Entity extraction is the foundation of KG ingestion at scale
- kg-06 - KG Completion automates Freebase/Wikidata growth
- kg-08 - GNNs are applied to Google-scale graph processing
- ir-12 - KG scaling mirrors search index scaling challenges
- ds-01-intro - Google KG is distributed, with the same consistency problems
- dist-14-sharding