Knowledge Graphs

Knowledge Extraction: from text to graph

Google's Knowledge Graph holds 500 billion facts about 5 billion entities. At that scale, facts cannot all be typed in by hand. The extraction pipeline behind it (NER, relation extraction, coreference resolution, entity linking) is the same kind that Diffbot and most enterprise KGs run today. Get any stage wrong and the whole graph rots: a missed coreference splits one person into three nodes, and a wrong entity link merges two distinct companies into one.

  • **Google Knowledge Graph** powers entity panels in search results, voice assistants, and Maps - extracted from Wikipedia, Wikidata, and structured web data via industrial NER and entity linking pipelines
  • **Diffbot** runs continuous web-scale extraction over billions of pages, building a commercial KG with 10B+ entities sold to enterprises that need fresh structured facts
  • **Bloomberg** uses NER + relation extraction on financial news in real time to feed entity-aware search, sentiment, and event detection for trading floors

Historical context

In May 2012, Google's SVP of Search Amit Singhal announced the Knowledge Graph with the tagline "things, not strings." The system, built partly on Freebase (acquired with Metaweb in 2010), connected 500 million entities and 3.5 billion facts drawn from structured and unstructured sources. Building it required industrial-scale NER, relation extraction, and entity linking pipelines. Wikidata launched the same year as an open alternative, and Freebase was shut down in 2016 after its data was migrated to Wikidata. Today, KGs underpin Siri, Alexa, Google Assistant, and most enterprise search systems.

NER: Named Entity Recognition

**NER (Named Entity Recognition)** identifies mentions of named entities in text and classifies them by type (person, organization, location, etc.). Modern NER typically fine-tunes a pretrained encoder such as BERT as a token classifier over the **BIO tagging scheme**: B-TYPE marks the first token of an entity, I-TYPE marks continuation tokens, and O marks non-entity tokens. Fine-tuned BERT models achieve F1 > 90% on standard benchmarks such as CoNLL-2003.
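To make the tagging scheme concrete, here is a minimal sketch of how a BIO tag sequence can be decoded back into entity spans. The function name `bio_to_spans` and the example sentence are illustrative assumptions, not part of any particular library; stray I- tags are handled leniently, which is one of several possible conventions.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Convert BIO tags into (start, end_exclusive, type) spans."""
    spans = []
    start, current_type = None, None
    for i, tag in enumerate(tags):
        # An I- tag continues the open span only if the type matches.
        if tag.startswith("I-") and current_type == tag[2:]:
            continue
        # Any other tag closes the open span, if there is one.
        if current_type is not None:
            spans.append((start, i, current_type))
            start, current_type = None, None
        # B- opens a new span; a stray I- is treated leniently as B-.
        if tag.startswith(("B-", "I-")):
            start, current_type = i, tag[2:]
    if current_type is not None:
        spans.append((start, len(tags), current_type))
    return spans


tokens = ["Macron", "visited", "Paris", "France", "yesterday"]
tags = ["B-PER", "O", "B-LOC", "B-LOC", "O"]
for start, end, label in bio_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Because 'Paris' and 'France' each carry their own B-LOC tag, they decode as two separate locations. Under a plain IO scheme both tokens would be tagged I-LOC and would be indistinguishable from a single two-token entity.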

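**Subword tokenization challenge**: BERT splits words into subword tokens (for example, 'Washington' -> ['Wash', '##ington']), but NER labels are defined per word. A common convention labels only the first subword of each word and masks continuation subwords out of the loss; an alternative copies the word's label onto its continuations (with B- turned into I-). At inference, subword predictions must be mapped back to the original text through the tokenizer's offset mapping to recover correct entity spans.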
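The alignment step can be sketched as follows, assuming the first-subword convention and a per-token list of word indices (the kind of list a fast Hugging Face tokenizer exposes through `word_ids()`). The helper name `align_labels`, the label IDs, and the hand-written `word_ids` list are assumptions made for illustration.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips this label value by default


def align_labels(word_labels: List[int], word_ids: List[Optional[int]]) -> List[int]:
    """Expand word-level labels to subword tokens.

    Only the first subword of each word keeps the label; special tokens and
    continuation subwords receive IGNORE_INDEX so they do not affect the loss.
    """
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:              # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif wid != previous:        # first subword of a new word
            aligned.append(word_labels[wid])
        else:                        # continuation subword of the same word
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned


# Words: ["Washington", "visited", "Paris"]
# Subwords (illustrative): [CLS] Wash ##ington visited Paris [SEP]
word_ids = [None, 0, 0, 1, 2, None]
word_labels = [1, 0, 2]  # hypothetical IDs: 0 = O, 1 = B-PER, 2 = B-LOC
print(align_labels(word_labels, word_ids))
```

Masking continuations keeps the evaluation at word level, so a word split into five subwords does not count five times.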

Why is the BIO tagging scheme preferred over simple IO tagging for NER?

Relation Extraction: connections between entities

**Relation Extraction (RE)** identifies semantic relationships between entity pairs in text. The classical pipeline (1) runs NER, then (2) classifies the relation for each entity pair. **Joint NER+RE models** like REBEL solve both tasks at once as sequence-to-sequence generation, producing triples directly from raw text and avoiding the error propagation that occurs when NER mistakes cascade into RE.
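Step 2 of the classical pipeline can be sketched as candidate generation: every ordered pair of NER spans becomes one classifier input. The marker tokens `[E1]`/`[E2]` and the function names are assumptions for illustration; the relation classifier itself is left out, and spans are assumed not to overlap.

```python
from itertools import permutations
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start, end_exclusive, type), as produced by an NER step


def mark_pair(tokens: List[str], head: Span, tail: Span) -> str:
    """Wrap the head and tail spans in marker tokens for a relation classifier."""
    out = []
    for i, tok in enumerate(tokens):
        if i == head[0]:
            out.append("[E1]")
        if i == tail[0]:
            out.append("[E2]")
        out.append(tok)
        if i == head[1] - 1:
            out.append("[/E1]")
        if i == tail[1] - 1:
            out.append("[/E2]")
    return " ".join(out)


def candidate_inputs(tokens: List[str], entities: List[Span]) -> List[str]:
    """One classifier input per ordered entity pair: n entities give n * (n - 1) inputs."""
    return [mark_pair(tokens, h, t) for h, t in permutations(entities, 2)]


tokens = ["Elon", "Musk", "founded", "SpaceX"]
entities = [(0, 2, "PER"), (3, 4, "ORG")]
for text in candidate_inputs(tokens, entities):
    print(text)
```

The sketch also shows where errors cascade: if NER misses 'SpaceX', no candidate pair containing it is ever generated, so the relation classifier cannot recover the fact.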

**Distant supervision** scales relation extraction training data without manual annotation: align a KG with text, assume any sentence mentioning two entities that have a relation in the KG is a training example for that relation. This produces noisy labels (some sentences mention entities without expressing the relation), but the scale (millions of examples) compensates. REBEL was trained with distant supervision on Wikipedia + Wikidata.
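The labeling assumption can be sketched in a few lines. This is a deliberately naive version that matches entity names as substrings; the relation name `founder_of`, the function name, and the example sentences are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def distant_labels(sentences: List[str], kg: Dict[Pair, str]) -> List[Tuple[str, str, str, str]]:
    """Label every sentence that mentions both entities of a known KG pair."""
    examples = []
    for sent in sentences:
        for (head, tail), relation in kg.items():
            # Naive substring matching; a real pipeline would match linked entity mentions.
            if head in sent and tail in sent:
                examples.append((sent, head, tail, relation))
    return examples


kg = {("Elon Musk", "SpaceX"): "founder_of"}
sentences = [
    "Elon Musk founded SpaceX.",             # expresses the relation
    "Elon Musk toured the SpaceX factory.",  # mentions both entities, but is a noisy positive
    "SpaceX launched a rocket.",             # only one entity, so it is skipped
]
for example in distant_labels(sentences, kg):
    print(example)
```

The second sentence is the kind of noise the paragraph describes: both entities co-occur, so it is labeled `founder_of` even though the sentence says nothing about founding.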

Why are joint NER+RE models (like REBEL) preferred over a pipeline of separate models?

Coreference Resolution: one entity, many mentions

**Coreference resolution** links all textual mentions that refer to the same real-world entity. In 'Elon Musk founded SpaceX. He wanted to make humanity multiplanetary. The company launched its first rocket in 2006,' the mentions 'Elon Musk' and 'He' are coreferential, as are 'SpaceX', 'The company', and 'its'. Without resolving these chains, a KG built from this text would contain separate nodes for each mention.

**Hobbs algorithm vs neural**: the Hobbs algorithm (1978) resolves pronouns with syntactic heuristics: it walks the parse trees of the current and preceding sentences in a fixed order and accepts the first noun phrase that agrees with the pronoun in number and gender. Rule-based resolvers of this kind are still used in production for their speed. Neural models (SpanBERT coref, c2f-coref) achieve significantly higher F1 on benchmarks but are roughly 10-100x slower. High-throughput KG pipelines often resolve easy pronouns with rules and reserve neural models for ambiguous cases.
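As a rough illustration of the rule-based side, the sketch below picks the nearest preceding mention that agrees with the pronoun in number and gender. It is a simplification and not the Hobbs algorithm itself, which searches parse trees rather than a flat list of mentions. The `Mention` type, the pronoun table, and the hand-assigned features are assumptions.

```python
from typing import List, NamedTuple, Optional


class Mention(NamedTuple):
    text: str
    position: int  # token index where the mention starts
    number: str    # "sg" or "pl"
    gender: str    # "m", "f", or "n"


# Pronoun -> (number, gender); a gender of None matches any gender.
PRONOUNS = {"he": ("sg", "m"), "she": ("sg", "f"), "it": ("sg", "n"), "they": ("pl", None)}


def resolve_pronoun(pronoun: str, position: int, mentions: List[Mention]) -> Optional[Mention]:
    """Return the closest preceding mention that agrees in number and gender."""
    features = PRONOUNS.get(pronoun.lower())
    if features is None:
        return None
    number, gender = features
    preceding = [m for m in mentions if m.position < position]
    for m in sorted(preceding, key=lambda m: m.position, reverse=True):
        if m.number == number and (gender is None or m.gender == gender):
            return m
    return None  # no agreeing antecedent: a candidate for the slower neural model


# Tokens: Elon(0) Musk(1) founded(2) SpaceX(3) .(4) He(5) ...
mentions = [Mention("Elon Musk", 0, "sg", "m"), Mention("SpaceX", 3, "sg", "n")]
print(resolve_pronoun("He", 5, mentions))
```

Here 'SpaceX' is nearer to 'He' but fails the gender check, so the resolver falls back to 'Elon Musk'. Returning `None` gives a natural hook for routing unresolved cases to a neural model.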

Why is coreference resolution necessary for building a high-quality Knowledge Graph?

Entity Linking: connecting to external KGs

**Entity Linking (EL)** maps entity mentions detected by NER to canonical entries in a reference knowledge base (Wikidata, DBpedia, domain KG). Three steps: (1) **candidate generation** - retrieve plausible KG entities for a mention, (2) **disambiguation** - select the correct candidate using context, (3) **NIL detection** - determine if no KG entry exists. EL enables merging facts from multiple documents about the same real-world entity.
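The three steps can be sketched end to end on a toy knowledge base. Everything here is hypothetical: the entity IDs, the alias table, and the word-overlap score stand in for a real KB, a real candidate index, and a learned disambiguation model.

```python
from typing import List, Optional, Set

# Hypothetical mini knowledge base; IDs and context words are invented for illustration.
KB = {
    "E1": {"name": "Apple Inc.", "aliases": {"apple", "apple inc."},
           "context": {"iphone", "company", "technology", "cupertino"}},
    "E2": {"name": "apple (fruit)", "aliases": {"apple"},
           "context": {"fruit", "tree", "orchard", "eat"}},
}


def generate_candidates(mention: str) -> List[str]:
    """Step 1: look the mention up in an alias table."""
    key = mention.lower()
    return [eid for eid, entry in KB.items() if key in entry["aliases"]]


def link(mention: str, context_words: Set[str], nil_threshold: int = 1) -> Optional[str]:
    """Steps 2 and 3: score candidates by context overlap, return None for NIL."""
    best_id, best_score = None, 0
    for eid in generate_candidates(mention):
        score = len(KB[eid]["context"] & context_words)
        if score > best_score:
            best_id, best_score = eid, score
    # NIL detection: no candidate, or too little evidence for the best one.
    return best_id if best_score >= nil_threshold else None


print(link("Apple", {"the", "company", "released", "a", "new", "iphone"}))
print(link("Apple", {"an", "orchard", "full", "of", "fruit"}))
print(link("Acme Widget", {"a", "new", "product"}))
```

The same surface form 'Apple' resolves to different entries depending on context, and the unknown mention returns `None` instead of being forced onto the closest existing entry.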

**Production EL with dense retrieval**: dense-retrieval EL systems such as BLINK use a two-tower bi-encoder to embed (mention + context) and (entity description) separately, then retrieve candidates via ANN search over the full KG (on the order of 100M entities for Wikidata). A cross-encoder re-ranker then scores the shortlisted candidates (for example, the top 100). Generative linkers such as mGENRE take a different route and decode the entity name token by token. The retrieve-then-rerank combination achieves F1 > 85% on AIDA-CoNLL while scaling to hundreds of thousands of documents per day.
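The two-stage design can be sketched with toy numbers. The vectors below are hand-written stand-ins for bi-encoder outputs, the brute-force search stands in for an ANN index, and the `cross_scores` table stands in for a cross-encoder; all names and values are assumptions.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(mention_vec: Sequence[float], entity_vecs: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Stage 1: cheap similarity search over all entities (brute force here, ANN in practice)."""
    ranked = sorted(entity_vecs, key=lambda eid: cosine(mention_vec, entity_vecs[eid]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], score_fn: Callable[[str], float]) -> List[str]:
    """Stage 2: rescore only the shortlist with a costlier function."""
    return sorted(candidates, key=score_fn, reverse=True)


mention_vec = [1.0, 0.2]
entity_vecs = {"A": [1.0, 0.0], "B": [0.8, 0.6], "C": [0.0, 1.0]}
cross_scores = {"A": 0.3, "B": 0.9, "C": 0.99}  # hypothetical cross-encoder scores

shortlist = retrieve(mention_vec, entity_vecs, k=2)
print(shortlist)
print(rerank(shortlist, cross_scores.get))
```

Entity 'C' would have scored highest in stage 2 but never reaches it, which illustrates the main design constraint: the re-ranker can only reorder what the retriever returns, so recall at stage 1 bounds final accuracy.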

Why is entity linking necessary for merging a Knowledge Graph from multiple sources?

Key Takeaways

  • **NER** identifies entity mentions and types via BIO-tagged BERT fine-tuning - F1 > 90% on standard benchmarks, with subword alignment as the main implementation pitfall
  • **Relation Extraction** classifies semantic links between entity pairs; joint NER+RE models like REBEL avoid the cascading error problem of pipelined approaches
  • **Coreference Resolution** merges all mentions of the same entity ('Musk', 'he', 'the founder') into a single chain - without it, every pronoun creates a duplicate KG node
  • **Entity Linking** maps mentions to canonical Wikidata QIDs through candidate generation, contextual disambiguation, and NIL detection - the step that lets multiple sources merge into one graph

Related Topics

Topics that build on or extend Knowledge Extraction:

  • kg-03 — Knowledge graph schema and RDF/OWL fundamentals are required before extraction
  • cv-04 — BERT-based NER and ResNet share the same transfer learning pattern: pretrain on large data, fine-tune on task-specific labels
  • rec-04 — Entity embeddings from KG extraction enrich recommendation models as side information for cold-start items

Questions for Reflection

  • REBEL trains on distant supervision: any sentence mentioning two entities with a known KG relation becomes a positive example. Where does this assumption break down, and how do practitioners filter the noise?
  • Coreference resolution costs 10-100x more compute than NER on long documents. For a real-time news pipeline ingesting 100K articles per hour, what hybrid strategies trade quality for throughput?
  • Wikidata holds ~100M entities. A new product mention in a news article often refers to something with no QID. How should the entity linker handle NIL cases without polluting the KG with duplicate or speculative nodes?

Related Lessons

  • bt-04-dns-tls — Entity linking to Wikidata mirrors DNS resolution: a local mention resolves to a canonical global identifier
  • kg-05 — KG completion and embedding methods (TransE, RotatE) operate on the extracted graphs built here
  • nlp-04
  • nlp-05
  • ir-01