Natural Language Processing

Information Extraction

The world's knowledge is locked in text: billions of documents, papers, contracts, and news articles that machines cannot query directly. Information extraction (IE) converts that unstructured text into structured facts that databases, search engines, and recommender systems can use. Google's Knowledge Graph, which Google reported in 2020 as holding about 500 billion facts on 5 billion entities, powers the information panels shown in search results, and it has been extended over time by IE pipelines. Every AI assistant that knows who the CEO of a company is, when a movie was released, or where a scientist was born relies on extraction pipelines that have been running continuously for over a decade.

  • **Google Knowledge Graph** held about 500 billion facts on 5 billion entities as of Google's 2020 figures, drawn from the web, Wikipedia, Wikidata, and other sources; it powers direct answers in Google Search and supplies Google Assistant's factual knowledge base.
  • **Bloomberg Terminal** uses relation extraction to identify corporate relationships - subsidiaries, board memberships, supply chain links - across 400,000+ news articles per day, surfacing connections not visible in structured databases.
  • **Drug-drug interaction detection** at pharmaceutical companies like AstraZeneca uses event extraction over PubMed's 35 million papers to identify reported adverse drug interactions, flagging potential safety signals before clinical trials.

Prerequisites

  • NER: recognizing named entities as the first step of IE
  • BERT classification and fine-tuning for a downstream task
  • Basic understanding of retrieval and the knowledge graph as a structure
  • Named Entity Recognition
  • RAG: Retrieval-Augmented Generation

From MUC to Open Information Extraction

In 1987, DARPA launched the Message Understanding Conferences (MUC), which ran through 1998. The original task was analyzing military messages, but MUC formalized information extraction as a field: it defined named entity recognition and event template filling as tasks and established precision and recall as the metrics for evaluating them. The approach relied on closed ontologies, meaning a predefined set of entity and relation types. In 2007, Michele Banko and colleagues at the University of Washington broke that constraint by introducing Open Information Extraction (the TextRunner system), which extracts arbitrary (subject, predicate, object) triples from the web with no fixed schema. This opened the path to automatically building planet-scale knowledge graphs, whose descendants include the Google Knowledge Graph and modern GraphRAG.

Relation Extraction

Relation Extraction (RE) identifies typed semantic relationships between named entities in text. Given 'Sundar Pichai is the CEO of Google', the system extracts the triple (Sundar Pichai, CEO_OF, Google). RE is the bridge between unstructured text and structured knowledge - the primary pipeline for populating knowledge graphs from corpora.

Modern RE commonly uses BERT-based classifiers: wrap the two entities in special marker tokens ([E1] Sundar Pichai [/E1] is the CEO of [E2] Google [/E2]), encode the marked sentence, then classify the relation between E1 and E2. TACRED (42 relation labels including no_relation, about 106k examples) is a standard benchmark; RoBERTa-based models reach ~75% F1.
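The marker-insertion step can be sketched in plain Python. This is a minimal illustration, assuming the sentence is already tokenized and that entity spans arrive as (start, end) token indices with an exclusive end; the function name `mark_entities` is hypothetical and not taken from any library.

```python
def mark_entities(tokens, e1_span, e2_span):
    """Wrap two non-overlapping token spans in [E1]...[/E1] and [E2]...[/E2].

    Spans are (start, end) token indices, with `end` exclusive.
    """
    opens = {e1_span[0]: "[E1]", e2_span[0]: "[E2]"}
    closes = {e1_span[1]: "[/E1]", e2_span[1]: "[/E2]"}
    out = []
    for i, tok in enumerate(tokens):
        # Close a span that ended just before this token, then open a new one.
        if i in closes:
            out.append(closes[i])
        if i in opens:
            out.append(opens[i])
        out.append(tok)
    # A span may end at the very last token of the sentence.
    if len(tokens) in closes:
        out.append(closes[len(tokens)])
    return out


tokens = "Sundar Pichai is the CEO of Google".split()
print(" ".join(mark_entities(tokens, (0, 2), (6, 7))))
# [E1] Sundar Pichai [/E1] is the CEO of [E2] Google [/E2]
```

In a real classifier the four markers would also need to be registered with the tokenizer as special tokens so that they are not split into word pieces; that step depends on the library and is left out here.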

Distant supervision is a widely used strategy for building RE training data: align a knowledge base (Freebase, Wikidata) with a text corpus and assume that any sentence mentioning two related entities expresses that relation. The assumption is noisy, because a sentence can mention both entities without stating the relation (false positives ~40%), but it enables training on millions of examples without manual annotation.
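The alignment heuristic is simple enough to sketch directly. The toy knowledge base, corpus, and the function name `distant_label` below are all made up for illustration; entity matching is plain substring search, which a real pipeline would replace with NER and entity linking.

```python
# Toy knowledge base: (subject, object) -> relation. Illustrative only.
kb = {
    ("Sundar Pichai", "Google"): "CEO_OF",
    ("Larry Page", "Google"): "FOUNDER_OF",
}

corpus = [
    "Sundar Pichai is the CEO of Google.",
    "Sundar Pichai spoke at a Google developer event.",  # mentions both, states no relation
    "Larry Page co-founded Google in 1998.",
]


def distant_label(kb, sentences):
    """Label every sentence that mentions both entities of a KB fact."""
    examples = []
    for sent in sentences:
        for (subj, obj), rel in kb.items():
            if subj in sent and obj in sent:
                examples.append((sent, subj, obj, rel))
    return examples


for sent, subj, obj, rel in distant_label(kb, corpus):
    print(f"{rel:<11} {sent}")
```

The second sentence is labeled CEO_OF even though it says nothing about who runs the company, which is exactly the kind of false positive the heuristic produces.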

What is the role of entity markers ([E1]...[/E1]) in BERT-based relation extraction?

Event Extraction

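Event extraction identifies events described in text together with their arguments. An event has a trigger (the word or phrase that expresses the event) and arguments (the entities that fill its roles). For 'Apple acquired Beats for $3B in 2014': trigger='acquired', event_type=MERGE_ACQUIRE, args={Buyer:Apple, Artifact:Beats, Price:$3B, Time:2014}. The type label here is illustrative; in ACE 2005 an acquisition like this falls under Transfer-Ownership, which defines the Buyer, Artifact, Price, and Time roles.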
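One way to hold such an event in code is a small record with a trigger, a type, and a role-to-filler mapping. The sketch below is an assumed representation, not a standard format; the class name `Event`, the `negated` flag, and the helper `to_triples` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    trigger: str                                    # word that expresses the event
    event_type: str                                 # label from the event schema
    arguments: dict = field(default_factory=dict)   # role -> filler text
    negated: bool = False                           # True for 'did not acquire'


def to_triples(event):
    """Flatten an event into (trigger, role, filler) triples."""
    return [(event.trigger, role, filler) for role, filler in event.arguments.items()]


event = Event(
    trigger="acquired",
    event_type="MERGE_ACQUIRE",
    arguments={"Buyer": "Apple", "Artifact": "Beats", "Price": "$3B", "Time": "2014"},
)
print(to_triples(event))
```

Unlike a binary relation, a single event carries any number of role fillers, which is why it does not fit into one (subject, predicate, object) triple without flattening.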

The ACE 2005 dataset, the standard benchmark, defines 33 event subtypes (grouped into 8 types) and 35 argument roles. Event extraction is harder than relation extraction for three reasons: (1) the trigger must be identified before arguments can be classified; (2) events can be negated ('Apple did not acquire Beats'); (3) multiple events can share arguments. State-of-the-art models use joint span extraction with structured prediction.

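LLM-based event extraction via structured-output prompting (for example, GPT-4 with JSON schema constraints) needs no task-specific training and generalizes zero-shot to novel event types not in any training schema - a key advantage for domain-specific applications. How competitive it is with specialized models depends on the setup: on ACE 2005, published zero-shot and few-shot LLM results have generally trailed fine-tuned extractors.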
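Whatever model produces the JSON, the output still has to be checked against the schema before it is stored. The sketch below does that check with the standard library only; no LLM is called, the response strings are hand-written stand-ins, and `SCHEMA` and `validate_event` are hypothetical names.

```python
import json

# Hypothetical schema: event type -> roles the model is allowed to fill.
SCHEMA = {"MERGE_ACQUIRE": {"Buyer", "Artifact", "Price", "Time"}}


def validate_event(raw_json, schema=SCHEMA):
    """Parse a model response and return (event, list_of_errors)."""
    try:
        event = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc}"]
    if not isinstance(event, dict):
        return None, ["top-level value must be a JSON object"]

    errors = [f"missing key: {k}" for k in ("trigger", "event_type", "args") if k not in event]
    event_type = event.get("event_type")
    args = event.get("args")
    if isinstance(event_type, str) and event_type not in schema:
        errors.append(f"unknown event_type: {event_type}")
    elif isinstance(event_type, str) and isinstance(args, dict):
        # Only check roles once the event type itself is known.
        for role in args:
            if role not in schema[event_type]:
                errors.append(f"role not allowed for {event_type}: {role}")
    return event, errors


response = '{"trigger": "acquired", "event_type": "MERGE_ACQUIRE", "args": {"Buyer": "Apple", "Artifact": "Beats"}}'
print(validate_event(response))
```

A response that fails validation can be retried or dropped; either way the schema, not the model, decides what reaches the database.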

What makes event extraction harder than relation extraction?

Knowledge Graphs

A knowledge graph (KG) stores facts as (subject, predicate, object) triples in a graph structure: nodes represent entities and edges represent typed relations. Wikidata contains more than 100 million items and over a billion statements; Google Knowledge Graph powers the information boxes in search results; Freebase (shut down in 2016, with much of its data migrated to Wikidata) served as the knowledge base for much early distant-supervision and KG-embedding work. KGs enable structured reasoning, question answering, and recommendation.

KG Construction pipelines chain NER, RE, and coreference resolution: extract entities, resolve coreferences (Barack Obama = he = the President), extract relations between resolved entities, and insert triples. Knowledge Graph Completion (link prediction) fills missing edges using embedding methods like TransE, RotatE, or graph neural networks.
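The final stages of such a pipeline, resolving mentions and inserting triples, can be sketched with a set of tuples as the graph. The coreference map and the raw relations below are hand-written stand-ins for what NER, coreference, and RE models would output; `build_graph` and `objects` are hypothetical helper names.

```python
# Stand-in for a coreference resolver: mention -> canonical entity.
coref = {
    "he": "Barack Obama",
    "the President": "Barack Obama",
}

# Stand-in for RE output, still expressed over raw mentions.
raw_relations = [
    ("Barack Obama", "BORN_IN", "Honolulu"),
    ("he", "BORN_IN", "Honolulu"),
    ("the President", "SPOUSE_OF", "Michelle Obama"),
]


def build_graph(raw_relations, coref):
    """Map mentions to canonical entities and insert triples into a set."""
    graph = set()
    for subj, pred, obj in raw_relations:
        graph.add((coref.get(subj, subj), pred, coref.get(obj, obj)))
    return graph


def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return sorted(o for s, p, o in graph if s == subject and p == predicate)


graph = build_graph(raw_relations, coref)
print(len(graph))                                   # 2: the two BORN_IN mentions collapse
print(objects(graph, "Barack Obama", "SPOUSE_OF"))  # ['Michelle Obama']
```

Without the coreference step the graph would contain three disconnected facts about "he", "the President", and "Barack Obama" instead of two facts about one node.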

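KG embeddings (TransE, RotatE, ComplEx) learn entity and relation vectors whose scores separate true triples from false ones. TransE uses the simplest geometry: subject + relation should approximate object in embedding space. RotatE instead models a relation as a rotation in complex space, and ComplEx uses a bilinear score over complex-valued vectors. Reported TransE results on FB15k-237 are roughly 0.29 to 0.33 MRR; the often-quoted 0.463 is its MRR on the older FB15k benchmark, which is easier because many test triples are inverses of training triples. Because these models need only a few hundred dimensions per entity, link prediction remains feasible at million-node scale.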
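The translation idea behind TransE and the MRR metric can both be shown in a few lines. The two-dimensional vectors below are hand-picked so that the arithmetic is visible, not learned from data, and the function names are hypothetical.

```python
import math


def transe_score(h, r, t):
    """Negative L2 distance between h + r and t; higher means more plausible."""
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


# Hand-picked toy embeddings (a real model would learn these).
entities = {
    "Paris": [0.9, 0.1],
    "France": [1.0, 1.0],
    "Berlin": [0.1, 0.2],
    "Germany": [0.2, 1.1],
}
relations = {"CAPITAL_OF": [0.1, 0.9]}


def rank_tails(head, rel):
    """Rank every entity as a candidate object for (head, rel, ?)."""
    h, r = entities[head], relations[rel]
    return sorted(entities, key=lambda e: transe_score(h, r, entities[e]), reverse=True)


def mrr(ranks):
    """Mean reciprocal rank of the correct answers (ranks are 1-based)."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


queries = [("Paris", "CAPITAL_OF", "France"), ("Berlin", "CAPITAL_OF", "Germany")]
ranks = [rank_tails(h, r).index(t) + 1 for h, r, t in queries]
print(ranks, mrr(ranks))
```

Link prediction is this ranking applied to triples that are not yet in the graph: the top-scoring candidates for (head, relation, ?) become proposed new edges.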

What is Knowledge Graph Completion, and why is it necessary?

Triplet Extraction at Scale

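End-to-end triplet extraction produces (subject, relation, object) triples from text in a single model pass. REBEL (Huguet Cabot & Navigli, 2021) fine-tunes BART to generate triples in a linearized text format, pretraining on a distantly supervised dataset that covers 220 Wikidata relation types. Scores depend heavily on the benchmark and setting: on DocRED, a document-level RE benchmark requiring cross-sentence relation inference, figures around 0.65 F1 are typical of classifiers that are given gold entities, while end-to-end generation of the kind REBEL performs scores lower (under 0.5 F1 in the paper's reported results).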
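A linearized format means the decoder emits triples as a flat string that must be parsed back into tuples. The sketch below assumes the layout `<triplet> head <subj> tail <obj> relation`, with further `<subj> tail <obj> relation` groups reusing the same head; this is a simplified reading of the REBEL output format, and `parse_linearized` is a hypothetical helper, not the project's own decoding code.

```python
def parse_linearized(text):
    """Parse '<triplet> head <subj> tail <obj> relation ...' into triples."""
    triples = []
    # Each '<triplet>' chunk holds one head entity and one or more tail/relation pairs.
    for chunk in text.split("<triplet>")[1:]:
        head, *pairs = chunk.split("<subj>")
        for pair in pairs:
            if "<obj>" not in pair:
                continue  # skip malformed fragments instead of guessing
            tail, relation = pair.split("<obj>", 1)
            triple = (head.strip(), relation.strip(), tail.strip())
            if all(triple):
                triples.append(triple)
    return triples


# Hand-written example string; the relation labels are illustrative.
generated = "<triplet> Sundar Pichai <subj> Google <obj> employer <subj> India <obj> country of birth"
print(parse_linearized(generated))
```

Sharing one head across several pairs keeps the generated sequence short, which matters because decoding cost grows with output length.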

Production triplet extraction pipelines at scale face three challenges: (1) entity disambiguation - 'Apple' (company vs. fruit) requires context; (2) relation canonicalization - 'founded', 'created', 'established' all map to the same Wikidata property; (3) confidence thresholding - noisy extractions degrade downstream QA and recommendations.
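Challenges (2) and (3) can be illustrated together: map surface predicates to one canonical relation, drop low-confidence extractions, and merge duplicates. The mapping table, the canonical label FOUNDED, the 0.8 threshold, and the function name `canonicalize` are all assumptions made for this sketch; a real system would map to actual Wikidata property IDs and tune the threshold on held-out data.

```python
# Hypothetical mapping from surface predicates to one canonical relation.
CANONICAL = {"founded": "FOUNDED", "created": "FOUNDED", "established": "FOUNDED"}


def canonicalize(extractions, mapping=CANONICAL, min_conf=0.8):
    """Keep confident extractions, rewritten to canonical relations.

    `extractions` holds (subject, surface_predicate, object, confidence) tuples.
    Returns {canonical_triple: best_confidence}, so duplicates collapse.
    """
    kept = {}
    for subj, surface, obj, conf in extractions:
        rel = mapping.get(surface.lower())
        if rel is None or conf < min_conf:
            continue  # unknown predicate or too uncertain to store
        key = (subj, rel, obj)
        kept[key] = max(conf, kept.get(key, 0.0))
    return kept


extractions = [
    ("Steve Jobs", "founded", "Apple Inc.", 0.95),
    ("Steve Jobs", "established", "Apple Inc.", 0.85),  # same fact, different wording
    ("Steve Jobs", "created", "Apple Inc.", 0.40),      # below threshold
    ("Steve Jobs", "ate", "apple", 0.99),               # predicate not in the mapping
]
print(canonicalize(extractions))
# {('Steve Jobs', 'FOUNDED', 'Apple Inc.'): 0.95}
```

Without canonicalization the first two extractions would be stored as two unrelated edges, and a query for founders would see only one of them.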

Google's original Knowledge Graph (2012) was bootstrapped from curated sources such as Freebase (18M entities), WordNet, and Wikipedia infoboxes, not from general RE models. Today, automated IE from web text adds millions of new triples per month, with human review reserved for high-impact additions and uncertain extractions.

Misconception: information extraction is a solved problem now that LLMs can extract triplets from any text.

Reality: LLM-based IE is highly flexible but still struggles with entity disambiguation, cross-sentence relations, negation ('did not acquire'), and confidence calibration at the scale of millions of documents.

LLMs also hallucinate plausible-sounding triplets not supported by the source text, and their throughput and cost make billion-document processing impractical without smaller specialized models.
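A cheap first defense against hallucinated triplets is to require that both entities literally appear in the source text. The sketch below is a deliberately weak filter under that assumption: it catches invented entities but says nothing about whether the stated relation is supported. The name `is_grounded` is hypothetical.

```python
def is_grounded(triple, source_text):
    """True if both the subject and the object occur verbatim in the source."""
    subj, _relation, obj = triple
    text = source_text.lower()
    return subj.lower() in text and obj.lower() in text


source = "Apple acquired Beats for $3B in 2014."
candidates = [
    ("Apple", "ACQUIRED", "Beats"),    # both entities present
    ("Apple", "ACQUIRED", "Spotify"),  # 'Spotify' never appears in the source
]
for triple in candidates:
    print(triple, is_grounded(triple, source))
```

Checking the relation itself needs something stronger, such as a separate verification model or a human spot check on a sample.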

What is relation canonicalization and why is it essential for triplet extraction at scale?

Key Ideas

  • **Relation extraction** identifies typed (subject, predicate, object) triples from text; RoBERTa-based classifiers with entity markers reach ~75% F1 on the human-annotated TACRED benchmark, while distant supervision provides noisy but scalable training data when manual labels are unavailable.
  • **Event extraction** is harder than RE: triggers must be identified, events can be negated, and multiple events can share arguments - joint models achieve ~55% argument extraction F1 on ACE 2005.
  • **KG construction** chains NER + RE + coreference resolution; KG completion fills missing edges using TransE-style embeddings, making the graph useful even when extraction is incomplete.

Related Topics

IE is the structured foundation for search, QA, and recommendation:

  • Named Entity Recognition — NER is the prerequisite step of IE - entities must be identified before relations between them can be extracted
  • Question Answering — KGs built by IE pipelines enable structured QA (SPARQL over Wikidata) and augment neural open-domain QA with factual grounding

Questions for Reflection

  • How would a financial IE system handle the sentence 'After failing to acquire TikTok, Microsoft shifted its AI strategy' - what events and relations need to be extracted, and what are the main challenges?
  • Distant supervision introduces ~40% false positives in RE training data. What strategies could reduce label noise without requiring full manual annotation?
  • When would a KG-based QA system outperform a purely RAG-based LLM answer, and when would RAG win?

Related Lessons

  • nlp-08 — NER is the first stage of extraction pipelines
  • nlp-18 — Structured triplets answer factual questions
  • nlp-12 — BERT relation classifiers extract entity links
  • aie-41-knowledge-graphs — Extracted triplets populate knowledge graphs
  • rec-08 — Knowledge graphs power graph-based recommendations
  • stat-08-correlation