Natural Language Processing
Named Entity Recognition
Every time a user types a query into a search engine, financial terminal, or voice assistant, the system must identify what real-world entities are being mentioned before it can answer. 'What did Apple pay for Beats?' requires recognizing Apple as an organization and Beats as another - not a fruit and a rhythm. NER is the first extraction layer in knowledge graph construction, question answering, and information retrieval. Bloomberg's financial NER processes 400,000 articles daily to feed trading algorithms; mistakes cost money in milliseconds.
- **Bloomberg Intelligence** uses NER to extract company mentions, financial figures, and dates from 400,000+ news articles daily, feeding signals into algorithmic trading systems with sub-10ms latency requirements.
- **Google Search Knowledge Graph** applies NER at query time to link entity mentions to Knowledge Graph nodes, enabling direct answers ('population of France') instead of just document links - covering 70+ entity types across 100+ languages.
- **Roche/Genentech** deploys biomedical NER (using spaCy with custom models) to extract drug names, gene symbols, and disease mentions from clinical trial documents, accelerating literature review from weeks to hours.
Prerequisites
- Tokenization and the basic NLP pipeline
- Token-level classification and the idea of sequence labeling
- Word embeddings as input features for models
From MUC and CoNLL to BiLSTM-CRF
Named entity recognition took shape as a task at the Message Understanding Conferences (MUC) in the 1990s, which introduced standard categories such as person, organization, and location. The CoNLL-2003 shared task then set the benchmark dataset and metrics that the field measured progress against for years. In 2001 John Lafferty, Andrew McCallum, and Fernando Pereira introduced conditional random fields (CRFs): the model labeled an entire token sequence at once, accounting for dependencies between neighboring labels, and became the standard for NER. In 2015 Zhiheng Huang and colleagues combined a bidirectional LSTM with a CRF layer: the BiLSTM read context from both sides while the CRF kept the label sequence consistent. That architecture set the template for neural NER right up until transformers arrived.
NER as Sequence Tagging
Named Entity Recognition is formulated as a sequence labeling problem: given a token sequence, assign each token a label from a tag set. Common entity categories include PERSON, ORG, GPE (geo-political entity), DATE, and MONEY in OntoNotes-style schemes, with a catch-all MISC in CoNLL-2003; the exact inventory depends on the dataset. The challenge is twofold: boundary detection (knowing where an entity starts and ends) and type disambiguation ('Apple' can be ORG or PRODUCT depending on context).
NER systems treat the problem as structured prediction rather than independent per-token classification because adjacent labels are highly correlated. 'New' labeled B-GPE almost guarantees 'York' is I-GPE. Models that ignore this structure make systematic boundary errors.
The BIO scheme (Begin, Inside, Outside) is the most common label encoding. An alternative, BIOES (Begin, Inside, Outside, End, Single), adds explicit End and Single-token tags, giving the model stronger signals for entity boundaries; it is often reported to improve F1 by roughly 0.5-1 point on CoNLL-2003.
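To make the difference between the two encodings concrete, here is a minimal sketch that converts a BIO sequence to BIOES. The function name `bio_to_bioes` and the example sentence are illustrative assumptions, and the input is assumed to be a well-formed BIO sequence.

```python
def bio_to_bioes(tags):
    """Convert a well-formed BIO tag sequence to BIOES (illustrative helper)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same type.
        continues = nxt == f"I-{etype}"
        if prefix == "B":
            out.append(f"B-{etype}" if continues else f"S-{etype}")
        else:  # prefix == "I"
            out.append(f"I-{etype}" if continues else f"E-{etype}")
    return out


tokens = ["Apple", "bought", "Beats", "Electronics", "in", "2014"]
bio = ["B-ORG", "O", "B-ORG", "I-ORG", "O", "B-DATE"]
bioes = bio_to_bioes(bio)
print(list(zip(tokens, bioes)))
# Single-token entities become S-, and the last token of a longer entity becomes E-:
# ['S-ORG', 'O', 'B-ORG', 'E-ORG', 'O', 'S-DATE']
```

The conversion is lossless in both directions, so a corpus annotated in BIO can be re-encoded without re-annotation.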
Why is NER formulated as structured sequence labeling rather than independent per-token classification?
Conditional Random Fields
A Conditional Random Field (CRF) layer on top of a sequence encoder scores entire label sequences rather than individual labels. The CRF learns a transition matrix T where T[i][j] is the score of transitioning from label i to label j. At inference time, Viterbi decoding finds the globally optimal label sequence in O(n * L^2) time, where L is the number of labels.
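As a rough illustration of the decoding step, the sketch below runs Viterbi over hand-picked scores in plain Python. The three-label tag set, every score value, and the function name `viterbi` are assumptions invented for this example; start and stop transitions are omitted for brevity.

```python
def viterbi(emissions, transitions):
    """Return (best_score, best_path) for a linear-chain model.

    emissions[t][j]   : score of label j at position t
    transitions[i][j] : score of moving from label i to label j
    """
    n, L = len(emissions), len(transitions)
    score = list(emissions[0])  # best score of any path ending in each label
    backptrs = []
    for t in range(1, n):
        new_score, ptr = [], []
        for j in range(L):
            # Best previous label for landing on label j at position t.
            best_i = max(range(L), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backptrs.append(ptr)
    # Follow back-pointers from the best final label.
    last = max(range(L), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(backptrs):
        path.append(ptr[path[-1]])
    path.reverse()
    return score[last], path


labels = ["O", "B-PER", "I-PER"]
# Toy transition scores: O -> I-PER is heavily penalized.
transitions = [
    [0.5, 0.0, -10.0],  # from O
    [0.0, -1.0, 1.0],   # from B-PER
    [0.0, -1.0, 0.5],   # from I-PER
]
# Toy emission scores for the tokens "met", "Barack", "Obama".
emissions = [
    [2.0, 0.0, 0.0],
    [0.5, 1.0, 1.2],
    [0.2, 0.3, 1.5],
]

greedy = [max(range(len(labels)), key=lambda j: e[j]) for e in emissions]
best_score, best_path = viterbi(emissions, transitions)
print("greedy :", [labels[j] for j in greedy])      # O, I-PER, I-PER (invalid start)
print("viterbi:", [labels[j] for j in best_path])   # O, B-PER, I-PER
```

In this toy case the per-token argmax prefers I-PER for "Barack", while the joint search pays a small emission cost to avoid the penalized O to I-PER transition. The two nested loops over labels inside the loop over positions are where the O(n * L^2) cost comes from.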
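BiLSTM-CRF (Huang et al., 2015; Lample et al., 2016) became the standard NER architecture: a bidirectional LSTM produces per-token emission scores, and the CRF layer adds transition scores. The model is trained end-to-end by minimizing the negative log-likelihood of the correct label sequence; the normalizing term, a sum over all possible label sequences, is computed efficiently with the forward algorithm.
The CRF transition matrix learns that I-PER cannot follow B-ORG even though nobody writes that rule down: the transition never occurs in the training data, so its score is pushed strongly negative. Because the constraint emerges from data statistics, it adapts to whatever tag scheme the corpus uses, which is more robust than maintaining hard-coded constraints.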
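The training loss can be sketched in a few lines of plain Python. This is an illustrative toy, not a training implementation: the scores are made up, the function names (`log_partition`, `crf_nll`, and so on) are assumptions, start and stop transitions are left out, and a real model would compute the same quantities on tensors with automatic differentiation.

```python
import itertools
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x))) over a list of floats."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def sequence_score(emissions, transitions, path):
    """Unnormalized score of one label path: emissions plus transitions."""
    s = emissions[0][path[0]]
    for t in range(1, len(path)):
        s += transitions[path[t - 1]][path[t]] + emissions[t][path[t]]
    return s


def log_partition(emissions, transitions):
    """Forward algorithm: log of the summed exp-scores of all label paths."""
    L = len(transitions)
    alpha = list(emissions[0])
    for t in range(1, len(emissions)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(L)]) + emissions[t][j]
            for j in range(L)
        ]
    return logsumexp(alpha)


def crf_nll(emissions, transitions, gold_path):
    """Negative log-likelihood of the gold path under the CRF."""
    return log_partition(emissions, transitions) - sequence_score(emissions, transitions, gold_path)


# Toy scores for labels [O, B-PER, I-PER] over three tokens (values are made up).
transitions = [[0.5, 0.0, -10.0], [0.0, -1.0, 1.0], [0.0, -1.0, 0.5]]
emissions = [[2.0, 0.0, 0.0], [0.5, 1.0, 1.2], [0.2, 0.3, 1.5]]
gold_path = [0, 1, 2]  # O, B-PER, I-PER

log_z = log_partition(emissions, transitions)
# Sanity check: enumerate all 3^3 paths explicitly (only feasible for toy sizes).
brute = logsumexp([
    sequence_score(emissions, transitions, p)
    for p in itertools.product(range(3), repeat=3)
])
nll = crf_nll(emissions, transitions, gold_path)
print(f"log Z (forward) = {log_z:.6f}, log Z (brute force) = {brute:.6f}, NLL = {nll:.6f}")
```

The brute-force sum grows as L^n, while the forward recursion reaches the same value in O(n * L^2), which is what makes exact likelihood training practical.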
What does the CRF transition matrix learn during NER training?
BIO Tagging Schemes
The BIO encoding is the foundation of NER evaluation. B- (Begin) marks the first token of an entity, I- (Inside) marks continuation tokens, and O marks non-entity tokens. BIO allows the model to distinguish two consecutive same-type entities: 'Barack Obama John McCain' gets B-PER I-PER B-PER I-PER, not four I-PER tags.
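A small sketch of the encoding, assuming entities are given as non-overlapping token spans with exclusive end indices (the helper name `spans_to_bio` is made up for illustration):

```python
def spans_to_bio(n_tokens, spans):
    """Encode (start, end_exclusive, type) token spans as BIO tags.

    Assumes spans do not overlap; nested spans cannot be represented.
    """
    tags = ["O"] * n_tokens
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"
    return tags


tokens = ["Barack", "Obama", "John", "McCain"]
spans = [(0, 2, "PER"), (2, 4, "PER")]
tags = spans_to_bio(len(tokens), spans)
print(list(zip(tokens, tags)))
# The second B-PER is what separates the two adjacent people:
# ['B-PER', 'I-PER', 'B-PER', 'I-PER']
```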
NER evaluation uses span-level F1: a prediction is correct only if both the span boundaries and the entity type match exactly. Token-level accuracy is misleading because the O class dominates (typically 70-80% of tokens) and inflates scores. On the standard CoNLL-2003 English benchmark (Reuters newswire), a BERT-large tagger with a CRF layer reaches ~93% F1.
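The gap between the two metrics shows up even on a single sentence. The sketch below is a simplified scorer written for illustration, not a replacement for an established evaluation script: the helper names are assumptions, and stray I- tags with no preceding B- are simply ignored, a choice that standard scripts may handle differently.

```python
def bio_to_spans(tags):
    """Decode BIO tags into a set of (start, end_exclusive, type) spans."""
    spans, start, etype = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing entity
        # Close the open entity unless this tag continues it.
        if start is not None and tag != f"I-{etype}":
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def span_f1(gold_tags, pred_tags):
    """Exact-match span precision, recall, and F1."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)  # boundaries and type must both match
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


# "Bank of England raised rates in London"
gold = ["B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "B-GPE"]
pred = ["B-ORG", "I-ORG", "O", "O", "O", "O", "B-GPE"]  # ORG span cut one token short

token_acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
precision, recall, f1 = span_f1(gold, pred)
print(f"token accuracy = {token_acc:.3f}")  # 6 of 7 tokens agree
print(f"span P/R/F1    = {precision:.2f} / {recall:.2f} / {f1:.2f}")  # 0.50 / 0.50 / 0.50
```

One wrong token costs a single point of token accuracy but turns the whole ORG span into both a false positive and a false negative at span level.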
Nested entities (e.g., 'Bank of England' containing 'England') cannot be handled by flat BIO schemes. Nested NER requires span-based approaches or multi-layer tagging, and is an active research area for biomedical and legal NLP.
Why does NER evaluation use span-level F1 rather than token-level accuracy?
spaCy: Production NER
spaCy is a widely used production NLP library, with Explosion's enterprise customers including Airbus, Microsoft, and Roche. Its NER component uses a transition-based parser with a neural network policy rather than a CRF, so inference makes a single left-to-right pass that is linear in sentence length, with no Viterbi step that is quadratic in the number of labels. The en_core_web_trf (transformer) model achieves 90.0 F1 on OntoNotes 5.0.
Training custom NER with spaCy v3 uses config-driven pipelines. The training data format is DocBin, and the framework supports incremental training - adding new entity types without catastrophically forgetting existing ones, provided old examples are included in the training mix.
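As a hedged sketch of the data-preparation step, the snippet below builds a one-document DocBin with spaCy v3 (assumed to be installed). The sentence, the character offsets, and the temporary output path are assumptions chosen for illustration; real training data would come from an annotation tool.

```python
import pathlib
import tempfile

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")  # tokenizer only; no trained components are needed here
doc = nlp.make_doc("Apple acquired Beats in 2014.")

# Character offsets -> token-aligned spans. char_span returns None when the
# offsets do not line up with token boundaries, so check before assigning.
annotations = [(0, 5, "ORG"), (15, 20, "ORG"), (24, 28, "DATE")]
ents = [doc.char_span(start, end, label=label) for start, end, label in annotations]
assert all(span is not None for span in ents), "offsets must align with tokens"
doc.ents = ents

db = DocBin()
db.add(doc)

# Written to a temporary directory here; a project would use e.g. ./train.spacy.
out_path = pathlib.Path(tempfile.mkdtemp()) / "train.spacy"
db.to_disk(out_path)

# Round-trip check: the entity annotations survive serialization.
reloaded = list(DocBin().from_disk(out_path).get_docs(nlp.vocab))
print([(ent.text, ent.label_) for ent in reloaded[0].ents])
```

A file like this is what the `python -m spacy train` command reads when the config's `paths.train` setting points to it. For incremental training, documents carrying the existing entity types would be added to the same DocBin alongside the new ones.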
spaCy's Matcher and PhraseMatcher allow rule-based entity tagging that runs before the statistical model. Rules handle high-precision cases (product codes, ticker symbols) with zero training data, while the neural model handles ambiguous cases.
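A minimal sketch of the rule-based side, assuming spaCy v3 is installed. The `TICKER` label and the two-symbol list are hypothetical; here the matches are written straight onto a blank pipeline's document, whereas a full pipeline would combine them with the statistical component's predictions.

```python
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.blank("en")  # no statistical NER; rules only

# Hypothetical gazetteer of ticker symbols, matched on exact token text.
tickers = ["AAPL", "MSFT"]
matcher = PhraseMatcher(nlp.vocab)
matcher.add("TICKER", [nlp.make_doc(symbol) for symbol in tickers])

doc = nlp("AAPL rose while MSFT fell.")
# Each match is (match_id, start_token, end_token); match_id encodes the label.
spans = [Span(doc, start, end, label=match_id) for match_id, start, end in matcher(doc)]
doc.ents = spans

rule_ents = [(ent.text, ent.label_) for ent in doc.ents]
print(rule_ents)
```

Rules like these need no training data, but they only fire on exact surface forms, which is why ambiguous mentions are left to the neural model.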
Misconception: NER models automatically detect all entity types without specifying them upfront.
Reality: NER is a closed-set problem. The model can only predict entity types seen during training; novel types require retraining or few-shot adaptation.
Why: The model learns type-specific patterns from labeled examples; it has no mechanism to generalize to a label it has never associated with any text span.
What advantage does spaCy's transition-based NER have over a CRF for production deployment?
Key Ideas
- **NER is structured prediction** - the model assigns labels to a sequence jointly, not independently, because adjacent labels are highly correlated and boundary coherence matters for F1.
- **BiLSTM-CRF** was the dominant architecture until transformers: the BiLSTM produces context-aware emission scores, the CRF layer scores label transitions, and Viterbi decoding picks the best overall label sequence.
- **Evaluation is span-level F1** (exact boundary + correct type), not token accuracy - because the O class dominates and would inflate accuracy for degenerate models.
Related Topics
NER connects to extraction and downstream NLP applications:
- Information Extraction — NER is the first stage of IE pipelines - extracted entities become arguments in relation and event extraction
- BERT and Masked Language Models — Fine-tuned BERT with a token classification head replaced BiLSTM-CRF as the accuracy leader on CoNLL-2003
Questions for Reflection
- How would a NER system's evaluation change if evaluated on social media text vs. newswire text, and what factors drive the difference?
- When building a financial NER system, what entity types beyond standard PERSON/ORG/DATE would be most valuable, and how would training data be collected?
- What are the failure modes of BIO tagging for nested entities like 'Bank of [England]' where England is a GPE inside an ORG?
Related Lessons
- nlp-12 — Fine-tuned BERT with token head replaced BiLSTM-CRF
- nlp-20 — NER is the first stage of information extraction pipelines
- nlp-07 — NER is token-level classification vs document-level
- prob-09-discrete-dist — CRF models a joint distribution over tag sequences
- cv-10 — Object detection labels image regions like NER labels spans
- ml-01-intro