Natural Language Processing
BERT and Masked Language Models
Prerequisites
- The attention and self-attention mechanism - the core of the Transformer encoder
- Encoder-Decoder architecture: BERT uses only the encoder half
BERT and Bidirectional Pre-training
In October 2018, Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova at Google published 'BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding'. The central idea was masked language modeling: the model sees the whole sentence and predicts randomly masked tokens using context from both left and right at once. Earlier models fell short of this: ELMo combined a left-to-right and a right-to-left LSTM language model, and GPT-1 was strictly left-to-right. BERT was bidirectional in every layer thanks to the Transformer encoder (Vaswani et al., 2017). A second pre-training objective was Next Sentence Prediction. BERT set new state-of-the-art results on 11 NLP tasks at once, including the GLUE benchmark and SQuAD, and cemented the 'pretrain then fine-tune' paradigm that still dominates NLP. Masked LM remains the default recipe for pre-training encoders (RoBERTa, DeBERTa, ModernBERT).
October 2018: Google's team publishes BERT. The paper reports state-of-the-art results on 11 NLP tasks simultaneously, spanning GLUE, SQuAD, and SWAG. NLP had not seen a breakthrough of this scale since word2vec. The trick: BERT never saw a single human-labeled example during pre-training - only raw text from English Wikipedia and BooksCorpus. Everything needed for a downstream task is learned in a few hours of fine-tuning.
- **Google Search** has used BERT since 2019 to understand search queries - especially conversational queries with prepositions and context-dependent words. At launch this affected roughly 1 in 10 English-language searches in the US
- **Bing** and many enterprise search systems use BERT-based rankers for document relevance scoring - [CLS] or pooled token embeddings serve as vector representations of documents in semantic search
- **CodeBERT** from Microsoft Research is a BERT-style encoder pre-trained on paired source code and natural-language documentation from GitHub repositories - it is used for tasks such as code search and understanding function semantics and comments
Masked Language Modeling: BERT Learns to Fill Gaps
GPT reads text left to right and predicts the next word. Powerful, but one-sided: the model sees only the left context. BERT (Bidirectional Encoder Representations from Transformers, 2018) inverts the task: 15% of tokens are randomly selected for masking, and the model must restore them by seeing the ENTIRE context - both left and right. In 'The bank [MASK] money' the answer is 'stores'; in 'The river bank [MASK] steep' it is 'is'. Only context on both sides of 'bank' can resolve such polysemy. As a result, BERT token embeddings are context-dependent: the same word gets a different vector in each sentence, which static embeddings such as Word2Vec (one vector per word) cannot provide.
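The difference in visible context can be made concrete with attention visibility matrices. The following is a minimal sketch, not code from any BERT or GPT implementation; the helper name `visibility` is hypothetical, and it ignores the detail that GPT predicts the token after each position.

```python
def visibility(seq_len, causal):
    """Return a seq_len x seq_len matrix; entry [i][j] is 1 if position i may attend to position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(seq_len)]
            for i in range(seq_len)]

tokens = ["The", "river", "bank", "[MASK]", "steep"]
causal_mask = visibility(len(tokens), causal=True)          # GPT-style: left context only
bidirectional_mask = visibility(len(tokens), causal=False)  # BERT-style: full context

mask_pos = tokens.index("[MASK]")
left_only = [t for t, v in zip(tokens, causal_mask[mask_pos]) if v]
full_context = [t for t, v in zip(tokens, bidirectional_mask[mask_pos]) if v]
print(left_only)     # tokens visible at the [MASK] position under a causal mask
print(full_context)  # tokens visible under a bidirectional mask
```

Under the causal mask the word 'steep' is invisible at the masked position; under the bidirectional mask it is available as evidence.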
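Of the 15% selected for masking: 80% are replaced with [MASK], 10% with a random token, 10% are left unchanged. The reason is a mismatch between pre-training and fine-tuning: [MASK] never appears during fine-tuning, where the model must handle normal text. Mixing in random and unchanged tokens keeps the model from relying on the [MASK] token and forces it to build a useful representation for every real word, since any input token might be one it has to predict.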
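The 80/10/10 rule can be sketched in a few lines. This is an illustrative simplification, not the original BERT code: it samples each position independently (the reference implementation picks a fixed number of positions per sequence), it does not protect special tokens such as [CLS] and [SEP], and the function name `mask_tokens` and the toy vocabulary are assumptions made for the example.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Apply BERT-style corruption; return (corrupted tokens, labels).

    labels[i] holds the original token at selected positions and None elsewhere,
    so the loss would be computed only over the selected positions.
    """
    rng = random.Random(seed)
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected: no prediction target
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token
    return corrupted, labels

vocab = ["the", "bank", "stores", "money", "river", "is", "steep"]
tokens = ["the", "bank", "stores", "money"] * 250  # 1,000 toy tokens
corrupted, labels = mask_tokens(tokens, vocab, seed=0)
selected = [i for i, lab in enumerate(labels) if lab is not None]
n_mask = sum(corrupted[i] == MASK for i in selected)
print(len(selected) / len(tokens))  # close to 0.15
print(n_mask / len(selected))       # close to 0.80
```

Note that the labels cover all selected positions, including those left unchanged: the model is asked to predict the original token even when the input already shows it.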
Why does BERT mask only 80% of the selected 15% of tokens rather than all 15%?
Next Sentence Prediction and Its Downfall
BERT is trained on two tasks simultaneously. The second is NSP (Next Sentence Prediction): the model is given two text segments and must predict whether the second follows the first in the original text. The goal is to teach the model relationships between sentences for tasks like Question Answering and Natural Language Inference. In practice, NSP turned out to be weak: the model can largely solve it through lexical and topical cues without understanding real coherence. RoBERTa (2019) removed NSP and matched or slightly improved downstream results - a surprising finding at the time.
A better alternative to NSP has been proposed: Sentence Order Prediction (SOP) in ALBERT - predicting whether two consecutive sentences appear in their original order or have been swapped. A separate refinement targets the MLM objective itself: Whole Word Masking masks complete words instead of individual subword tokens. The current consensus: MLM alone is sufficient for most downstream tasks; inter-sentence objectives help mainly when the downstream task works with sentence pairs.
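How the two objectives build their training pairs shows why NSP is the easier task. The sketch below assumes the 50/50 positive/negative split described in the BERT and ALBERT papers; the function names and toy documents are hypothetical, and real pipelines work with multi-sentence segments rather than single sentences.

```python
import random

def nsp_example(docs, rng):
    """NSP: B is the true next sentence (label 1) or a sentence from another document (label 0)."""
    d = rng.randrange(len(docs))
    i = rng.randrange(len(docs[d]) - 1)
    a = docs[d][i]
    if rng.random() < 0.5:
        return a, docs[d][i + 1], 1
    other = rng.choice([k for k in range(len(docs)) if k != d])
    return a, rng.choice(docs[other]), 0

def sop_example(docs, rng):
    """SOP: both sentences are consecutive in one document; label 1 if in order, 0 if swapped."""
    d = rng.randrange(len(docs))
    i = rng.randrange(len(docs[d]) - 1)
    a, b = docs[d][i], docs[d][i + 1]
    if rng.random() < 0.5:
        return a, b, 1
    return b, a, 0

docs = [
    ["She opened the account.", "The bank approved it the next day.", "Then she deposited her salary."],
    ["The river rose overnight.", "The bank was steep and muddy.", "Nobody tried to cross."],
]
rng = random.Random(0)
print(nsp_example(docs, rng))
print(sop_example(docs, rng))
```

An NSP negative comes from a different document, so a topic mismatch alone often gives the answer away. An SOP negative shares topic and vocabulary with the positive, so the model has to attend to ordering and coherence.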
RoBERTa removed NSP from pre-training and showed improvement. What does this reveal about NSP?
Fine-Tuning: One BERT for a Hundred Tasks
Pre-training BERT-Large took 4 days on 16 Cloud TPUs (64 TPU chips); BERT-Base took 4 days on 4 Cloud TPUs. After that, adapting the model to a new task requires only a few hours of fine-tuning on task-specific data - this is the transfer learning revolution in NLP. For classification: add a linear layer on top of the [CLS] token. For NER: a linear layer on top of every token. For QA: two linear projections that score each token as the start or the end of the answer span. In all cases, the entire BERT is updated through backpropagation - this is full fine-tuning, as opposed to feature extraction, where BERT's weights stay frozen and only the new head is trained.
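The three task heads differ only in which final-layer vectors they read. The following toy sketch uses random numbers in place of a real encoder output and tiny dimensions; the helper names `linear` and `forward`, the class counts, and the sizes are all assumptions for illustration. The QA head is written as one linear layer with two outputs per token, which is equivalent to two separate single-output projections.

```python
import random

rng = random.Random(0)
HIDDEN = 8    # toy size; BERT-Base uses 768
SEQ_LEN = 6   # position 0 plays the role of [CLS]

def linear(in_dim, out_dim):
    """Create a randomly initialised linear layer as (weights, bias)."""
    w = [[rng.gauss(0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return w, [0.0] * out_dim

def forward(layer, x):
    """Compute W x + b for one hidden vector x."""
    w, b = layer
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

# Stand-in for the encoder's final-layer output: one vector per token.
hidden = [[rng.gauss(0, 1) for _ in range(HIDDEN)] for _ in range(SEQ_LEN)]

# Classification: one linear layer applied to the [CLS] vector only.
cls_head = linear(HIDDEN, 3)                      # 3 hypothetical classes
class_logits = forward(cls_head, hidden[0])

# NER: the same linear layer applied to every token vector.
ner_head = linear(HIDDEN, 5)                      # 5 hypothetical tags
tag_logits = [forward(ner_head, h) for h in hidden]

# QA: each token gets a start score and an end score.
qa_head = linear(HIDDEN, 2)
span_logits = [forward(qa_head, h) for h in hidden]
start = max(range(SEQ_LEN), key=lambda i: span_logits[i][0])
end = max(range(start, SEQ_LEN), key=lambda i: span_logits[i][1])  # end may not precede start

print(len(class_logits), len(tag_logits), len(tag_logits[0]), (start, end))
```

In full fine-tuning, gradients from these heads would flow back into the encoder that produced `hidden`; in feature extraction, only the head weights would change.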
Typical fine-tuning hyperparameters for BERT: learning rate 2e-5 to 5e-5, batch size 16 or 32, 2-4 epochs. Too large a learning rate or too many epochs causes catastrophic forgetting of pre-trained knowledge. Linear warmup over the first 10% of steps followed by linear decay is standard practice. PEFT (Parameter-Efficient Fine-Tuning) via LoRA updates only a small fraction of the parameters (often well under 1%), frequently with comparable quality.
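The warmup-then-decay schedule is a simple piecewise-linear function of the training step. A minimal sketch, assuming decay to zero at the final step; the function name `lr_at` and the step count are hypothetical, and training libraries ship their own versions of this schedule.

```python
def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    """Linear warmup from 0 to peak_lr, then linear decay back to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / remaining)

total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```

With these settings the rate climbs to 2e-5 at step 100, is back at half the peak by step 550, and reaches zero at step 1000. The small early steps protect the pre-trained weights from large updates while the freshly initialised head is still producing noisy gradients.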
What is the difference between fine-tuning and feature extraction when using BERT?
The [CLS] Token: A Sentence-Level Semantic Vector
BERT always prepends a special [CLS] (Classification) token to every input. After passing through all 12 Transformer layers of BERT-Base (24 in BERT-Large), the [CLS] position holds a 768-dimensional vector (1024-dimensional in BERT-Large) that has aggregated information from the entire sequence through self-attention. This vector is what gets used for sentence classification. An important nuance: in pre-trained BERT, the [CLS] vector is trained only for NSP, so without fine-tuning it is a poor measure of semantic similarity; mean pooling over all tokens or dedicated Sentence-BERT representations perform better.
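The two pooling strategies differ in a single line. This sketch operates on a hand-written toy matrix instead of real encoder output; the function names are hypothetical, and the attention mask follows the common convention of 1 for real tokens and 0 for padding.

```python
def cls_pooling(hidden):
    """Use the final-layer vector at position 0 ([CLS]) as the sentence vector."""
    return list(hidden[0])

def mean_pooling(hidden, attention_mask):
    """Average final-layer vectors over real tokens, skipping padding (mask == 0)."""
    dim = len(hidden[0])
    n_real = sum(attention_mask)
    return [sum(h[d] for h, m in zip(hidden, attention_mask) if m) / n_real
            for d in range(dim)]

# Toy final-layer output: 4 positions x 3 dimensions (BERT-Base would be seq_len x 768).
hidden = [
    [1.0, 0.0, 2.0],  # [CLS]
    [3.0, 2.0, 0.0],  # token 1
    [2.0, 4.0, 1.0],  # token 2
    [9.0, 9.0, 9.0],  # [PAD], must not influence the mean
]
attention_mask = [1, 1, 1, 0]

print(cls_pooling(hidden))                   # [1.0, 0.0, 2.0]
print(mean_pooling(hidden, attention_mask))  # [2.0, 2.0, 1.0]
```

Forgetting the mask is a common bug: padded positions would drag every sentence vector toward the same padding representation.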
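Sentence-BERT (SBERT) fine-tunes BERT in a siamese network on sentence pairs so that pooled sentence vectors (mean pooling by default, with [CLS] as an option) can be compared directly with cosine similarity. This enables efficient similarity computation. Finding the most similar pair among 10,000 sentences takes about 50 million pairwise comparisons (10,000 x 9,999 / 2): roughly 65 hours with vanilla BERT, which must run every pair through the network, versus about 5 seconds with SBERT once all embeddings are pre-computed.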
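The pair count and the cheap comparison step can be checked directly. In this sketch the 3-dimensional embeddings are made up for illustration and stand in for vectors that a sentence encoder would produce once per sentence; the function names are hypothetical.

```python
import math
from itertools import combinations

def num_pairs(n):
    """Number of unordered sentence pairs among n sentences: n(n-1)/2."""
    return n * (n - 1) // 2

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

print(num_pairs(10_000))  # 49995000, i.e. roughly 50 million

# Bi-encoder workflow: encode each sentence once, then compare cheap vectors.
embeddings = {
    "A cat sits on the mat.":   [0.9, 0.1, 0.0],
    "A kitten rests on a rug.": [0.8, 0.2, 0.1],
    "Stocks fell on Monday.":   [0.0, 0.1, 0.9],
}
best = max(combinations(embeddings, 2),
           key=lambda pair: cosine(embeddings[pair[0]], embeddings[pair[1]]))
print(best)
```

The expensive encoder runs n times instead of n(n-1)/2 times; what remains per pair is a dot product, which is why the total drops from hours to seconds.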
Common misconception: the [CLS] token contains sentence-level information from the start, because it is first and 'sees' all following tokens
In fact, [CLS] accumulates information through self-attention across all layers. In lower layers [CLS] contains little meaningful information - the useful representation only emerges after passing through all 12 layers
Why: self-attention in the encoder is bidirectional, so position gives [CLS] no special view. [CLS] attends to all tokens, but all tokens also attend to [CLS] and to each other. Information accumulates iteratively across layers, not through position. This is why downstream tasks use output from the last layer, not the first
Why does mean pooling of BERT tokens often outperform [CLS] for semantic sentence similarity tasks?
Key Ideas
- **MLM** gives BERT bidirectional context: masking 15% of tokens and predicting them while seeing the full sequence creates context-dependent embeddings unavailable to unidirectional models
- **NSP turned out to be unnecessary**: RoBERTa without NSP matched or outperformed BERT - demonstrating that MLM alone is sufficient for most tasks. ALBERT replaces NSP with SOP (Sentence Order Prediction) for cases where inter-sentence coherence matters
- **Fine-tuning in hours** instead of pre-training in days: adding a classification head on top of [CLS] + 2-4 epochs on target data delivers leading results. Mean pooling over tokens often outperforms [CLS] for sentence similarity tasks
Related Topics
BERT is a central hub in modern NLP architecture:
- GPT and Autoregressive LMs — GPT and BERT are opposite approaches: autoregressive vs masked. GPT excels at generation, BERT at understanding
- Machine Translation — BERTScore uses BERT embeddings to evaluate translation quality; mBERT serves as an encoder in hybrid NMT systems
Questions for Reflection
- BERT learns to predict masked tokens without any explicit labeling. What linguistic knowledge emerges from this simple objective?
- Catastrophic forgetting during fine-tuning is a real problem. How would you fine-tune BERT for medical documentation where general language knowledge and specialized medical knowledge both need to be preserved?
- BERT has a 512-token limit. How would you handle long documents (full articles, legal contracts) during classification?
Related Lessons
- nlp-13 — GPT is autoregressive where BERT is masked bidirectional
- nlp-10 — Self-attention generalizes seq2seq attention into Transformers
- nlp-07 — Fine-tuned BERT became the default text classifier
- ml-31-transformers — BERT is a Transformer encoder stack
- ml-41-transfer-learning — Pretrain-then-finetune is transfer learning for text
- it-01 — Masked LM minimizes cross-entropy over masked tokens
- dl-01