Natural Language Processing
Summarization
A doctor seeing 40 patients a day cannot read 50 research papers to check drug interactions. A financial analyst covering 200 stocks cannot read every earnings call transcript. A lawyer cannot review every relevant precedent in a case with thousands of documents. Summarization is the technology that makes information at scale actionable. Claude processes hundreds of millions of document summarization requests per month; Bloomberg Intelligent Summarization condenses earnings calls for traders in seconds; Donut and similar systems summarize legal filings that would take associates hours. This is one of the highest-impact NLP tasks in production today.
- **Bloomberg Intelligent Summarization** condenses earnings calls and analyst reports into 3-sentence executive summaries delivered to traders within seconds of publication, synthesizing hours of spoken content into actionable signals.
- **Anthropic Claude** handles document summarization as one of its highest-volume enterprise use cases - law firms, consultancies, and pharmaceutical companies use it to summarize contracts, clinical trial reports, and research papers at scale.
- **Semantic Scholar** (Allen Institute for AI) auto-generates one-paragraph summaries for millions of academic papers using BART-based models, enabling researchers to triage relevance across the full literature without reading each abstract.
Prerequisites
- Seq2seq generation and encoder-decoder architectures (BART, T5)
- TF-IDF and cosine similarity for graph-based extractive methods
- The concept of hallucination and why it is dangerous in generation
From Luhn to Neural Summarization
1958. Hans Peter Luhn at IBM publishes "The Automatic Creation of Literature Abstracts". His idea is simple and still alive: a sentence's importance can be scored by the frequency of the significant words it contains, giving rise to extractive summarization. For half a century this statistical approach dominated, from TF-IDF to graph methods like TextRank. The turning point came in 2017, when Abigail See, Peter Liu, and Christopher Manning introduced pointer-generator networks, a model that could both copy words from the source and generate new ones, paired with a coverage mechanism that reduced repetition. Then came pretrained seq2seq models: BART and T5 (2019-2020), followed by PEGASUS (Zhang and co-authors, Google, 2020) with its gap sentence generation pretraining built specifically for summarization.
Extractive Summarization
Extractive summarization selects a subset of sentences from the source document and concatenates them as the summary. No new text is generated - every sentence in the output exists verbatim in the input. Classic algorithms: TextRank (graph-based sentence centrality), LexRank (eigenvector centrality on cosine similarity graph), and Lead-3 (first 3 sentences - a strong baseline for news).
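The graph-based idea can be sketched in a few lines. This is a minimal illustration, not the published TextRank algorithm: it assumes a naive regex tokenizer and uses cosine similarity over raw term counts as the edge weight (closer in spirit to LexRank's cosine graph than to TextRank's original overlap measure). The function names and the default damping factor are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(sentence: str) -> list[str]:
    # Naive tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", sentence.lower())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_sentences(sentences: list[str], k: int = 2, damping: float = 0.85, iters: int = 50) -> list[str]:
    vecs = [Counter(tokenize(s)) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph without self-loops.
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # PageRank-style power iteration over the similarity graph.
    for _ in range(iters):
        scores = [
            (1 - damping) / n
            + damping * sum(sim[j][i] / out_weight[j] * scores[j] for j in range(n) if out_weight[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    # Return the selected sentences verbatim, in document order.
    return [sentences[i] for i in sorted(top)]


sentences = [
    "The council approved the transit budget.",
    "The transit budget adds two bus lines.",
    "Bus lines will run later in the evening.",
    "A local bakery won a regional award.",
]
picked = rank_sentences(sentences, k=2)
print(picked)
```

The off-topic bakery sentence shares no terms with the others, so it receives only the baseline score and is not selected.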
Lead-3 achieves ROUGE-1 of ~40 on CNN/DailyMail, competitive with many early neural models, because news articles follow the inverted pyramid style where the most important information appears first. Domain matters: Lead-3 is weak on scientific papers (key results appear in later sections) and earnings call transcripts.
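Lead-3 is simple enough to write out in full. The sketch below assumes a naive regex sentence splitter (a production pipeline would use a proper sentence tokenizer), and the sample article is invented for illustration.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def lead_k(text: str, k: int = 3) -> str:
    # Keep the first k sentences verbatim; shorter documents are returned whole.
    return " ".join(split_sentences(text)[:k])


article = (
    "The city council approved the new transit budget on Monday. "
    "The plan adds two bus lines and extends evening service. "
    "Funding comes from a regional sales tax. "
    "Council members debated the proposal for three hours. "
    "A public hearing is scheduled for next month."
)
summary = lead_k(article)
print(summary)
```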
Neural extractive models (BertSum, MatchSum) frame sentence selection as a binary classification problem: for each sentence, predict whether it belongs in the summary. They improve over Lead-3 on non-news domains but are rarely worth the compute cost when abstractive models are available.
Why does the Lead-3 baseline (first 3 sentences) perform so well on CNN/DailyMail news summarization?
Abstractive Summarization
Abstractive summarization generates new text that condenses source content - not restricted to source sentences. Encoder-decoder models (BART, T5, Pegasus) are trained to generate human-written summaries. Pegasus (Google, 2020) introduced Gap Sentence Generation (GSG) as a summarization-specific pretraining objective: select the most important sentences (scored by their ROUGE overlap with the rest of the document, not chosen at random), mask them, and train the model to regenerate them, directly simulating the summarization task.
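How a GSG training pair might be built can be sketched as follows. This is an illustration under stated assumptions, not the Pegasus implementation: sentence importance is approximated by unigram-overlap F1 against the rest of the document (a simplified stand-in for ROUGE-1 F1), and the `MASK` string, function names, and default mask ratio are hypothetical choices for this example.

```python
import re
from collections import Counter

MASK = "<mask_1>"  # illustrative placeholder for a masked sentence


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_f1(candidate: list[str], reference: list[str]) -> float:
    # Unigram-overlap F1 with clipped counts.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(candidate), overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def make_gsg_example(sentences: list[str], mask_ratio: float = 0.3) -> tuple[str, str]:
    n_mask = max(1, round(len(sentences) * mask_ratio))
    scores = []
    for i, sentence in enumerate(sentences):
        # Score each sentence against all other sentences in the document.
        rest = [t for j, other in enumerate(sentences) if j != i for t in tokens(other)]
        scores.append(overlap_f1(tokens(sentence), rest))
    chosen = set(sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_mask])
    # Encoder input: document with the chosen sentences masked out.
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    # Decoder target: the masked sentences, in document order.
    target = " ".join(sentences[i] for i in sorted(chosen))
    return source, target


doc = [
    "The council approved the transit budget.",
    "The transit budget adds two bus lines.",
    "Bus lines will run later in the evening.",
    "A local bakery won a regional award.",
]
source, target = make_gsg_example(doc)
print(source)
print(target)
```

The model sees `source` and is trained to emit `target`, which is why the objective resembles producing a summary from the remaining context.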
BART (Lewis et al. 2020) fine-tuned on CNN/DailyMail achieves ROUGE-1 of 44.16, ROUGE-2 of 21.28, ROUGE-L of 40.90 - the strongest results on that benchmark at time of publication. Fine-tuning requires only ~10k examples for reasonable quality; production models often further fine-tune on domain-specific data (legal, medical, financial).
Abstractive models hallucinate facts not present in the source - a critical failure mode for news, medical, and legal summarization. Faithfulness evaluation (FactCC, DAE, QuestEval) measures whether generated summary claims are entailed by the source document, independent of n-gram overlap with the reference.
What is Pegasus's Gap Sentence Generation (GSG) pretraining objective, and why is it effective for summarization?
Multi-Document Summarization
Multi-Document Summarization (MDS) generates a single coherent summary from multiple related source documents. Key challenges not present in single-document: cross-document redundancy (same event reported from different angles), contradictions (conflicting facts across sources), and length - concatenating all sources often exceeds model context windows.
Common MDS approaches: (1) hierarchical encoding - encode each document independently, then aggregate; (2) extract-then-abstract - use extractive methods to reduce input size, then abstractive generation; (3) long-context models (Longformer, BigBird) with sparse attention, which extend inputs to roughly 4k-16k tokens. Multi-News and WikiSum are standard MDS benchmarks.
LLMs (GPT-4, Claude) handle MDS well via map-reduce prompting: summarize each document independently (map), then ask the model to synthesize the individual summaries into a final summary (reduce). This sidesteps context window limits, and it scales to large numbers of source documents when the reduce step is applied recursively over batches of summaries.
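The map-reduce pattern can be sketched independently of any particular model API. In the example below, `llm` is a hypothetical callable that takes a prompt string and returns text; the prompt wording, the `batch_size` parameter, and the `fake_llm` stub are assumptions made for illustration, not part of any vendor SDK.

```python
from typing import Callable


def map_reduce_summarize(documents: list[str], llm: Callable[[str], str], batch_size: int = 4) -> str:
    if not documents:
        raise ValueError("need at least one document")
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2 so each reduce round shrinks the list")
    # Map: summarize every document on its own.
    summaries = [llm(f"Summarize the following document:\n\n{doc}") for doc in documents]
    # Reduce: merge summaries in batches, repeating until one remains,
    # so no single prompt has to hold every partial summary.
    while len(summaries) > 1:
        merged = []
        for i in range(0, len(summaries), batch_size):
            batch = summaries[i:i + batch_size]
            prompt = (
                "Combine these summaries into one, noting any contradictions:\n\n"
                + "\n\n".join(batch)
            )
            merged.append(llm(prompt))
        summaries = merged
    return summaries[0]


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call: records the prompt and returns a numbered label.
    calls.append(prompt)
    return f"summary-{len(calls)}"


final = map_reduce_summarize([f"document {i}" for i in range(5)], fake_llm)
print(final, len(calls))
```

With five documents and a batch size of four, the stub is called five times in the map step and three times across two reduce rounds.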
What unique challenge does multi-document summarization face that single-document summarization does not?
Summarization Evaluation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between generated and reference summaries. ROUGE-1 counts unigrams, ROUGE-2 counts bigrams, ROUGE-L measures longest common subsequence. ROUGE is the standard benchmark metric but has well-known weaknesses: a valid paraphrase that shares no n-grams with the reference scores 0, and factual errors go undetected.
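To make the definitions concrete, here is a simplified ROUGE sketch that reports F1 scores. It assumes a naive lowercase tokenizer and omits the stemming and other preprocessing used by standard ROUGE packages, so its numbers will not match published scores exactly; the function names are illustrative.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def ngrams(toks: list[str], n: int) -> Counter:
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: str, reference: str, n: int) -> float:
    c, r = ngrams(tokens(candidate), n), ngrams(tokens(reference), n)
    # Clipped n-gram matches between candidate and reference.
    return f1(sum((c & r).values()), sum(c.values()), sum(r.values()))


def lcs_length(a: list[str], b: list[str]) -> int:
    # Dynamic-programming longest common subsequence, one row at a time.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    c, r = tokens(candidate), tokens(reference)
    return f1(lcs_length(c, r), len(c), len(r))


reference = "the cat sat on the mat"
close = "the cat lay on the mat"
paraphrase = "a feline rested upon a rug"

print(rouge_n(close, reference, 1), rouge_n(close, reference, 2), rouge_l(close, reference))
print(rouge_n(paraphrase, reference, 1))
```

The near-copy scores 5/6 on ROUGE-1 and ROUGE-L and 3/5 on ROUGE-2, while the paraphrase, which conveys the same meaning, scores 0 on ROUGE-1.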
Faithfulness evaluation is the critical missing piece: a summary can achieve high ROUGE while hallucinating facts. FactCC checks whether each summary sentence is entailed by the source using a BERT-based NLI classifier. QuestEval generates questions from the summary and checks whether the source can answer them. BARTScore frames faithfulness as the log-probability of generating the summary from the source.
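The BARTScore formulation can be written out explicitly. In the notation assumed here, $x$ is the source document, $y = (y_1, \dots, y_m)$ is the generated summary, $\theta$ denotes the parameters of a pretrained seq2seq model, and $\omega_t$ are optional token weights (uniform in the simplest case):

```latex
\mathrm{BARTScore}(x \to y) = \sum_{t=1}^{m} \omega_t \, \log p\left(y_t \mid y_{<t},\, x;\, \theta\right)
```

A summary containing content the model finds unlikely given the source receives a lower (more negative) score.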
Human evaluation remains the gold standard for summarization: annotators rate Coherence, Consistency (faithfulness), Fluency, and Relevance. SummEval (Fabbri et al. 2021) provides a human evaluation dataset with expert and crowdsourced annotations, showing that model rankings on ROUGE frequently disagree with human rankings - correlation is only ~0.3.
Misconception: A model achieving the highest ROUGE on CNN/DailyMail produces the best summaries in practice.
Reality: ROUGE-human correlation on CNN/DailyMail is only ~0.3; models optimized for ROUGE often produce extractive-like output that scores well but reads poorly.
Why: ROUGE rewards n-gram overlap with the reference, not faithfulness, coherence, or conciseness - these require separate evaluation metrics or human judgment.
What critical failure mode does ROUGE fail to detect in generated summaries?
Key Ideas
- **Extractive summarization** selects verbatim sentences - Lead-3 is a deceptively strong baseline for news (ROUGE-1 ~40), while TextRank and BertSum work better on non-news domains.
- **Abstractive summarization** (BART, Pegasus, T5) generates new text and achieves higher quality, but hallucination is the critical risk - faithfulness evaluation is as important as ROUGE.
- **ROUGE is necessary but insufficient**: human-ROUGE correlation is only ~0.3; production systems require faithfulness metrics (FactCC, QuestEval) alongside ROUGE to catch factual errors.
Related Topics
Summarization draws on generation and evaluation methods from across NLP:
- T5, BART, and Encoder-Decoder Architectures — BART and T5 are the dominant backbone architectures for abstractive summarization fine-tuning
- Question Answering — QuestEval evaluates summarization faithfulness by generating QA pairs from the summary and checking source answerability
Questions for Reflection
- A pharmaceutical company needs to summarize 10,000 clinical trial reports, where factual accuracy is critical for regulatory compliance - which architecture and evaluation approach would be most appropriate?
- When would extractive summarization be preferable to abstractive even though abstractive models generally score higher on ROUGE?
- How would a map-reduce LLM summarization pipeline handle contradictions between multiple source documents?
Related Lessons
- nlp-14 — BART and T5 are the standard abstractive summarizers
- nlp-18 — Both produce a focused answer from long text
- nlp-15 — LLMs do zero-shot abstractive summarization
- ml-52-search-ranking — Extractive summarization ranks and selects sentences
- it-01 — Summarization is lossy compression of source meaning
- ml-01-intro