Natural Language Processing

Summarization

A doctor seeing 40 patients a day cannot read 50 research papers to check drug interactions. A financial analyst covering 200 stocks cannot read every earnings call transcript. A lawyer cannot review every relevant precedent in a case with thousands of documents. Summarization is the technology that makes information at scale actionable. Claude processes hundreds of millions of document summarization requests per month; Bloomberg Intelligent Summarization condenses earnings calls for traders in seconds; Donut and similar systems summarize legal filings that would take associates hours. This is one of the highest-impact NLP tasks in production today.

  • **Bloomberg Intelligent Summarization** condenses earnings calls and analyst reports into 3-sentence executive summaries delivered to traders within seconds of publication, synthesizing hours of spoken content into actionable signals.
  • **Anthropic Claude** handles document summarization as one of its highest-volume enterprise use cases - law firms, consultancies, and pharmaceutical companies use it to summarize contracts, clinical trial reports, and research papers at scale.
  • **Semantic Scholar** (Allen Institute for AI) auto-generates single-sentence TLDR summaries for millions of academic papers using BART-based models, enabling researchers to triage relevance across the literature without reading each abstract.

Prerequisites

  • Seq2seq generation and encoder-decoder architectures (BART, T5)
  • TF-IDF and cosine similarity for graph-based extractive methods
  • The concept of hallucination and why it is dangerous in generation
  • T5, BART, and Encoder-Decoder Architectures
  • Question Answering

From Luhn to Neural Summarization

1958. Hans Peter Luhn at IBM publishes "The Automatic Creation of Literature Abstracts". His idea is simple and still alive: a sentence's importance can be scored by the frequency of the significant words it contains. This gave rise to extractive summarization. For half a century this statistical approach dominated, from TF-IDF to graph methods like TextRank. The turning point came in 2017, when Abigail See, Peter Liu, and Christopher Manning introduced pointer-generator networks, a model that could both copy words from the source and generate new ones, paired with a coverage mechanism that reduced repetition. Then came pretrained seq2seq models: BART and T5 (2019-2020), followed by PEGASUS (Zhang and co-authors, Google, 2020) with its gap sentence generation pretraining built specifically for summarization.

Extractive Summarization

Extractive summarization selects a subset of sentences from the source document and concatenates them as the summary. No new text is generated - every sentence in the output exists verbatim in the input. Classic algorithms: TextRank (graph-based sentence centrality), LexRank (eigenvector centrality on cosine similarity graph), and Lead-3 (first 3 sentences - a strong baseline for news).
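
The graph-based idea can be sketched in a few lines of plain Python. This is a minimal illustration, not the reference TextRank or LexRank implementation: it assumes a naive regex sentence splitter, raw term counts instead of TF-IDF weights, and a fixed number of PageRank-style iterations. The function names (`split_sentences`, `centrality_summary`) are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    # Naive splitter on ., !, ? followed by whitespace (an assumption;
    # real systems use a proper sentence tokenizer).
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def cosine(a, b):
    # Cosine similarity between two term-count vectors stored as Counters.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def centrality_summary(text, k=2, damping=0.85, iters=50):
    sentences = split_sentences(text)
    n = len(sentences)
    if n <= k:
        return sentences
    vectors = [Counter(re.findall(r"[a-z0-9]+", s.lower())) for s in sentences]
    # Similarity graph: edge weight is cosine similarity, no self-loops.
    sim = [
        [cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
        for i in range(n)
    ]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iters):
        # PageRank-style update: each sentence receives score from its
        # neighbours in proportion to the edge weight.
        scores = [
            (1 - damping) / n
            + damping
            * sum(
                scores[j] * sim[j][i] / row_sums[j]
                for j in range(n)
                if row_sums[j] > 0
            )
            for i in range(n)
        ]
    # Keep the k most central sentences, returned in document order.
    top = sorted(sorted(range(n), key=lambda i: scores[i], reverse=True)[:k])
    return [sentences[i] for i in top]
```

Because the output is assembled only from `sentences`, every returned sentence is verbatim from the input, which is the defining property of extractive summarization.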

Lead-3 achieves ROUGE-1 of ~40 on CNN/DailyMail, competitive with many early neural models, because news articles follow the inverted pyramid style where the most important information appears first. Domain matters: Lead-3 is weak on scientific papers and earnings call transcripts, where the key content is spread throughout the document rather than concentrated in the opening sentences.
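
The baseline itself is almost trivial to write down, which is part of why it is such a useful sanity check. A minimal sketch, assuming a naive regex sentence splitter (the helper name `lead_k` is hypothetical):

```python
import re


def lead_k(text, k=3):
    # Split on sentence-final punctuation followed by whitespace
    # (an assumption; production code would use a real sentence tokenizer).
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # The "summary" is simply the first k sentences, unchanged.
    return " ".join(sentences[:k])
```

Any proposed summarizer for news should be compared against `lead_k(article, 3)` before drawing conclusions from its ROUGE scores.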

Neural extractive models (BertSum, MatchSum) frame sentence selection as a binary classification problem: for each sentence, predict whether it belongs in the summary. They improve over Lead-3 on non-news domains but are rarely worth the compute cost when abstractive models are available.

Why does the Lead-3 baseline (first 3 sentences) perform so well on CNN/DailyMail news summarization?

Abstractive Summarization

Abstractive summarization generates new text that condenses source content and is not restricted to source sentences. Encoder-decoder models (BART, T5, Pegasus) are trained to generate human-written summaries. Pegasus (Google, 2020) introduced Gap Sentence Generation (GSG) as a summarization-specific pretraining objective: select the most important sentences in a document (scored by ROUGE against the rest of the document rather than chosen at random), mask them, and train the model to regenerate them from the remaining text, directly simulating the summarization task.
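
How a GSG training pair might be constructed can be sketched as follows. This is an illustration under stated assumptions, not the Pegasus codebase: each sentence is scored independently by unigram ROUGE F1 against the rest of the document, the `ratio` of masked sentences is an illustrative parameter, and the `[MASK1]` string and function names are placeholders.

```python
import re
from collections import Counter

MASK = "[MASK1]"  # placeholder sentence-mask token (assumption)


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1_f(candidate, reference):
    # Unigram F1 with clipped counts between two token lists.
    c, r = Counter(candidate), Counter(reference)
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def gap_sentence_example(sentences, ratio=0.3):
    # Number of sentences to mask; at least one.
    m = max(1, round(len(sentences) * ratio))
    scored = []
    for i, sentence in enumerate(sentences):
        rest = [t for j, s in enumerate(sentences) if j != i for t in tokens(s)]
        scored.append((rouge1_f(tokens(sentence), rest), i))
    # Highest-scoring sentences are treated as the "important" ones.
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    chosen = sorted(i for _, i in ranked[:m])
    # Encoder input: the document with chosen sentences masked out.
    source = " ".join(MASK if i in chosen else s for i, s in enumerate(sentences))
    # Decoder target: the masked sentences, in document order.
    target = " ".join(sentences[i] for i in chosen)
    return source, target
```

The pair `(source, target)` has the same shape as a summarization example: a long input with its most central content removed, and a short output that restates that content.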

BART (Lewis et al. 2020) fine-tuned on CNN/DailyMail achieves ROUGE-1 of 44.16, ROUGE-2 of 21.28, ROUGE-L of 40.90 - the strongest results on that benchmark at time of publication. Fine-tuning requires only ~10k examples for reasonable quality; production models often further fine-tune on domain-specific data (legal, medical, financial).

Abstractive models hallucinate facts not present in the source - a critical failure mode for news, medical, and legal summarization. Faithfulness evaluation (FactCC, DAE, QuestEval) measures whether generated summary claims are entailed by the source document, independent of n-gram overlap with the reference.
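
To make the failure mode concrete, here is a deliberately crude check that flags numbers and capitalized words appearing in a summary but nowhere in the source. It is not FactCC, DAE, or QuestEval, which rely on trained models; it is only a naive surface-level proxy, and the function name `unsupported_facts` is hypothetical.

```python
import re

NUMBER = r"\d+(?:[.,]\d+)*"


def unsupported_facts(source, summary):
    # Everything the source mentions, as lowercase words and number strings.
    source_words = set(re.findall(r"[a-z]+", source.lower()))
    source_numbers = set(re.findall(NUMBER, source))
    flagged = []
    # Numbers in the summary that never occur in the source.
    for number in re.findall(NUMBER, summary):
        if number not in source_numbers:
            flagged.append(number)
    # Capitalized words (a rough stand-in for named entities) that the
    # source never mentions in any casing.
    for word in re.findall(r"\b[A-Z][a-z]+\b", summary):
        if word.lower() not in source_words:
            flagged.append(word)
    return flagged
```

A check like this catches only the most blatant inventions (a wrong figure, an unseen name). It misses swapped relations, negations, and paraphrased errors, which is why entailment-based and QA-based metrics exist.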

What is Pegasus's Gap Sentence Generation (GSG) pretraining objective, and why is it effective for summarization?

Multi-Document Summarization

Multi-Document Summarization (MDS) generates a single coherent summary from multiple related source documents. Key challenges not present in single-document: cross-document redundancy (same event reported from different angles), contradictions (conflicting facts across sources), and length - concatenating all sources often exceeds model context windows.

Common MDS approaches: (1) hierarchical encoding - encode each document independently, then aggregate; (2) extract-then-abstract - use extractive methods to reduce input size, then abstractive generation; (3) long-context models with sparse attention (Longformer, BigBird), which handle inputs of several thousand tokens, and up to 16k in the Longformer Encoder-Decoder (LED) variant. Multi-News and WikiSum are standard MDS benchmarks.

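LLMs (GPT-4, Claude) handle MDS well via map-reduce prompting: summarize each document independently (map), then ask the model to synthesize the individual summaries into a final summary (reduce). This sidesteps the context window limit on the sources themselves; when the individual summaries are together still too long for one prompt, the reduce step can be applied recursively over groups of summaries, so the approach scales to large numbers of source documents.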
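
The control flow of map-reduce summarization can be sketched independently of any particular model. In this sketch `summarize` is a hypothetical callable that takes text and returns a shorter text (for example, a wrapper around an LLM API call), and `fan_in`, the number of partial summaries merged per reduce call, is an illustrative parameter.

```python
def map_reduce_summarize(documents, summarize, fan_in=4):
    """Summarize many documents with a caller-supplied `summarize(text) -> str`."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2 so each round shrinks the list")
    if not documents:
        return ""
    # Map: summarize every document on its own.
    partials = [summarize(doc) for doc in documents]
    # Reduce: merge groups of partial summaries until one remains.
    while len(partials) > 1:
        partials = [
            summarize("\n\n".join(partials[i : i + fan_in]))
            for i in range(0, len(partials), fan_in)
        ]
    return partials[0]
```

Grouping by a fixed count keeps the sketch simple; a real pipeline would more likely group by token budget. Note that nothing here resolves contradictions between sources: the reduce prompt has to ask for that explicitly.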

What unique challenge does multi-document summarization face that single-document summarization does not?

Summarization Evaluation

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures n-gram overlap between generated and reference summaries. ROUGE-1 counts unigrams, ROUGE-2 counts bigrams, ROUGE-L measures longest common subsequence. ROUGE is the standard benchmark metric but has well-known weaknesses: a correct paraphrase that shares no n-grams with the reference scores 0, and factual errors go undetected as long as the wording overlaps.
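
The definitions above translate directly into code. The following is a simplified sketch, assuming lowercase regex tokenization and no stemming, so its numbers will not exactly match the official ROUGE toolkit; the function names are hypothetical.

```python
import re
from collections import Counter


def _tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def _prf(overlap, cand_total, ref_total):
    # Precision, recall, and F1 from an overlap count.
    p = overlap / cand_total if cand_total else 0.0
    r = overlap / ref_total if ref_total else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f}


def rouge_n(candidate, reference, n=1):
    def ngrams(toks):
        return Counter(tuple(toks[i : i + n]) for i in range(len(toks) - n + 1))

    cand, ref = ngrams(_tokens(candidate)), ngrams(_tokens(reference))
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    return _prf(overlap, sum(cand.values()), sum(ref.values()))


def rouge_l(candidate, reference):
    a, b = _tokens(candidate), _tokens(reference)
    # Longest common subsequence length via dynamic programming.
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return _prf(prev[-1], len(a), len(b))
```

With the reference "the cat sat on the mat", the candidate "the cat lay on the mat" scores a ROUGE-1 recall of 5/6, while the reasonable paraphrase "a feline rested upon a rug" scores 0 on every variant.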

Faithfulness evaluation is the critical missing piece: a summary can achieve high ROUGE while hallucinating facts. FactCC checks whether each summary sentence is entailed by the source using a BERT-based NLI classifier. QuestEval generates questions from the summary and checks whether the source can answer them. BARTScore frames faithfulness as the log-probability of generating the summary from the source.

Human evaluation remains the gold standard for summarization: annotators rate Coherence, Consistency (faithfulness), Fluency, and Relevance. SummEval (Fabbri et al. 2021) provides expert and crowdsourced human judgments of model-generated CNN/DailyMail summaries, showing that model rankings on ROUGE frequently disagree with human rankings - correlation is only ~0.3.

Misconception: A model achieving the highest ROUGE on CNN/DailyMail produces the best summaries in practice.

Reality: ROUGE-human correlation on CNN/DailyMail is only ~0.3; models optimized for ROUGE often produce extractive-like output that scores well but reads poorly.

Why: ROUGE rewards n-gram overlap with the reference, not faithfulness, coherence, or conciseness - these require separate evaluation metrics or human judgment.

What critical failure mode does ROUGE fail to detect in generated summaries?

Key Ideas

  • **Extractive summarization** selects verbatim sentences - Lead-3 is a deceptively strong baseline for news (ROUGE-1 ~40), while TextRank and BertSum work better on non-news domains.
  • **Abstractive summarization** (BART, Pegasus, T5) generates new text and achieves higher quality, but hallucination is the critical risk - faithfulness evaluation is as important as ROUGE.
  • **ROUGE is necessary but insufficient**: human-ROUGE correlation is only ~0.3; production systems require faithfulness metrics (FactCC, QuestEval) alongside ROUGE to catch factual errors.

Related Topics

Summarization draws on generation and evaluation methods from across NLP:

  • T5, BART, and Encoder-Decoder Architectures — BART and T5 are the dominant backbone architectures for abstractive summarization fine-tuning
  • Question Answering — QuestEval evaluates summarization faithfulness by generating QA pairs from the summary and checking source answerability

Questions for Reflection

  • A pharmaceutical company needs to summarize 10,000 clinical trial reports, where factual accuracy is critical for regulatory compliance - which architecture and evaluation approach would be most appropriate?
  • When would extractive summarization be preferable to abstractive even though abstractive models generally score higher on ROUGE?
  • How would a map-reduce LLM summarization pipeline handle contradictions between multiple source documents?

Related Lessons

  • nlp-14 — BART and T5 are the standard abstractive summarizers
  • nlp-18 — Both produce a focused answer from long text
  • nlp-15 — LLMs do zero-shot abstractive summarization
  • ml-52-search-ranking — Extractive summarization ranks and selects sentences
  • it-01 — Summarization is lossy compression of source meaning
  • ml-01-intro