Natural Language Processing
Question Answering
Every modern search engine and AI assistant is a question answering system at its core. When a doctor queries a clinical decision support tool, when an analyst asks a financial AI about earnings, when a student uses Khanmigo - the challenge is identical: locate the right information and return a precise answer, not a list of documents. QA research produced the retriever-reader architecture that powers Google's featured snippets (serving 3 billion users), Alexa's factual responses, and the evidence-grounded answers in Perplexity AI.
- **Google Featured Snippets** uses extractive QA (BERT-based) to select answer spans for 10-30% of all search queries, directly answering questions without requiring users to click through to source pages.
- **IBM Watson for Oncology** applied MRC-based QA to clinical literature to recommend cancer treatment options, processing hundreds of thousands of oncology papers to surface evidence-backed treatment suggestions for physicians.
- **Perplexity AI** processes over 10 million queries per day using an open-domain QA pipeline with real-time web retrieval, presenting cited answers that compete directly with traditional search results.
Prerequisites
- BERT and fine-tuning an encoder for a downstream task
- Tokenization and a span as a (start, end) pair of positions in text
- The idea of retrieval: finding relevant documents for a query
SQuAD and the Reading Comprehension Era
In 2016, Pranav Rajpurkar and co-authors at Stanford released SQuAD (the Stanford Question Answering Dataset): more than 100,000 question-answer pairs over Wikipedia articles, where the answer is always a contiguous span of text. This gave the field a clear, reproducible benchmark, and the race toward human-level performance began. A year later, in 2017, Danqi Chen and colleagues introduced DrQA, a system that combined full-Wikipedia search with a reading comprehension model, establishing the retriever-reader architecture for open-domain QA. In 2018, SQuAD 2.0 added unanswerable questions to test whether a model can say "no answer" rather than confidently fabricate one, and later that year BERT-based models surpassed the human baseline on SQuAD 1.1.
Extractive QA
Extractive QA identifies a verbatim answer span within a provided passage. The model predicts a start token index and an end token index - no text is generated. BERT-style models add two linear heads (one for start, one for end) on top of the encoder, trained with cross-entropy over token positions on SQuAD-style datasets.
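A minimal sketch of the decoding step described above, assuming the encoder has already produced one start logit and one end logit per token. The function name `best_span`, the `max_answer_len` cap, and the logit values are illustrative assumptions, not part of any specific library.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_logits[i] + end_logits[j]
    subject to i <= j < i + max_answer_len."""
    best = (0, 0, float("-inf"))
    for i, start_score in enumerate(start_logits):
        # Only consider end positions at or after the start, within the cap.
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_score + end_logits[j]
            if score > best[2]:
                best = (i, j, score)
    return best


# Made-up logits for illustration; a real model would produce these.
tokens = ["DeepMind", "was", "acquired", "by", "Google", "in", "2014"]
start_logits = [0.1, -1.0, -0.5, -1.2, 3.0, -0.8, 0.4]
end_logits = [-0.3, -1.1, -0.9, -1.0, 2.5, -0.7, 1.0]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)  # Google
```

The answer is always a slice of the input tokens, which is why extractive predictions are easy to trace back to the passage.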
SQuAD 2.0 introduced the unanswerable question challenge: roughly one third of the training questions (and about half of the development set) have no answer in the passage, requiring the model to output a 'no answer' prediction rather than force-extract an incorrect span. Systems are evaluated with both F1 (partial credit for overlapping tokens) and Exact Match (EM).
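One common way to produce the 'no answer' prediction, used in BERT-style SQuAD 2.0 fine-tuning, is to treat position 0 (the `[CLS]` token) as the null span and compare its score with the best real span. The sketch below assumes that convention; the function name, the `null_threshold` parameter, and the logits are hypothetical, and in practice the threshold would be tuned on a development set.

```python
from typing import List, Optional, Tuple


def predict_span_or_null(start_logits: List[float], end_logits: List[float],
                         null_threshold: float = 0.0,
                         max_answer_len: int = 30) -> Optional[Tuple[int, int]]:
    """Return the best (start, end) span, or None for 'no answer'."""
    # Assumption: position 0 is [CLS] and stands for the null answer.
    null_score = start_logits[0] + end_logits[0]

    best_score, best = float("-inf"), None
    for i in range(1, len(start_logits)):
        for j in range(i, min(i + max_answer_len, len(end_logits))):
            score = start_logits[i] + end_logits[j]
            if score > best_score:
                best_score, best = score, (i, j)

    # Abstain when the null score beats the best span by more than the threshold.
    if best is None or null_score - best_score > null_threshold:
        return None
    return best


answerable = predict_span_or_null([0.2, 0.1, 4.0, 0.3], [0.1, 0.0, 0.5, 3.5])
unanswerable = predict_span_or_null([5.0, 0.1, 0.4, 0.2], [4.5, 0.0, 0.3, 0.1])
print(answerable, unanswerable)  # (2, 3) None
```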
F1 in QA is token-overlap between predicted and gold answer spans. An answer of 'Google acquired DeepMind' for a gold of 'Google' scores F1 = 0.5. Exact Match only scores 1 for identical strings after normalization.
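The two metrics can be sketched in a few lines. The normalization below (lowercase, strip punctuation, drop English articles, collapse whitespace) follows the usual SQuAD-style convention; the function names are illustrative, and this is a simplified sketch rather than the official evaluation script.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, remove punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Precision = 1/3, recall = 1, so F1 = 0.5, matching the example in the text.
print(token_f1("Google acquired DeepMind", "Google"))
print(exact_match("The Google.", "google"))
```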
What output does an extractive QA model produce?
Generative QA
Generative QA produces free-text answers using a sequence-to-sequence or decoder-only model. Unlike extractive QA, the answer need not appear verbatim in the source - the model can synthesize, rephrase, and aggregate information from multiple passages. T5-based models (UnifiedQA, Flan-T5) and GPT-4 exemplify this approach.
The tradeoff vs. extractive: generative models are more flexible and handle multi-step questions but are harder to verify and more prone to hallucination. For closed-book QA (no retrieved context), generative models rely entirely on parametric knowledge - useful for benchmarking knowledge retention but risky in production.
Evaluating generative QA is harder than evaluating extractive QA: n-gram overlap metrics such as ROUGE and BLEU miss semantically correct paraphrases. BERTScore and model-based evaluators (GPT-4 as judge) are increasingly used to assess correctness independent of surface form.
What is the key advantage of generative QA over extractive QA?
Open-Domain QA
Open-domain QA answers questions without a pre-specified context - the system must retrieve relevant passages from a large corpus and then extract or generate the answer. The classic architecture is the retriever-reader pipeline: Dense Passage Retrieval (DPR) fetches top-k passages, and a reader extracts or generates the answer.
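The control flow of a retriever-reader pipeline can be sketched with toy components. Everything here is a stand-in: `retrieve` ranks passages by shared words (not DPR or BM25), and `toy_reader` simply returns the top passage where a real reader would extract a span or generate text. The function names and the three-sentence corpus are assumptions for illustration only.

```python
import re
from typing import Callable, List


def words(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))


def retrieve(question: str, corpus: List[str], k: int = 2) -> List[str]:
    """Toy retriever: rank passages by the number of words shared with the question."""
    q_words = words(question)
    scored = [(len(q_words & words(p)), idx) for idx, p in enumerate(corpus)]
    # Highest overlap first; ties broken by corpus order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [corpus[idx] for _, idx in scored[:k]]


def toy_reader(question: str, passages: List[str]) -> str:
    # Placeholder: a real reader would extract a span or generate an answer.
    return passages[0] if passages else ""


def answer(question: str, corpus: List[str],
           reader: Callable[[str, List[str]], str], k: int = 2) -> str:
    passages = retrieve(question, corpus, k)  # stage 1: retrieval
    return reader(question, passages)         # stage 2: reading


corpus = [
    "Inception was directed by Christopher Nolan",
    "Star Wars was released in 1977",
    "BERT is an encoder-only transformer",
]
result = answer("Who directed Inception?", corpus, toy_reader)
print(result)
```

Keeping the two stages behind separate functions is what lets a sparse retriever be swapped for a dense one, or an extractive reader for a generative one, without touching the rest of the pipeline.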
DPR (Karpukhin et al. 2020) trains separate BERT encoders for questions and passages using in-batch negatives: within a training batch, each question's gold passage is the positive and the passages paired with the other questions serve as negatives. On Natural Questions, DPR raised top-20 retrieval accuracy from 59.1% (BM25) to 79.4%. Fusion-in-Decoder further improves the reader side by encoding each passage independently with T5, then concatenating all encoder outputs for joint decoding.
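The in-batch negatives objective can be sketched without any deep learning framework. The code assumes question and passage vectors have already been produced by the two encoders and that passage `i` is the gold passage for question `i`; similarity is the dot product, and the loss is the negative log-likelihood of the gold passage under a softmax over all passages in the batch. The function names and vectors are illustrative, not DPR's actual code.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_negative_loss(q_vecs: List[Vector], p_vecs: List[Vector]) -> float:
    """Mean NLL where passage i is the positive for question i and all
    other passages in the batch act as negatives."""
    losses = []
    for i, q in enumerate(q_vecs):
        scores = [dot(q, p) for p in p_vecs]
        # Numerically stable log-sum-exp over the batch.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


questions = [[4.0, 0.0], [0.0, 4.0]]
aligned_passages = [[1.0, 0.0], [0.0, 1.0]]   # gold passages score highest
shuffled_passages = [[0.0, 1.0], [1.0, 0.0]]  # gold passages score lowest

aligned_loss = in_batch_negative_loss(questions, aligned_passages)
shuffled_loss = in_batch_negative_loss(questions, shuffled_passages)
print(aligned_loss < shuffled_loss)  # True
```

Reusing the other passages in the batch as negatives means a batch of B pairs yields B - 1 negatives per question at no extra encoding cost.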
Fusion-in-Decoder (FiD) scales to 100 retrieved passages because encoder self-attention runs within each passage only; the decoder then attends over the concatenated encoder representations and aggregates evidence across all passages at once. This late-fusion strategy improves NQ EM from 41.5 (DPR with an extractive reader) to 51.4.
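The shape of the fusion step can be illustrated with a toy encoder. `toy_encode` is a hypothetical stand-in for a T5 encoder that emits one small vector per whitespace token, and the `question:` / `context:` input format is an assumption for illustration; only the structure (encode each pair separately, concatenate along the sequence axis) reflects the technique.

```python
from typing import List

Vector = List[float]


def toy_encode(question: str, passage: str, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for a T5 encoder: one vector per whitespace token."""
    tokens = f"question: {question} context: {passage}".split()
    return [[float((len(tok) + d) % 7) for d in range(dim)] for tok in tokens]


def fuse_encoder_outputs(question: str, passages: List[str]) -> List[Vector]:
    """Encode each (question, passage) pair independently, then concatenate
    the per-token outputs into one long sequence for the decoder."""
    fused: List[Vector] = []
    for passage in passages:
        # No attention across passages at this stage.
        fused.extend(toy_encode(question, passage))
    return fused


passages = ["a b c", "d e"]
fused = fuse_encoder_outputs("q", passages)
# 6 tokens from the first pair + 5 from the second = 11 decoder-visible positions.
print(len(fused))  # 11
```

Because each pair is encoded on its own, encoder cost grows linearly with the number of passages, while the decoder still sees every token from every passage.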
How does DPR differ from BM25 for open-domain QA retrieval?
Reading Comprehension Benchmarks
Machine Reading Comprehension (MRC) benchmarks test whether models truly understand text or pattern-match. SQuAD, SQuAD 2.0, TriviaQA, and HotpotQA each stress different capabilities. HotpotQA requires multi-hop reasoning: the answer can only be found by connecting information across two or more passages.
Models achieve human parity on SQuAD 1.1 (EM > 82) but fall behind on adversarial datasets. Adversarial SQuAD inserts distractor sentences containing plausible but incorrect answer spans, and AdversarialQA collects questions that annotators wrote specifically to fool a model in the loop. This gap suggests pattern-matching rather than genuine comprehension.
Chain-of-thought prompting is reported to improve multi-hop QA by 15-25% on HotpotQA: asking GPT-4 to reason step by step before producing the final answer makes each intermediate hop explicit, so it can be checked before the model commits to an answer.
Common misconception: a model scoring above human performance on SQuAD 2.0 has human-level reading comprehension.
In reality, SQuAD measures a narrow extractive span-selection skill; human reading comprehension includes inference, common sense, and multi-document reasoning that SQuAD does not test.
In addition, SQuAD questions were written by annotators after reading the passage, which creates structural correlations between question and passage that models exploit without understanding the text.
Why do models that achieve superhuman performance on SQuAD still fail on adversarial MRC datasets?
Key Ideas
- **Extractive QA** predicts start/end token spans - fast, verifiable, and grounded, with BERT-based models reaching superhuman performance on SQuAD 2.0.
- **Open-domain QA** chains retrieval (DPR or BM25) with a reader, scaling to Wikipedia-sized corpora; Fusion-in-Decoder reaches 51.4 EM on Natural Questions by fusing evidence from 100 passages.
- **Adversarial benchmarks** expose that SQuAD-superhuman models are still pattern-matching, not comprehending; multi-hop datasets like HotpotQA remain challenging.
Related Topics
QA builds on retrieval, language modeling, and reasoning:
- RAG: Retrieval-Augmented Generation — Open-domain QA pioneered the retriever-reader pipeline that RAG generalizes to open-ended generation tasks
- BERT and Masked Language Models — BERT fine-tuned with start/end token heads is the backbone of extractive QA on SQuAD and TriviaQA
Questions for Reflection
- A legal QA system must provide answers with citations to source paragraphs for auditability - which approach (extractive, generative, or hybrid) best fits and why?
- How would a multi-hop QA system approach 'Was the director of Inception born before or after the release of Star Wars?'
- What evaluation metrics beyond Exact Match and F1 would better capture genuine understanding vs. pattern-matching in MRC?
Related Lessons
- nlp-17 — RAG retrieves context for open-domain QA
- nlp-12 — Extractive QA uses a BERT span head
- nlp-19 — Both compress source text into a focused answer
- aie-12-rag-fundamentals — Production QA is built on RAG pipelines
- ir-04 — Open-domain QA reuses ranked passage retrieval
- ml-01-intro