Generative AI
Evaluation and Benchmarks
GPT-4 scores 86.4% on MMLU. Gemini Ultra scores 90.0%. Claude 3 Opus scores 86.8%. These numbers appear everywhere, but what do they actually measure? MMLU tests multiple-choice knowledge questions, and the headline scores are not always produced the same way: the GPT-4 and Claude 3 Opus figures are 5-shot results, while Gemini Ultra's 90.0% was reported with chain-of-thought sampling (CoT@32). Real-world LLM performance depends on instruction following, reasoning, safety, and task-specific accuracy. Benchmark gaming is real: models trained on benchmark-adjacent data score higher without being more capable.
- OpenAI's GPT-4 technical report describes checking its exam and benchmark questions for overlap with the training data and reporting how scores change when contaminated questions are excluded. The evaluation design is as important as the score.
- LMSYS Chatbot Arena has accumulated 600,000+ human preference votes comparing model responses side-by-side. It is often regarded as one of the most trustworthy benchmarks because it measures actual human preference on user-written prompts, not a proxy metric on a fixed test set.
- Anthropic's Constitutional AI work has a model critique its own outputs against a written set of principles. The same style of automated self-critique can flag safety violations in an eval suite, stretching limited human evaluation effort.
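The contamination concern above can be made concrete with a rough overlap check. The sketch below flags benchmark items that share a long word n-gram with any training document. It is a simple heuristic for illustration, not the procedure used by any particular lab; the function names and the default `n = 8` are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(benchmark_items: list[str],
                       training_docs: list[str],
                       n: int = 8) -> float:
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    if not benchmark_items:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_grams)
    return flagged / len(benchmark_items)


# Toy data: the first item overlaps with the training document, the second does not.
items = [
    "the quick brown fox jumps over the lazy dog today",
    "what is the capital of france and why is it so",
]
docs = ["yesterday the quick brown fox jumps over the lazy dog"]
rate = contamination_rate(items, docs)
print(f"flagged fraction: {rate:.2f}")
```

Exact n-gram matching misses paraphrased or translated leaks, so a low rate from a check like this is weak evidence of a clean benchmark.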
Prerequisites
- How LLMs generate text and what pre-training does to model knowledge
- Basic statistics: percentages, confidence intervals, why sample size matters
- Familiarity with RLHF and human preference data (it underpins preference-based evaluation)
From GLUE to Chatbot Arena: how the field learned to measure language models
Before 2018 there was no shared yardstick for general language understanding. Alex Wang and collaborators introduced GLUE in 2018, a collection of nine sentence-level tasks bundled into one score. Models saturated it within a year, so SuperGLUE arrived in 2019 with harder tasks.
As models grew, narrow tasks stopped being informative. Dan Hendrycks and colleagues published MMLU in 2020: 57 subjects from elementary math to professional law, designed to probe broad knowledge that a single fine-tuned model could not fake. In 2022 Stanford CRFM, led by Percy Liang, released HELM (Holistic Evaluation of Language Models), arguing that a single accuracy number hides most of the picture, so it scored models across accuracy, robustness, fairness, calibration, and toxicity together.
The last shift was philosophical. Static benchmarks leak into training data and stop reflecting real use. In 2023 LMSYS launched Chatbot Arena, where people compare two anonymous models on their own prompts and vote, producing an Elo ranking from hundreds of thousands of head-to-head matches. The progression mirrors the models themselves: each benchmark answered the previous one's blind spot, and each got saturated or contaminated in turn.
MMLU
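**MMLU** (Massive Multitask Language Understanding) is a multiple-choice benchmark spanning 57 subjects, from elementary math to professional law. It measures breadth of knowledge, not instruction following or safety, and because its questions are public it is susceptible to data contamination.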
MMLU is regularly tested in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding the trade-offs and failure modes demonstrates production-level expertise.
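A benchmark score is an estimate from a finite set of questions, so small gaps between models can be noise. The sketch below puts a normal-approximation confidence interval around two of the MMLU scores quoted in this lesson. The test-set size of 14,042 questions is an assumption, and the helper names are illustrative.

```python
import math


def accuracy_ci(accuracy: float, n_questions: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation (Wald) interval for an accuracy measured on n_questions items."""
    se = math.sqrt(accuracy * (1.0 - accuracy) / n_questions)
    return accuracy - z * se, accuracy + z * se


def intervals_overlap(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


N_QUESTIONS = 14_042  # assumed size of the MMLU test split

gpt4 = accuracy_ci(0.864, N_QUESTIONS)   # GPT-4 score quoted in this lesson
opus = accuracy_ci(0.868, N_QUESTIONS)   # Claude 3 Opus score quoted in this lesson

print(f"GPT-4:         {gpt4[0]:.4f} to {gpt4[1]:.4f}")
print(f"Claude 3 Opus: {opus[0]:.4f} to {opus[1]:.4f}")
print("intervals overlap:", intervals_overlap(gpt4, opus))
```

Overlapping intervals are only a rough screen. A paired test on per-question results is the stronger comparison when both models were run on the same items.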
What problem does MMLU primarily solve in generative AI systems?
HumanEval
**HumanEval** is a code-generation benchmark of 164 Python problems, each checked by unit tests. Results are reported as pass@k: the probability that at least one of k sampled solutions passes all of a problem's tests.
HumanEval is regularly tested in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding the trade-offs and failure modes demonstrates production-level expertise.
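HumanEval results are reported as pass@k. A minimal sketch of the unbiased estimator described in the paper that introduced HumanEval: generate n samples per problem, count the c that pass the unit tests, and compute 1 - C(n-c, k) / C(n, k). The function name and argument checks are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed all unit tests, k: sample budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Per-problem (n, c) counts; the benchmark score is the mean over problems.
results = [(20, 5), (20, 0), (20, 20)]
score = sum(pass_at_k(n, c, k=2) for n, c in results) / len(results)
print(f"pass@2 = {score:.3f}")
```

Simply checking whether any of the first k samples passed gives a noisier estimate from the same n samples, which is why the combinatorial form is preferred.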
What problem does HumanEval primarily solve in generative AI systems?
LMSYS Chatbot Arena
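**LMSYS Chatbot Arena** is a crowdsourced evaluation platform: users send their own prompts to two anonymous models, vote for the better response, and the votes are aggregated into an Elo-style ranking. Because it measures human preference directly, it does not depend on a fixed test set that can leak into training data.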
LMSYS Chatbot Arena is regularly tested in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding the trade-offs and failure modes demonstrates production-level expertise.
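Pairwise votes become a ranking through a rating update. The sketch below applies the standard Elo formula to a short list of votes. The K-factor of 32, the starting rating of 1000, and the model names are assumptions for illustration, and the live leaderboard's exact computation may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Update both ratings. score_a is 1.0 if A won, 0.0 if B won, 0.5 for a tie."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def rank_models(votes: list[tuple[str, str, float]],
                initial: float = 1000.0, k: float = 32.0) -> dict[str, float]:
    """Replay (model_a, model_b, score_a) votes in order and return final ratings."""
    ratings: dict[str, float] = {}
    for model_a, model_b, score_a in votes:
        r_a = ratings.get(model_a, initial)
        r_b = ratings.get(model_b, initial)
        ratings[model_a], ratings[model_b] = update_elo(r_a, r_b, score_a, k)
    return ratings


votes = [
    ("model-x", "model-y", 1.0),  # x preferred over y
    ("model-x", "model-z", 1.0),  # x preferred over z
    ("model-y", "model-z", 0.5),  # tie
]
ratings = rank_models(votes)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.1f}")
```

Sequential Elo depends on the order in which votes are replayed, so rankings built on few votes should be read with wide error bars.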
What problem does LMSYS Chatbot Arena primarily solve in generative AI systems?
Automated Evaluation (LLM-as-Judge)
**Automated Evaluation (LLM-as-Judge)** uses a capable LLM (for example GPT-4 or Claude Opus) to grade other models' responses. It scales evaluation far beyond what human raters can cover, but it introduces its own biases, notably position bias and self-preference.
Automated Evaluation (LLM-as-Judge) is regularly tested in GenAI engineering interviews at OpenAI, Anthropic, Google DeepMind, and AI-forward product companies. Understanding the trade-offs and failure modes demonstrates production-level expertise.
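One common mitigation for position bias is to ask the judge twice with the answer order swapped and accept a winner only when both passes agree. The sketch below shows that pattern. The `Judge` callable is a hypothetical stand-in for a real LLM call, and the two stub judges exist only to exercise the logic.

```python
from typing import Callable

# Hypothetical judge interface: (prompt, first_answer, second_answer) -> "first" | "second" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, answer_a: str, answer_b: str) -> str:
    """Return "A", "B", or "tie", requiring agreement across both presentation orders."""
    # Pass 1 shows A first; pass 2 shows B first, so the label mapping flips.
    first_pass = {"first": "A", "second": "B"}.get(judge(prompt, answer_a, answer_b), "tie")
    second_pass = {"first": "B", "second": "A"}.get(judge(prompt, answer_b, answer_a), "tie")
    return first_pass if first_pass == second_pass else "tie"


def position_biased_judge(prompt: str, first: str, second: str) -> str:
    """Stub judge that always prefers whichever answer is shown first."""
    return "first"


def length_judge(prompt: str, first: str, second: str) -> str:
    """Stub judge that prefers the longer answer, regardless of position."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"


print(debiased_verdict(position_biased_judge, "q", "answer one", "answer two"))
print(debiased_verdict(length_judge, "q", "short", "a much longer answer"))
```

The swap catches position bias but not self-preference or a preference for longer answers; the `length_judge` stub passes the consistency check while still being biased.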
Misconception: Evaluation and Benchmarks requires specialized AI research expertise unavailable to most engineering teams.
Reality: Evaluation and Benchmarks is implementable with standard open-source tools and cloud APIs; the key skill is understanding the trade-offs and when to apply each technique.
The LLM ecosystem (vLLM, trl, LangChain, LlamaIndex, Instructor) has productized most generative AI patterns. The engineering challenge is choosing the right tools and understanding their failure modes, not building from scratch.
What problem does Automated Evaluation (LLM-as-Judge) primarily solve in generative AI systems?
Related Topics
These topics form the surrounding Evaluation and Benchmarks ecosystem:
- RLHF and DPO — Win rate vs reference model is the standard metric for measuring RLHF/DPO alignment quality
- AI Safety and Alignment — Safety evaluations (refusal rate, harmful content generation) require specialized benchmark design beyond standard capability benchmarks
- GenAI in Interviews — Benchmark literacy - knowing what MMLU measures vs what it does not - is a senior GenAI engineer signal
Key Ideas
- **MMLU (Massive Multitask Language Understanding):** 57-subject multiple choice test covering STEM, humanities, social science; measures breadth of knowledge; susceptible to data contamination
- **HumanEval:** 164 Python coding problems with unit tests; measures functional code generation; pass@k metric evaluates probability of at least one correct solution in k samples
- **LMSYS Chatbot Arena:** crowdsourced human preference evaluation via blind side-by-side comparisons; Elo rating system; most reflective of real user preference
- **LLM-as-Judge:** using a capable LLM (GPT-4, Claude Opus) to automatically evaluate other LLMs' responses; scales human evaluation; introduces position bias and self-preference issues
Questions for Reflection
- How does Evaluation and Benchmarks change when moving from a prototype to a production system serving 1 million users?
- What are the primary failure modes in Evaluation and Benchmarks and what monitoring catches them before users are affected?
- How would you explain the trade-offs in Evaluation and Benchmarks to a non-technical stakeholder who needs to approve the infrastructure budget?
Related Lessons
- gai-07 — Benchmarks measure what alignment actually changed
- gai-24 — Evaluation knowledge is tested in interviews
- aie-31-evaluation — Production LLM evaluation pipelines
- ml-53-ab-testing-ml — Arena ranking is statistical A/B comparison of models
- stat-05-hypothesis — Comparing benchmark scores needs significance testing