AI Engineering

Evaluation: How to Know an LLM Didn't Break After Deploy

Lesson Objectives

Understand why LLMs can't be tested like regular software (non-determinism, no single correct answer)
Learn why BLEU/ROUGE don't work for open-ended generation - and what to use instead
Build LLM-as-Judge evaluation with structured output, pairwise comparison, and position bias protection
Master RAGAS for RAG system evaluation and Langfuse for production evaluation pipeline
Integrate eval into CI/CD with regression detection and threshold-based blocking on every PR

Did the LLM get worse after a prompt change? Intuition can't answer that, because the model is stochastic. An eval suite is the only ground truth; without one, every deploy is Russian roulette. Notion AI learned this the hard way: the team updated a system prompt, all manual tests were green, the deploy shipped - and two days later support was drowning in complaints. Responses were 40% longer and user satisfaction was down 15%, because nobody was measuring conciseness. Systematic evaluation is insurance against regressions in dimensions nobody thought to measure.

LMSYS Chatbot Arena - the largest public benchmark, 500K+ votes. Models' Elo ratings update in real time based on pairwise human evaluation
OpenAI uses LLM-as-Judge + human eval + red teaming for every model release. The eval suite for GPT-4o took months
RAGAS (RAG Assessment) became the standard for evaluating RAG systems: faithfulness, answer relevancy, context precision, context recall - four independent LLM-judge calls
Langfuse raised USD 4M seed (2023) for open-source LLM observability + evaluation. Self-hosted is free, cloud is USD 0.01 per 1K events

How LLM Evaluation Took Shape

Evaluating language models systematically is a young practice. **HELM (Stanford CRFM, 2022)**, the Holistic Evaluation of Language Models, pushed the idea of measuring models across many tasks and metrics rather than a single benchmark. As models grew more open-ended, fixed answers stopped being enough. **LLM-as-a-judge, popularized by Zheng et al. with MT-Bench (2023)**, used a strong model to score another model's responses, making it practical to evaluate free-form output at scale. Together these ideas underpin the split between offline evaluation on a fixed suite and online evaluation on live traffic.

Prerequisites

Production Prompt Patterns: system/user/assistant, Few-Shot, Chain-of-Thought

Why Testing LLMs Is an Unsolved Problem

The prompt changed. All manual tests are green. Deploy went fine. Two days later, support is drowning: responses are 40% longer, repeat the same idea in different words, user satisfaction dropped 15%. The team didn't catch the degradation - because they weren't measuring conciseness. This is a real Notion AI incident, November 2023.

Intuition doesn't work - the model is stochastic. The same prompt at `temperature > 0` gives different answers every time. There is no single "correct" answer. A response can be **factually right but poorly worded** - or **perfectly written but containing a hallucination**. Without an eval suite, every deploy is Russian roulette.

Aspect	Traditional Software	LLM Output
Determinism	One input - one output	One input - many outputs
Criterion	Correct / incorrect	Spectrum: from hallucination to ideal
Metric	Pass / Fail	Scoring: 0.0 - 1.0 across N dimensions
Ground truth	Always defined	Often subjective
Scale	1000+ tests / second	~5-10 evals / second (with LLM-judge)
Cost	~USD 0	USD 0.01-0.10 per evaluation (LLM-judge)

LLM systems are evaluated across multiple **dimensions simultaneously**: factual correctness (no hallucinations), relevance (response matches the question), conciseness (not too long), tone (brand style), safety (no harmful content). Improving one dimension easily causes degradation in another - which is exactly why an eval suite is needed, not manual spot-checks.
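
The trade-off between dimensions can be made visible by comparing per-dimension averages before and after a change. The following is a minimal sketch, assuming per-case scores in the range 0.0 to 1.0 have already been produced by some scorer; the function names, the sample numbers, and the 0.05 tolerance are illustrative assumptions, not values from the lesson.

```python
from statistics import mean

# Hypothetical per-case scores in [0.0, 1.0], keyed by dimension name.
Scores = dict[str, list[float]]

def dimension_means(scores: Scores) -> dict[str, float]:
    """Average each dimension over all evaluated cases."""
    return {dim: mean(values) for dim, values in scores.items()}

def regressions(baseline: Scores, candidate: Scores, tolerance: float = 0.05) -> dict[str, float]:
    """Return the change in mean score for dimensions that dropped by more than `tolerance`."""
    base, cand = dimension_means(baseline), dimension_means(candidate)
    return {
        dim: round(cand[dim] - base[dim], 3)
        for dim in base
        if dim in cand and base[dim] - cand[dim] > tolerance
    }

# Illustrative data: correctness improves while conciseness silently degrades.
baseline = {"correctness": [0.9, 0.8, 1.0], "conciseness": [0.9, 0.8, 0.7]}
candidate = {"correctness": [1.0, 0.9, 1.0], "conciseness": [0.5, 0.6, 0.4]}
print(regressions(baseline, candidate))
```

A manual spot-check that only looks at correctness would approve this change; the per-dimension comparison flags the conciseness drop.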

Why doesn't a unit test like assertEquals(llm(prompt), expected) work for testing LLMs?

BLEU, ROUGE, BERTScore - and Why They Don't Work for LLMs

BLEU was invented in 2002 for machine translation. The idea is simple: count how many n-grams from the hypothesis appear in the reference. Works when there's one "correct" translation. But LLM responses have no single right version - and that's exactly where BLEU breaks. High BLEU score doesn't mean a good answer. Low BLEU doesn't mean a bad one.
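
To see why n-gram counting breaks on paraphrases, here is a small sketch of clipped n-gram precision, the core quantity inside BLEU. It deliberately omits the brevity penalty and the combination of several n-gram orders, and the example sentences are made up for illustration.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> Counter:
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_precision(hypothesis: str, reference: str, n: int = 1) -> float:
    """Clipped n-gram precision: share of hypothesis n-grams found in the reference."""
    hyp = ngrams(hypothesis.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    # Clip each n-gram's count by how often it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total

reference = "refunds are accepted within 14 days"
paraphrase = "you can return items for two weeks"
print(ngram_precision(paraphrase, reference))  # 0.0: same meaning, no shared words
print(ngram_precision(reference, reference))   # 1.0: verbatim copy
```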

ROUGE is the same idea but for summarization: recall instead of precision. ROUGE-L measures the longest common subsequence between reference and generated text. Same problem: the metric counts token overlap, not meaning. "14 days" and "two weeks" - ROUGE says: different. BERTScore via embeddings says: same thing.
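
ROUGE-L can be sketched in a few lines using the longest common subsequence (LCS) over whitespace tokens. This is a simplified illustration, not a replacement for a maintained ROUGE package: it does no stemming and no sentence splitting, and the example strings are invented.

```python
def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(hypothesis: str, reference: str) -> dict[str, float]:
    """ROUGE-L precision, recall and F1 over lowercase whitespace tokens."""
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    lcs = lcs_length(hyp, ref)
    recall = lcs / len(ref) if ref else 0.0
    precision = lcs / len(hyp) if hyp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if lcs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# "two weeks" and "14 days" mean the same thing but share no tokens,
# so only the first three words count toward the LCS.
print(rouge_l("returns accepted within two weeks", "returns accepted within 14 days"))
```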

BERTScore solves the synonym problem by comparing embeddings instead of tokens. The original metric matches contextual token embeddings; in production, cosine similarity between whole-text embeddings from `text-embedding-3-small` (USD 0.02 per 1M tokens) works as a convenient proxy. The closer the reference and generated embeddings, the closer the meaning: "14 days" and "two weeks" get similarity ~0.92, while a nonsensical response gets ~0.30.
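
The similarity computation itself is only a few lines. The sketch below assumes the embeddings have already been fetched from an embedding model; the 3-dimensional vectors are toy stand-ins chosen for readability, not real model output.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy vectors standing in for real embeddings (illustrative values only).
ref_vec = [0.8, 0.6, 0.0]    # reference answer
close_vec = [0.6, 0.8, 0.0]  # paraphrase pointing in nearly the same direction
far_vec = [0.0, 0.0, 1.0]    # unrelated text

print(cosine_similarity(ref_vec, close_vec))  # about 0.96
print(cosine_similarity(ref_vec, far_vec))    # 0.0
```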

Metric	What It Measures	Cost	When to Use	Trap
BLEU	Precision of n-gram matches	Free	Machine translation - and only that	Useless for open-ended generation
ROUGE-L	Recall - coverage of key phrases	Free	Summarization (with caution)	Doesn't see synonyms
BERTScore / embedding sim	Semantic similarity	~USD 0.0001	Any task - best automated metric	10-50x slower than ROUGE
Exact Match	Exact string match	Free	Classification, structured extraction	Too strict
F1 Token Overlap	Word set intersection	Free	QA - quick rough estimate	Same as ROUGE without LCS

**ROUGE-L = 0.85 doesn't mean "good response".** A response can contain all key words from the reference but be meaningless - or conversely, give the right answer in different words and score ROUGE 0.20. Automated metrics are a filter for regression testing, not a final quality assessment.
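
The two cheapest metrics from the table, Exact Match and token-level F1, are simple enough to write inline in a regression test. A minimal sketch, assuming lowercase whitespace tokenization is good enough for the task (real QA benchmarks usually also strip punctuation and articles):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> bool:
    """Case-insensitive exact match after trimming surrounding whitespace."""
    return prediction.strip().lower() == reference.strip().lower()

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall, counting repeated tokens."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

print(exact_match("within 14 days", "14 days"))  # False: too strict for free text
print(token_f1("within 14 days", "14 days"))     # partial credit for the overlap
```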

What is the main advantage of BERTScore (embedding similarity) over ROUGE?

LLM-as-Judge and RAGAS: A Model Evaluates a Model

LMSYS research (2023): GPT-4's verdicts agree with human preferences in **85%+** of pairwise comparisons - comparable to inter-annotator agreement between humans. That's a breakthrough. It means a strong judge model such as GPT-4o can stand in for human raters on many tasks, at one to two orders of magnitude lower cost per rating, and it never gets tired.

For RAG systems, the standard is **RAGAS** (RAG Assessment). It measures four components: faithfulness (response doesn't contradict the context), answer relevancy (response is relevant to the question), context precision (no noise in the context), context recall (context covers the needed information). Each component is a separate LLM-as-judge call. Langfuse integrates RAGAS directly into the observability pipeline.
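
To make two of these components concrete, the sketch below turns judge verdicts into scores. It follows a simplified reading of how faithfulness and context precision are usually described (share of supported claims, and mean precision@k over relevant chunks). It does not call the RAGAS library, and the boolean verdict lists are assumed to come from separate LLM-judge calls that are out of scope here.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Share of claims in the answer that the judge found supported by the context."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

def context_precision(chunk_relevant: list[bool]) -> float:
    """Mean precision@k, taken over the ranks that hold a relevant chunk."""
    hits, total = 0, 0.0
    for k, relevant in enumerate(chunk_relevant, start=1):
        if relevant:
            hits += 1
            total += hits / k  # precision among the top-k retrieved chunks
    return total / hits if hits else 0.0

# Hypothetical verdicts: 3 of 4 answer claims are grounded in the context,
# and retrieved chunks at ranks 1 and 3 are relevant while rank 2 is noise.
print(faithfulness([True, True, False, True]))
print(context_precision([True, False, True]))
```

Ranking matters for context precision: the same two relevant chunks placed at ranks 1 and 2 would score 1.0.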

**G-Eval** (Liu et al., Microsoft, 2023) is a more reliable variant of LLM-as-judge: the model generates chain-of-thought reasoning before the final score. This reduces score variance and raises correlation with humans to 88%+. For pairwise comparison, order randomization is mandatory: GPT-4 and Claude are documented to give higher scores to the first response in the prompt (position bias).
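
A reasoning-then-score judge with structured output might look like the sketch below. It is written in the spirit of G-Eval rather than as a reproduction of the published method; the prompt wording, the 1-5 scale, and the `call_llm` callable are assumptions, and a stub replaces the real model call so the example runs offline.

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}
Criterion: {criterion}
First explain your reasoning step by step, then give a score from 1 to 5.
Reply with JSON only: {{"reasoning": "<text>", "score": <integer>}}"""

def judge_answer(question: str, answer: str, criterion: str,
                 call_llm: Callable[[str], str]) -> dict:
    """Ask the judge for reasoning plus a score and validate the structured reply."""
    prompt = JUDGE_PROMPT.format(question=question, answer=answer, criterion=criterion)
    parsed = json.loads(call_llm(prompt))
    score = int(parsed["score"])
    if not 1 <= score <= 5:
        raise ValueError(f"score out of range: {score}")
    # Map the 1-5 scale onto 0.0-1.0 so it can be compared with other metrics.
    return {"reasoning": str(parsed["reasoning"]), "score": score,
            "normalized": (score - 1) / 4}

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real model call (hypothetical response)."""
    return '{"reasoning": "Accurate but repeats itself.", "score": 4}'

result = judge_answer("What is the refund window?", "14 days. Two weeks. 14 days.",
                      "conciseness", fake_llm)
print(result)
```

Rejecting malformed or out-of-range replies instead of guessing keeps a broken judge response from quietly skewing the average.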

**Position bias** is a documented problem with LLM-as-Judge. GPT-4 and Claude tend to give higher ratings to the first response in the prompt. Solution: run each evaluation twice with swapped response order and average the results. If (A,B) picks A but (B,A) picks B - that's a tie.
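
The swap-and-compare rule can be sketched as follows. The `judge` callable and its `"first"`/`"second"` return convention are hypothetical; two toy judges replace a real model so the behavior is easy to verify.

```python
from typing import Callable

# judge(question, first, second) returns "first" or "second" (assumed convention).
Judge = Callable[[str, str, str], str]

def debiased_pairwise(question: str, a: str, b: str, judge: Judge) -> str:
    """Run the comparison in both orders; return "A", "B", or "tie" on disagreement."""
    forward = "A" if judge(question, a, b) == "first" else "B"
    backward = "B" if judge(question, b, a) == "first" else "A"
    return forward if forward == backward else "tie"

def always_first(question: str, first: str, second: str) -> str:
    """Toy judge with extreme position bias: it always prefers the first slot."""
    return "first"

def prefers_shorter(question: str, first: str, second: str) -> str:
    """Toy judge with a real, order-independent preference."""
    return "first" if len(first) <= len(second) else "second"

short, long = "14 days", "The refund window is fourteen days long"
print(debiased_pairwise("Refund window?", short, long, always_first))     # tie
print(debiased_pairwise("Refund window?", short, long, prefers_shorter))  # A
```

A purely position-driven verdict collapses to a tie, while a genuine preference survives the swap.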

How can position bias be eliminated in pairwise comparison with LLM-as-Judge?

Human Evaluation: When a Model Can't Evaluate a Model

LLM-as-Judge fails where the judge model has the same blind spots as the model being evaluated. Hallucinations: the judge may not know the right answer and score a lie highly. Safety in medicine: "this medication is safe" - the judge is not a doctor. Cultural nuances: tone, appropriateness, idioms - the model doesn't feel context. These need a human. Human raters are noisy too, so a batch typically includes control tasks with known gold-standard answers to check rater attention and consistency.

Method	Speed	Cost	Reliability	When to Use
Automated (ROUGE, BERTScore)	1000/sec	USD 0	Low-medium	CI/CD: regression detection
LLM-as-Judge (G-Eval, RAGAS)	5-10/sec	USD 0.01-0.10	Medium-high	Daily eval, A/B testing, RAG quality
Human Eval	10-50/hr	USD 1-5 per rating	High	Monthly deep review, safety-critical
Pairwise (LLM)	2-5/sec	USD 0.02-0.20	Medium-high	Comparing models / prompts

Why are "control tasks" (gold standard) included in the human eval batch?

Eval Pipeline in CI/CD: Langfuse and Automated Checks

An eval pipeline in CI/CD is the same idea as regular tests: **don't merge the PR if the eval score drops below the threshold**. The difference - LLM eval uses scoring instead of pass/fail. Every prompt change, model update, or config change is automatically verified against a golden dataset.
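
The blocking rule can be sketched as a small gate script. The dimension names and threshold values below are placeholders, and the mean scores are hard-coded where a real pipeline would read them from the eval run over the golden dataset.

```python
# Hypothetical thresholds; real values depend on the product and the golden dataset.
THRESHOLDS = {"correctness": 0.85, "relevance": 0.80, "conciseness": 0.70}

def failed_dimensions(mean_scores: dict[str, float],
                      thresholds: dict[str, float]) -> list[str]:
    """List dimensions that are missing from the results or below their threshold."""
    return [dim for dim, minimum in thresholds.items()
            if mean_scores.get(dim, 0.0) < minimum]

def main(mean_scores: dict[str, float]) -> int:
    """Print failures and return a process exit code (non-zero fails the CI job)."""
    failures = failed_dimensions(mean_scores, THRESHOLDS)
    for dim in failures:
        print(f"FAIL {dim}: {mean_scores.get(dim, 0.0):.2f} < {THRESHOLDS[dim]:.2f}")
    return 1 if failures else 0

# In a CI script this value would be passed to sys.exit().
exit_code = main({"correctness": 0.91, "relevance": 0.86, "conciseness": 0.62})
print(exit_code)
```

Treating a missing dimension as a failure is a deliberate choice here: an eval that silently stopped reporting a score should block the merge, not pass it.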

**Langfuse** (open-source, USD 0 self-hosted) integrates evaluation directly into observability. Every production trace can be tagged with an eval score: LLM-judge runs async after the response, results aggregate in the dashboard. When the avg score drops below threshold - Slack alert fires. This closes the loop: deploy - tracking - evaluation - alert.
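
The alerting half of that loop can be illustrated independently of any vendor. The sketch below is not the Langfuse SDK: `ScoreMonitor`, the window size, and the `alert` callable (which in practice might post to a Slack webhook) are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable

class ScoreMonitor:
    """Track a rolling mean of judge scores and alert when it falls below a threshold."""

    def __init__(self, threshold: float, window: int, alert: Callable[[str], None]):
        self.threshold = threshold
        self.scores = deque(maxlen=window)  # keeps only the most recent `window` scores
        self.alert = alert

    def record(self, score: float) -> float:
        """Add one eval score, fire the alert if needed, and return the rolling mean."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        # Alert only once the window is full, to avoid noise from the first few traces.
        if len(self.scores) == self.scores.maxlen and avg < self.threshold:
            self.alert(f"avg eval score {avg:.2f} below threshold {self.threshold:.2f}")
        return avg

alerts: list[str] = []
monitor = ScoreMonitor(threshold=0.8, window=3, alert=alerts.append)
for s in (0.9, 0.7, 0.6):
    monitor.record(s)
print(alerts)
```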

Recommended production rhythm: **on every PR** - automated metrics + LLM-judge on a golden dataset of 50-200 cases (~USD 1-8 per run). **Weekly** - extended eval on 500+ cases + RAGAS for RAG components. **Monthly** - human eval on 50-100 cases to calibrate LLM-judge and identify blind spots.
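
Calibrating the LLM-judge in the monthly step comes down to comparing its labels with human labels on the same cases. A minimal sketch, assuming both sides use the same label vocabulary; the case IDs and labels are invented.

```python
def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of cases where the LLM-judge and the human rater gave the same label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)

def disagreements(case_ids: list[str], judge_labels: list[str],
                  human_labels: list[str]) -> list[str]:
    """Case IDs where judge and human differ: candidates for blind-spot review."""
    return [cid for cid, j, h in zip(case_ids, judge_labels, human_labels) if j != h]

case_ids = ["c1", "c2", "c3", "c4"]
judge = ["good", "bad", "good", "good"]
human = ["good", "bad", "bad", "good"]
print(agreement_rate(judge, human))
print(disagreements(case_ids, judge, human))
```

The disagreement list is usually more useful than the rate itself, since it points at the specific cases where the judge prompt needs work.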

What should block merging a PR that changes a system prompt?

Myth: BLEU score measures LLM response quality

BLEU was designed for machine translation (2002), where there is one correct translation. For open-ended generation it's useless: high BLEU doesn't mean a good answer, low BLEU doesn't mean a bad one

BLEU counts n-gram overlap between hypothesis and reference. If the model gives the right answer in different words - BLEU penalizes it. If it copies a nonsensical reference - BLEU rewards it. In academic papers, BLEU still appears for comparing translation systems. In production LLM evaluation - it's an antipattern. Use LLM-as-Judge or embedding similarity instead.

LLM System Evaluation

LLMs can't be tested with exact match: one prompt - many correct answers. Scoring across dimensions is needed
BLEU/ROUGE - for machine translation and summarization. For open-ended LLM generation - the wrong tool
BERTScore via embeddings (text-embedding-3-small, 1536 dim) sees synonyms - the best automated metric
LLM-as-Judge: GPT-4o agrees with human raters at 85%+. G-Eval + chain-of-thought raises it to 88%+. Position bias is mitigated by swapping response order and averaging
RAGAS - standard for RAG: faithfulness, answer relevancy, context precision, context recall
CI/CD eval pipeline: every PR with a prompt change - automated metrics + LLM-judge. Langfuse for production monitoring

What's Next

Evaluation reveals problems. The next step is handling the errors and failures that evaluation uncovers.

Error Handling for LLMs — Eval detects degradation, error handling addresses runtime failures: hallucinations, timeouts, malformed output
Guardrails — Eval + guardrails = defense in depth: eval catches quality issues, guardrails catch safety issues
Observability — Eval scores are part of the observability dashboard. Langfuse connects evaluation with production monitoring

Related Lessons

aie-06-prompt-patterns — Eval measures whether prompts actually work
aie-32-error-handling-llm — Eval surfaces failure modes to handle
aie-33-guardrails — Eval validates guardrail effectiveness
aie-35-observability — Production traces feed offline evaluation
ml-05-evaluation — Same train/test discipline for model quality
stat-05-hypothesis