AI Engineering
Error Handling for LLMs: Hallucinations, Timeouts, Malformed Output, Fallbacks
Lesson objectives
- Classify LLM failure types: timeout, rate limit, malformed output, hallucination
- Implement retry with exponential backoff, jitter, and model fallback
- Build an output validation layer using Zod and JSON repair
- Apply self-consistency and grounding checks for hallucination detection
- Design graceful degradation with a fallback hierarchy
LLM APIs fail differently than REST APIs. A 429 is not a failure - it's routine: wait and retry. A 500 is often transient, so a retry helps. But the main LLM failure is invisible: the model returns 200 with valid JSON, and the content is fabricated. In 2023, US lawyers filed a brief citing six non-existent cases generated by ChatGPT. The judge noticed, and the court imposed a USD 5,000 fine. The model didn't return an error. It computed the next token. Other public incidents followed the same pattern:
- Air Canada: a chatbot promised a bereavement discount on terms the airline's real policy didn't allow - a tribunal ordered the company to honor the bot's promise
- Chevrolet: a sales bot agreed to sell a car for USD 1 after a prompt injection from a buyer
- Samsung: employees pasted confidential source code into ChatGPT - the code left the company for a third-party service, with the risk of ending up in training data
- Stack Overflow: temporarily banned ChatGPT-generated answers because too many of them looked plausible but were wrong
Why "Hallucination" Became an Engineering Problem
The term **hallucination** came from natural language generation research, where it described fluent output that was unfaithful to the source or simply false. As LLMs reached production, this stopped being an academic curiosity: a model can return a 200 with valid JSON and still invent facts, citations, or APIs. Engineers responded by borrowing reliability patterns from distributed systems - **retries with backoff, fallbacks, and circuit breakers** - and applying them to LLM calls, alongside output validation to catch confident but wrong responses. Error handling for LLMs is as much about plausible-but-false output as about timeouts and 429s.
Prerequisites
LLM Failure Modes: timeout, rate limit, malformed output, hallucination
Regular APIs return either an error or a correct response. Two outcomes. LLMs add a third - a **formally correct but factually wrong response**. This isn't an implementation bug. It's a fundamental property of probabilistic generation.
A 429 from OpenAI is not a disaster. It's a signal: slow down, wait, retry. A 500 from the API is usually transient and clears on retry. But `finish_reason: 'length'` is a silent bomb: the response looks fine, the JSON started, it just never finished.
| Failure Type | Example | Detectability |
|---|---|---|
| Timeout | Model generates a long response, connection drops after 30s | Easy - HTTP timeout |
| Rate limit (429) | Exceeded requests per minute or tokens per minute limit | Easy - HTTP status code |
| Malformed output | Asked for JSON, received text wrapped in a markdown `json` code fence | Medium - requires parsing |
| Hallucination | Model confidently names a non-existent API function | Hard - requires verification |
| Partial response | `finish_reason: 'length'` - response truncated at `max_tokens` | Easy - check `finish_reason` |
**finish_reason: 'length'** is one of the sneakiest failures: trivial to detect, but only if you remember to check. The response looks normal but is cut off mid-sentence: JSON without a closing brace, a list missing its last items, code without a return statement. Always check `finish_reason` before parsing.
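The check itself is only a few lines. The sketch below assumes a response shaped like an OpenAI chat completion choice (`finish_reason` plus `message.content`); the `Choice` interface and the `completeTextOrThrow` name are illustrative, not part of any SDK.

```typescript
// Minimal shape of one chat completion choice; only the fields used here.
interface Choice {
  finish_reason: string; // e.g. "stop", "length", "content_filter"
  message: { content: string | null };
}

// Returns the text only if the model was not cut off by the token limit.
function completeTextOrThrow(choice: Choice): string {
  if (choice.finish_reason === "length") {
    // Truncated output: parsing it as JSON would fail or, worse, half-succeed.
    throw new Error("response hit max_tokens and was cut off");
  }
  if (choice.message.content === null) {
    throw new Error("response has no text content");
  }
  return choice.message.content;
}
```

Calling this before `JSON.parse` turns a confusing parse error into an explicit "truncated" error, which can then be retried with a higher token limit or a shorter prompt.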
The core insight: error handling for LLMs isn't just about `catch`. It's about **validating every 200 OK**. The model returned success - now the real work begins. Is the response complete? Does it match the expected format? Is it fabricated?
Myth: if the API returned 200, everything is fine.
Reality: 200 OK only means the HTTP transport worked. Hallucinations, truncated output, semantically invalid JSON - all of these arrive with status 200.
An LLM is a probabilistic text generator. It doesn't 'know' the truth - it computes the most likely next token. A formally successful HTTP response and semantically correct content are two independent things. That's why every 200 must be validated just as rigorously as any error.
Why is error handling for LLMs harder than for regular REST APIs?
Retry Strategies: exponential backoff, jitter, model fallback
A 429 from OpenAI is not a refusal. It's a polite request to slow down. The problem is when 1000 parallel requests all get that 429 simultaneously. And if they all retry after exactly 2 seconds - the server gets hit with 1000 requests again. Thundering herd. A self-inflicted DDoS.
The solution: **exponential backoff with jitter**. The delay grows exponentially (1s, 2s, 4s, 8s...), plus a random spread. 1000 clients scatter across time. Load smooths out. The same pattern is built into the AWS SDKs, Google Cloud client libraries, and the OpenAI SDK, which retries automatically with backoff since version 4.
**Jitter** is the random part of the delay. Instead of sleeping exactly 2 seconds, each client sleeps a random time, for example anywhere between 0 and 2 seconds ("full jitter"), so clients that failed together no longer retry together.
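A minimal sketch of exponential backoff with full jitter follows. It assumes the client throws errors that carry an HTTP `status` field; `withRetry`, `backoffDelayMs`, and the default limits are illustrative choices, not an SDK API. Network timeouts without a status are not retried here and would need their own check.

```typescript
// Delay before retry number `attempt` (0-based):
// a random value in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Assumed error shape: the HTTP status is exposed on the thrown value.
interface HttpError {
  status?: number;
}

// Retry only rate limits (429) and server-side errors (5xx).
function isRetryable(err: unknown): boolean {
  const status = (err as HttpError | null)?.status;
  return status === 429 || (status !== undefined && status >= 500);
}

async function withRetry<T>(call: () => Promise<T>, maxAttempts = 4, baseMs = 1000): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      // Give up on non-retryable errors or when attempts are exhausted.
      if (!isRetryable(err) || attempt >= maxAttempts - 1) throw err;
      await sleep(backoffDelayMs(attempt, baseMs));
    }
  }
}
```

Passing `random` as a parameter keeps the delay calculation deterministic in tests; in production the default `Math.random` is enough.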
**Model fallback** - the chain gpt-4o → gpt-4o-mini → claude-haiku - is the standard for production AI backends. GPT-4o went down? Switch automatically. A flagship model is too expensive for a simple task? gpt-4o-mini is roughly 16x cheaper than gpt-4o at list prices. An `LLMProvider` abstraction makes this switch invisible to business logic.
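A fallback chain can be sketched as a loop over providers. The `LLMProvider` interface below is a hypothetical minimal version (a name and a `complete` method); the lesson's actual abstraction may expose more.

```typescript
// Hypothetical provider interface: one method that returns the completion text.
interface LLMProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

interface FallbackResult {
  text: string;
  provider: string; // which provider actually answered, useful for monitoring
}

// Try providers in order; return the first success, fail only if all fail.
async function completeWithFallback(
  providers: LLMProvider[],
  prompt: string,
): Promise<FallbackResult> {
  const errors: string[] = [];
  for (const provider of providers) {
    try {
      const text = await provider.complete(prompt);
      return { text, provider: provider.name };
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      errors.push(`${provider.name}: ${message}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

Business logic calls `completeWithFallback` with the ordered list and never needs to know which model answered, although logging the `provider` field is worth keeping.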
For malformed output, a dedicated strategy applies: **retry with prompt reinforcement**. First attempt uses a normal prompt. On retry, a hard requirement is added: 'The previous response was invalid JSON. Return ONLY JSON, without markdown wrappers, without explanations.' Usually works on the first retry.
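Retry with prompt reinforcement might look like the sketch below. `generate` stands in for whatever function sends a prompt to the model and returns raw text; it and `generateJson` are assumed names for illustration.

```typescript
// Stand-in for the real model call: prompt in, raw text out.
type Generate = (prompt: string) => Promise<string>;

const REINFORCEMENT =
  "\n\nThe previous response was invalid JSON. " +
  "Return ONLY JSON, without markdown wrappers, without explanations.";

async function generateJson(
  generate: Generate,
  prompt: string,
  maxAttempts = 2,
): Promise<unknown> {
  let currentPrompt = prompt;
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await generate(currentPrompt);
    try {
      return JSON.parse(raw);
    } catch (err) {
      lastError = err;
      // Reinforce only after a failure, so the first attempt stays clean.
      currentPrompt = prompt + REINFORCEMENT;
    }
  }
  throw new Error(`no valid JSON after ${maxAttempts} attempts: ${String(lastError)}`);
}
```

The result is typed `unknown` on purpose: successful parsing says nothing about the shape of the data, which still has to pass schema validation.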
Why add jitter to exponential backoff?
Output Validation: Zod parsing, JSON repair, structured output
LLMs generate text. Business logic expects typed data. Between them is a gap. The model was asked to return JSON. It returned something that looks like JSON, wrapped in ```json, with a trailing comma before `}`, and a 'Here is the result:' preamble at the top.
The validation layer is a three-step pipeline. First, extract JSON from whatever wrapper the model chose. Then repair what can be repaired. Then run it through a Zod schema. Each step is an independent line of defense.
For cases where the JSON is 'almost correct' - missing comma, trailing comma before `}`, unclosed quote - the **jsonrepair** library handles most of the invalid output LLMs produce in practice.
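The three steps can be sketched without dependencies to show how they fit together. In the lesson's stack, the `jsonrepair` function would take the place of the deliberately tiny `repairJson` below, and a Zod schema's `safeParse` would take the place of the hand-written `isSentiment` guard. All names and the `Sentiment` shape are illustrative assumptions.

```typescript
interface Sentiment {
  sentiment: "positive" | "negative" | "neutral";
  score: number;
}

// Step 1: drop preambles and markdown fences by slicing from the first "{"
// to the last "}". Handles a single top-level object only.
function extractJson(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("no JSON object found");
  return raw.slice(start, end + 1);
}

// Step 2: naive repair that only removes trailing commas before } or ].
// It would also touch such commas inside string values; a real repair
// library is more careful.
function repairJson(text: string): string {
  return text.replace(/,\s*([}\]])/g, "$1");
}

// Step 3: runtime check that the parsed value has the expected shape.
function isSentiment(value: unknown): value is Sentiment {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const label = v["sentiment"];
  return (
    (label === "positive" || label === "negative" || label === "neutral") &&
    typeof v["score"] === "number"
  );
}

function parseSentiment(raw: string): Sentiment {
  const parsed: unknown = JSON.parse(repairJson(extractJson(raw)));
  if (!isSentiment(parsed)) throw new Error("response does not match the expected schema");
  return parsed;
}
```

Each step fails with its own error, so logs show whether the model wrapped the JSON, broke it, or returned the wrong shape.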
**Structured Output (strict: true)** solves the format problem but NOT the content problem. The model will return valid JSON with sentiment: 'positive' - but the sentiment might be wrong. Format is HTTP 200. Correctness is an entirely different question.
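For reference, a request body with strict structured output could look like the fragment below. Field names follow OpenAI's Chat Completions `response_format` with `json_schema` and should be checked against the current API reference; the schema name, prompt, and model choice are illustrative.

```typescript
// Request body sketch: the schema constrains the FORMAT of the answer only.
const sentimentRequest = {
  model: "gpt-4o-mini",
  messages: [
    { role: "user", content: "Classify the sentiment: 'Great product, terrible delivery.'" },
  ],
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "sentiment_result", // illustrative schema name
      strict: true,
      schema: {
        type: "object",
        properties: {
          sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
        },
        required: ["sentiment"],
        additionalProperties: false,
      },
    },
  },
};
```

The schema guarantees that `sentiment` is one of three strings. Whether "positive" is the right label for a mixed review is exactly the part it cannot guarantee.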
What does Zod validation do when processing an LLM response?
Hallucination Detection: confidence scoring, self-consistency, grounding
In 2023, US lawyers filed a brief citing six court decisions generated by ChatGPT. The cases sounded convincing, the citations were properly formatted, the case numbers looked real. None of them existed. The model didn't know it was hallucinating. It computed the most likely next token.
Hallucination is the sneakiest type of LLM error. In production, it's not a bug - it's a **systemic risk**. No HTTP status, no exception, no stack trace. Just a confident answer containing facts that were never true.
- **Self-consistency** - ask the same question N times with temperature > 0. If answers diverge - the model is uncertain
- **Grounding check** - verify whether the answer is based on provided context (RAG) or made up from scratch
- **Confidence scoring** - ask the model to rate its confidence (0-1) and reject answers below a threshold; self-reported confidence is poorly calibrated, so treat it as a weak signal
- **Cross-model verification** - query two models and compare their answers
- **Fact extraction + lookup** - extract claims from the answer and verify against a database
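Of the techniques above, self-consistency is the simplest to sketch. The version below assumes short answers that can be compared after normalization (a label, a number, a name); free-form text would need a semantic comparison instead. `sample` stands in for one model call with temperature > 0, and all names and thresholds are illustrative.

```typescript
// Stand-in for one model call with temperature > 0.
type Sample = () => Promise<string>;

interface ConsistencyResult {
  answer: string; // most frequent normalized answer
  agreement: number; // share of samples that gave it, 0..1
  consistent: boolean;
}

const normalize = (s: string) => s.trim().toLowerCase();

async function selfConsistency(
  sample: Sample,
  n = 3,
  minAgreement = 2 / 3,
): Promise<ConsistencyResult> {
  if (n < 1) throw new Error("n must be at least 1");
  // Ask the same question n times in parallel.
  const answers = await Promise.all(Array.from({ length: n }, () => sample()));
  const counts = new Map<string, number>();
  for (const a of answers) {
    const key = normalize(a);
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Pick the majority answer.
  let best = "";
  let bestCount = 0;
  counts.forEach((count, key) => {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  });
  const agreement = bestCount / n;
  return { answer: best, agreement, consistent: agreement >= minAgreement };
}
```

Agreement is evidence of stability, not of truth: a model can be consistently wrong, which is why this check is usually combined with grounding.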
**Self-consistency with 3 requests** costs 3x. For healthcare, finance, legal documents - this is justified: the cost of an error is higher than three API calls. For less critical tasks, a grounding check is enough - 1 additional call at a reasonable cost.
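A grounding check as a single extra call could be sketched as follows. `generate` stands in for the model call, and the judge prompt wording and the SUPPORTED / UNSUPPORTED protocol are assumptions made for this example, not a standard.

```typescript
// Stand-in for the real model call: prompt in, raw text out.
type Generate = (prompt: string) => Promise<string>;

function buildGroundingPrompt(answer: string, context: string): string {
  return [
    "You are a strict fact checker.",
    "Reply with exactly one word: SUPPORTED if every claim in the ANSWER",
    "is backed by the CONTEXT, otherwise UNSUPPORTED.",
    "",
    `CONTEXT:\n${context}`,
    "",
    `ANSWER:\n${answer}`,
  ].join("\n");
}

// One additional call. Anything other than a clear SUPPORTED counts as
// ungrounded, so an unclear verdict fails closed.
async function isGrounded(
  generate: Generate,
  answer: string,
  context: string,
): Promise<boolean> {
  const verdict = await generate(buildGroundingPrompt(answer, context));
  return verdict.trim().toUpperCase().startsWith("SUPPORTED");
}
```

The judge is itself an LLM and can be wrong, so this lowers the hallucination rate rather than eliminating it.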
How does self-consistency check work for hallucination detection?
Graceful Degradation: fallback responses, cached answers, human handoff
All retries failed. The model is hallucinating. The timeout expired. What does the system show the user? A 500 error? A blank screen? Or - an honest, clear message: 'We couldn't process this right now. A specialist has been notified.'
**Graceful degradation** is a hierarchy: from ideal to minimally acceptable. GPT-4o with full context is the ideal. A pre-written static response is the floor. Between them - several steps, each one better than an error.
**Example for a customer support bot:**
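One way such a chain could look, as a minimal sketch: each step either produces a reply or gives up, and the final step is a static message plus a human handoff. The `Step` type, the `source` labels, and `notifyHuman` are hypothetical names chosen for this example.

```typescript
type Source = "primary" | "secondary" | "cache" | "fallback";

interface SupportReply {
  text: string;
  source: Source; // which level of the hierarchy produced the reply
}

// A step resolves to a reply text, or to null / an exception when it cannot answer.
interface Step {
  source: Source;
  run: (question: string) => Promise<string | null>;
}

const STATIC_FALLBACK =
  "We couldn't process this right now. A specialist has been notified.";

async function answerWithDegradation(
  question: string,
  steps: Step[],
  notifyHuman: (question: string) => void,
): Promise<SupportReply> {
  for (const step of steps) {
    try {
      const text = await step.run(question);
      if (text !== null) return { text, source: step.source };
    } catch {
      // This level failed: fall through to the next, cheaper one.
    }
  }
  // The floor: a pre-written message and a handoff to a human.
  notifyHuman(question);
  return { text: STATIC_FALLBACK, source: "fallback" };
}
```

Returning `source` with every reply costs nothing and lets monitoring see which level of the hierarchy is actually serving users.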
**The source metric** is gold for monitoring. 95% of responses from primary - everything is fine. 20% from cache or fallback - something is broken and needs investigation. This is the earliest quality degradation signal in an AI system - well before user complaints start arriving.
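Turning the metric into an alert takes only a counter over recent replies. The sketch below assumes each reply is tagged with one of four `source` labels; the labels and the 20% default threshold mirror the numbers above but are otherwise arbitrary choices.

```typescript
type Source = "primary" | "secondary" | "cache" | "fallback";

// Share of recent replies that came from cache or the static fallback.
function degradedShare(sources: Source[]): number {
  if (sources.length === 0) return 0;
  const degraded = sources.filter((s) => s === "cache" || s === "fallback").length;
  return degraded / sources.length;
}

// Alert when degraded replies reach the threshold within the window.
function shouldAlert(sources: Source[], threshold = 0.2): boolean {
  return degradedShare(sources) >= threshold;
}
```

In practice the window would be time-based (for example, the last five minutes) and fed from the same logs that record each reply's `source`.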
What is the LAST fallback in a graceful degradation chain?
Key Takeaways
- LLMs can return 200 OK with a hallucination - checking HTTP status is not enough
- finish_reason: 'length' - a truncated response, dangerous for JSON parsing
- Exponential backoff + jitter + model fallback (gpt-4o → gpt-4o-mini → claude-haiku) - protection against all retryable errors
- Zod + jsonrepair - reliable parsing of invalid LLM output
- Self-consistency (N requests) and grounding check - hallucination detection
- Graceful degradation: primary → secondary → cache → fallback → human handoff
What's Next
Error handling protects against technical failures. But LLMs can also generate toxic, unsafe, or manipulative content - and that's the job of guardrails.
- Guardrails: LLM Safety — Input/output filtering, NeMo Guardrails, defense-in-depth
- Observability for AI Pipeline — Error monitoring, quality drift, cost tracking in production
- Cost Management & Optimization — Self-consistency costs 3x - when it's justified vs. when it's overkill
Related lessons
- aie-07-structured-output — Malformed JSON is a core failure to handle
- aie-33-guardrails — Error handling backs guardrail enforcement
- aie-30-rate-limiting-ai — Retry with backoff on 429 and timeouts
- net-66-resilience — Apply retries, timeouts and circuit breakers
- aie-29-cost-management — Retries multiply cost without caps
- sd-03-scalability