AI Engineering

Error Handling for LLMs: Hallucinations, Timeouts, Malformed Output, Fallbacks

Lesson Objectives

Classify LLM failure types: timeout, rate limit, malformed output, hallucination
Implement retry with exponential backoff, jitter, and model fallback
Build an output validation layer using Zod and JSON repair
Apply self-consistency and grounding checks for hallucination detection
Design graceful degradation with a fallback hierarchy

LLM APIs fail differently than REST APIs. 429 is not an error - it's normal: wait and retry. 500 is often temporary: retry helps. But the main LLM failure is invisible: the model returned 200, JSON is valid - and it completely lied. A US lawyer in 2023 filed a brief with 6 non-existent precedents generated by GPT. The judge noticed. The lawyer was fined USD 5,000. The model didn't return an error. It computed the next token.

Air Canada: a chatbot promised a ticket discount that didn't exist - the court ordered the company to honor the bot's promise
Chevrolet: a sales bot agreed to sell a car for USD 1 after a prompt injection from a buyer
Samsung: employees pasted confidential source code into ChatGPT - the code left the company and could be retained by a third party for model training
Stack Overflow: temporarily banned GPT-generated answers - 80% of reviewed answers contained factual errors

Why "Hallucination" Became an Engineering Problem

The term **hallucination** came from natural language generation research, where it described fluent output that was unfaithful to the source or simply false. As LLMs reached production, this stopped being an academic curiosity: a model can return a 200 with valid JSON and still invent facts, citations, or APIs. Engineers responded by borrowing reliability patterns from distributed systems - **retries with backoff, fallbacks, and circuit breakers** - and applying them to LLM calls, alongside output validation to catch confident but wrong responses. Error handling for LLMs is as much about plausible-but-false output as about timeouts and 429s.

Prerequisites

Structured Output: Getting LLMs to Return JSON, Schemas, and Typed Data

LLM Failure Modes: timeout, rate limit, malformed output, hallucination

Regular APIs return either an error or a correct response. Two outcomes. LLMs add a third - a **formally correct but factually wrong response**. This isn't an implementation bug. It's a fundamental property of probabilistic generation.

A 429 from OpenAI is not a disaster. It's a signal: slow down, wait, retry. A 500 from the API usually vanishes in a second. But `finish_reason: 'length'` is a silent bomb. The response looks fine. The JSON started. It just never finished.

| Failure Type | Example | Detectability |
| --- | --- | --- |
| Timeout | Model generates a long response, connection drops after 30s | Easy - HTTP timeout |
| Rate limit (429) | Exceeded requests per minute or tokens per minute limit | Easy - HTTP status code |
| Malformed output | Asked for JSON, received text with a `` ```json...``` `` markdown wrapper | Medium - requires parsing |
| Hallucination | Model confidently names a non-existent API function | Hard - requires verification |
| Partial response | `finish_reason: 'length'` - response truncated at max_tokens | Easy - check finish_reason |

**finish_reason: 'length'** is one of the sneakiest errors. The response looks normal but is cut off mid-sentence. JSON without a closing brace, a list missing its last items, code without a return statement. Always check finish_reason before parsing.
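A minimal sketch of that check, assuming an OpenAI-style response in which each choice carries `finish_reason` and `message.content`. The `Choice` and `ParseResult` types and the `parseCompleteJson` helper are illustrative names, not part of any SDK.

```typescript
// Assumed minimal shape of one choice in an OpenAI-style chat completion.
interface Choice {
  finish_reason: string | null;
  message: { content: string | null };
}

type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; reason: "truncated" | "other_finish_reason" | "no_content" | "invalid_json" };

// Inspect finish_reason first; only a response that ended with "stop" is parsed.
function parseCompleteJson(choice: Choice): ParseResult {
  if (choice.finish_reason === "length") return { ok: false, reason: "truncated" };
  if (choice.finish_reason !== "stop") return { ok: false, reason: "other_finish_reason" };
  if (choice.message.content === null) return { ok: false, reason: "no_content" };
  try {
    return { ok: true, value: JSON.parse(choice.message.content) };
  } catch {
    return { ok: false, reason: "invalid_json" };
  }
}
```

Returning a tagged result instead of throwing lets the caller pick a different recovery per reason, for example raising max_tokens on `truncated` but reinforcing the prompt on `invalid_json`.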

The core insight: error handling for LLMs isn't just about `catch`. It's about **validating every 200 OK**. The model returned success - now the real work begins. Is the response complete? Does it match the expected format? Is it fabricated?

Myth: if the API returned 200, everything is fine.

Reality: 200 OK only means the HTTP transport worked. Hallucinations, truncated output, and semantically invalid JSON all arrive with status 200.

An LLM is a probabilistic text generator. It doesn't 'know' the truth - it computes the most likely next token. A formally successful HTTP response and semantically correct content are two independent things. That's why every 200 must be validated just as rigorously as any error.

Why is error handling for LLMs harder than for regular REST APIs?

Retry Strategies: exponential backoff, jitter, model fallback

A 429 from OpenAI is not a refusal. It's a polite request to slow down. The problem is when 1000 parallel requests all get that 429 simultaneously. And if they all retry after exactly 2 seconds - the server gets hit with 1000 requests again. Thundering herd. A self-inflicted DDoS.

The solution: **exponential backoff with jitter**. The delay grows exponentially (1s, 2s, 4s, 8s...), and each client adds its own random spread, so 1000 clients scatter across time and the load smooths out. The AWS SDKs and Google Cloud client libraries retry this way, and the official OpenAI SDK has shipped built-in retries with backoff since version 4.

**Jitter** is the part that solves the thundering herd problem: backoff alone only makes synchronized clients wait longer together, while the random spread makes them arrive at different moments.
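The retry loop described above can be sketched as follows. This uses the "full jitter" variant (a uniformly random delay between 0 and the exponential ceiling), which is one common choice; adding a smaller random offset to the exponential delay is another. The function names, the default limits, and the assumption that errors expose a numeric `status` are all illustrative.

```typescript
// Delay before retry number `attempt` (0-based): exponential ceiling, capped, full jitter.
function backoffDelayMs(
  attempt: number,
  baseMs = 1000,
  capMs = 30_000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling); // uniform in [0, ceiling)
}

// Assumption: the client throws errors that carry the HTTP status code.
interface HttpLikeError { status?: number }

// 429 and 5xx are worth retrying; other statuses (and errors without one) are not, in this sketch.
function isRetryable(err: unknown): boolean {
  const status = (err as HttpLikeError | null)?.status;
  return status === 429 || (status !== undefined && status >= 500);
}

async function withRetry<T>(
  call: () => Promise<T>,
  maxRetries = 4,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await call();
    } catch (err) {
      lastError = err;
      if (!isRetryable(err)) break;
      if (attempt < maxRetries) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

The `sleep` and `random` parameters exist so the timing can be replaced in tests; production code would normally leave them at their defaults.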

**Model fallback** - the chain gpt-4o → gpt-4o-mini → claude-haiku - is a common pattern for production AI backends. GPT-4o went down? Switch automatically. A flagship model too expensive for a simple task? gpt-4o-mini costs roughly 16x less per token, depending on current list prices. The LLMProvider abstraction makes this switch invisible to business logic.
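A minimal sketch of such a chain. The `LLMProvider` interface here is a hypothetical stand-in with a single `complete` method; the abstraction this lesson refers to may look different.

```typescript
// Hypothetical provider interface, assumed for illustration.
interface LLMProvider {
  name: string;
  complete(prompt: string): Promise<string>;
}

interface FallbackResult {
  text: string;
  provider: string;   // which model actually answered
  failures: string[]; // what went wrong on the way there
}

// Try providers in order and return the first successful completion.
async function completeWithFallback(
  providers: LLMProvider[],
  prompt: string,
): Promise<FallbackResult> {
  const failures: string[] = [];
  for (const provider of providers) {
    try {
      const text = await provider.complete(prompt);
      return { text, provider: provider.name, failures };
    } catch (err) {
      failures.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`All providers failed: ${failures.join("; ")}`);
}
```

Business logic only sees `completeWithFallback`; returning the provider name alongside the text keeps the switch observable in logs and metrics.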

For malformed output, a dedicated strategy applies: **retry with prompt reinforcement**. First attempt uses a normal prompt. On retry, a hard requirement is added: 'The previous response was invalid JSON. Return ONLY JSON, without markdown wrappers, without explanations.' Usually works on the first retry.
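As a sketch, prompt reinforcement is a loop that changes the prompt only after a parse failure. The `ask` parameter stands for any function that sends a prompt and returns the model's text; it and `completeJson` are hypothetical names.

```typescript
const REINFORCEMENT =
  "The previous response was invalid JSON. Return ONLY JSON, without markdown wrappers, without explanations.";

async function completeJson(
  ask: (prompt: string) => Promise<string>,
  prompt: string,
  maxAttempts = 2,
): Promise<unknown> {
  let currentPrompt = prompt;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const raw = await ask(currentPrompt);
    try {
      return JSON.parse(raw);
    } catch {
      // Only retries carry the extra instruction; the first attempt uses the normal prompt.
      currentPrompt = `${prompt}\n\n${REINFORCEMENT}`;
    }
  }
  throw new Error(`No valid JSON after ${maxAttempts} attempts`);
}
```

Unlike a backoff retry, this one should not wait: the request succeeded at the HTTP level, so the fix is a different prompt, not a later moment.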

Why add jitter to exponential backoff?

Output Validation: Zod parsing, JSON repair, structured output

LLMs generate text. Business logic expects typed data. Between them is a gap. The model was asked to return JSON. It returned something that looks like JSON, wrapped in a `` ```json `` fence, with a trailing comma before `}`, and a 'Here is the result:' preamble at the top.

The validation layer is a three-step pipeline. First, extract JSON from whatever wrapper the model chose. Then repair what can be repaired. Then run it through a Zod schema. Each step is an independent line of defense.

For cases where the JSON is 'almost correct' - missing comma, trailing comma before `}`, unclosed quote - the **jsonrepair** library fixes most of what LLMs get wrong in practice (roughly 90% of invalid output, as a rule of thumb).
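The extract, repair, validate pipeline can be sketched without dependencies. To keep the example self-contained, the repair and validation steps below are small hand-written stand-ins: in a real project the `jsonrepair` function from the jsonrepair package would replace `repairTrailingCommas`, and a Zod schema's `safeParse` would replace `isSentiment`. The `Sentiment` shape is an assumption made for illustration.

```typescript
interface Sentiment {
  sentiment: "positive" | "negative" | "neutral";
  score: number; // assumed to be a 0..1 value
}

// Step 1: cut the JSON object out of any preamble or markdown fence around it.
function extractJson(raw: string): string {
  const start = raw.indexOf("{");
  const end = raw.lastIndexOf("}");
  if (start === -1 || end <= start) throw new Error("No JSON object found");
  return raw.slice(start, end + 1);
}

// Step 2: naive repair that only drops trailing commas before } or ].
// It can corrupt string values that contain ",}", which a real repair library avoids.
function repairTrailingCommas(json: string): string {
  return json.replace(/,\s*([}\]])/g, "$1");
}

// Step 3: check the shape by hand (the role a schema validator plays).
function isSentiment(value: unknown): value is Sentiment {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const label = v.sentiment;
  const score = v.score;
  return (
    (label === "positive" || label === "negative" || label === "neutral") &&
    typeof score === "number" && score >= 0 && score <= 1
  );
}

function parseSentiment(raw: string): Sentiment {
  const parsed: unknown = JSON.parse(repairTrailingCommas(extractJson(raw)));
  if (!isSentiment(parsed)) throw new Error("Response does not match the Sentiment schema");
  return parsed;
}
```

Each step fails independently, so a thrown error says which line of defense caught the problem: nothing to extract, unparseable after repair, or parseable but the wrong shape.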

**Structured Output (strict: true)** solves the format problem but NOT the content problem. The model will return valid JSON with sentiment: 'positive' - but the sentiment might be wrong. Format is HTTP 200. Correctness is an entirely different question.

What does Zod validation do when processing an LLM response?

Hallucination Detection: confidence scoring, self-consistency, grounding

In 2023, an American lawyer filed a brief in which GPT generated 6 court precedents. The cases sounded convincing, the citations were properly formatted, the case numbers looked real. None of them existed. The model didn't know it was hallucinating. It computed the most likely next token.

Hallucination is the sneakiest type of LLM error. In production, it's not a bug - it's a **systemic risk**. No HTTP status, no exception, no stack trace. Just a confident answer containing facts that were never true.

**Self-consistency** - ask the same question N times with temperature > 0. If answers diverge - the model is uncertain
**Grounding check** - verify whether the answer is based on provided context (RAG) or made up from scratch
**Confidence scoring** - ask the model to rate its confidence (0-1) and reject answers below a threshold
**Cross-model verification** - query two models and compare their answers
**Fact extraction + lookup** - extract claims from the answer and verify against a database
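A minimal sketch of the self-consistency check from the list above. It assumes `ask` performs one sampled completion at temperature > 0 and that answers are short enough (a label, a number, a name) to compare after simple normalization; free-text answers would need a semantic comparison instead. The function name and the 2/3 agreement threshold are illustrative.

```typescript
// Ask the same question n times; accept the majority answer only if agreement is high enough.
async function selfConsistent(
  ask: () => Promise<string>,
  n = 3,
  minAgreement = 2 / 3,
): Promise<{ answer: string; agreement: number } | null> {
  const answers = await Promise.all(Array.from({ length: n }, () => ask()));

  // Count normalized answers.
  const counts = new Map<string, number>();
  for (const a of answers) {
    const key = a.trim().toLowerCase();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }

  // Find the most frequent answer.
  let best = "";
  let bestCount = 0;
  counts.forEach((count, key) => {
    if (count > bestCount) {
      best = key;
      bestCount = count;
    }
  });

  const agreement = bestCount / n;
  // null means "the model is uncertain": route to a fallback instead of answering.
  return agreement >= minAgreement ? { answer: best, agreement } : null;
}
```

Agreement is evidence of stability, not of truth: a model can repeat the same wrong answer N times, which is why the list pairs this check with grounding and lookup.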

**Self-consistency with 3 requests** costs 3x. For healthcare, finance, legal documents - this is justified: the cost of an error is higher than three API calls. For less critical tasks, a grounding check is enough - 1 additional call at a reasonable cost.
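The single extra call for a grounding check can be sketched as a judge prompt plus a strict verdict parser. The prompt wording, the SUPPORTED/UNSUPPORTED protocol, and the function names are assumptions for illustration; `ask` is any function that sends a prompt to a model and returns its text.

```typescript
// Build the judge prompt: the model sees only the retrieved context and the answer to verify.
function buildGroundingPrompt(context: string, answer: string): string {
  return [
    "You are a strict fact checker.",
    "Reply with exactly SUPPORTED if every claim in the ANSWER is backed by the CONTEXT, otherwise reply with exactly UNSUPPORTED.",
    `CONTEXT:\n${context}`,
    `ANSWER:\n${answer}`,
  ].join("\n\n");
}

async function isGrounded(
  ask: (prompt: string) => Promise<string>,
  context: string,
  answer: string,
): Promise<boolean> {
  const verdict = (await ask(buildGroundingPrompt(context, answer))).trim().toUpperCase();
  // Fail closed: anything other than an exact SUPPORTED counts as not grounded.
  return verdict === "SUPPORTED";
}
```

The judge is itself an LLM and can be wrong, so this lowers the hallucination rate rather than eliminating it.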

How does self-consistency check work for hallucination detection?

Graceful Degradation: fallback responses, cached answers, human handoff

All retries failed. The model is hallucinating. The timeout expired. What does the system show the user? A 500 error? A blank screen? Or - an honest, clear message: 'We couldn't process this right now. A specialist has been notified.'

**Graceful degradation** is a hierarchy: from ideal to minimally acceptable. GPT-4o with full context is the ideal. A pre-written static response is the floor. Between them - several steps, each one better than an error.

**Example for a customer support bot:**
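One possible shape of that hierarchy, as a minimal sketch. The tier names follow the chain primary → secondary → cache → fallback → human handoff; the `Tier` interface, `answerWithDegradation`, and `notifyHuman` are hypothetical names, and each tier is assumed to return `null` when it has no acceptable answer.

```typescript
type Source = "primary" | "secondary" | "cache" | "fallback";

interface BotReply {
  text: string;
  source: Source;     // which tier produced the reply
  escalated: boolean; // true when a human was notified
}

interface Tier {
  source: Exclude<Source, "fallback">;
  answer: (question: string) => Promise<string | null>;
}

const STATIC_FALLBACK = "We couldn't process this right now. A specialist has been notified.";

async function answerWithDegradation(
  question: string,
  tiers: Tier[],
  notifyHuman: (question: string) => void,
): Promise<BotReply> {
  for (const tier of tiers) {
    try {
      const text = await tier.answer(question);
      if (text !== null) return { text, source: tier.source, escalated: false };
    } catch {
      // A failed tier is not an error for the user; fall through to the next one.
    }
  }
  // The floor: a pre-written message plus a handoff, never a raw 500.
  notifyHuman(question);
  return { text: STATIC_FALLBACK, source: "fallback", escalated: true };
}
```

Every reply carries its `source`, so the caller can log which tier answered without inspecting the text.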

**The source metric** - which tier produced each response (primary, secondary, cache, or fallback) - is gold for monitoring. If 95% of responses come from primary, everything is fine. If 20% come from cache or fallback, something is broken and needs investigation. This is one of the earliest quality degradation signals in an AI system, visible well before user complaints start arriving.
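A sketch of how that signal could be computed over a window of recent responses. The 20% threshold comes from the paragraph above; the function names and the idea of passing a plain array of source labels are assumptions for illustration.

```typescript
type Source = "primary" | "secondary" | "cache" | "fallback";

// Count how many responses in the window came from each tier.
function countSources(sources: Source[]): Record<Source, number> {
  const counts: Record<Source, number> = { primary: 0, secondary: 0, cache: 0, fallback: 0 };
  for (const s of sources) counts[s] += 1;
  return counts;
}

// Share of responses served from cache or the static fallback.
function degradedShare(sources: Source[]): number {
  if (sources.length === 0) return 0;
  const counts = countSources(sources);
  return (counts.cache + counts.fallback) / sources.length;
}

// Alert when the degraded share reaches the threshold (20% by default).
function shouldAlert(sources: Source[], threshold = 0.2): boolean {
  return degradedShare(sources) >= threshold;
}
```

In practice the window would come from a metrics store rather than an in-memory array, but the ratio being watched is the same.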

What is the LAST fallback in a graceful degradation chain?

Key Takeaways

LLMs can return 200 OK with a hallucination - checking HTTP status is not enough
finish_reason: 'length' - a truncated response, dangerous for JSON parsing
Exponential backoff + jitter + model fallback (gpt-4o → gpt-4o-mini → claude-haiku) - the standard defense against retryable errors (429, 5xx, timeouts)
Zod + jsonrepair - reliable parsing of invalid LLM output
Self-consistency (N requests) and grounding check - hallucination detection
Graceful degradation: primary → secondary → cache → fallback → human handoff

What's Next

Error handling protects against technical failures. But LLMs can also generate toxic, unsafe, or manipulative content - and that's the job of guardrails.

Guardrails: LLM Safety — Input/output filtering, NeMo Guardrails, defense-in-depth
Observability for AI Pipeline — Error monitoring, quality drift, cost tracking in production
Cost Management & Optimization — Self-consistency costs 3x - when it's justified vs. when it's overkill

Related Lessons

aie-07-structured-output — Malformed JSON is a core failure to handle
aie-33-guardrails — Error handling backs guardrail enforcement
aie-30-rate-limiting-ai — Retry with backoff on 429 and timeouts
net-66-resilience — Apply retries, timeouts and circuit breakers
aie-29-cost-management — Retries multiply cost without caps
sd-03-scalability
