AI Engineering

Orchestration Patterns: routing, fallback, chain, map-reduce, branching

Lesson objectives

Master sequential chain - a processing conveyor with context passing between steps
Implement a parallel (fan-out) pipeline with concurrency control and rate limiting
Build a routing/branching system combining rule-based and LLM-based classification
Apply map-reduce for processing documents that exceed the context window
Implement fallback chains with retry, timeout, cascade, and hedged requests

A pipeline is a conveyor. Each step receives the output of the previous one. If step 3 of 7 fails, the entire chain stops. That's how most AI systems work in production right now. Orchestration is about degrading gracefully instead of crashing: fan-out instead of sequential calls, map-reduce for 200K-token documents, hedged requests for P99 latency, and BullMQ so async tasks don't vanish into the void.

Stripe AI - cascade fallback across 3 LLM providers, 99.99% availability - every OpenAI outage is invisible to customers
Notion AI - parallel fan-out: analyzes a document simultaneously for structure, tone, and key ideas, saving 60% of time
Linear AI - conditional routing: bug report → technical model with RAG over docs, feature request → product model without RAG
Cursor (IDE) - map-reduce for repository analysis: each file summarized separately, then summaries merged for 500K+ token architecture understanding
LangGraph 0.2 (2024) - state machine for agent orchestration: explicit graph instead of magic chains, checkpoint for human-in-the-loop

Prerequisites

LangChain and LlamaIndex: Orchestrating LLM Pipelines

Anthropic codifies orchestration patterns

For the first two years of the LLM-app era, orchestration patterns were scattered across blog posts, talks, and framework code with no shared vocabulary. On December 19, 2024, Anthropic published Building Effective Agents, and that piece pulled the practice into a common language. The main distinction is between workflows, where several LLM calls are connected along predefined paths, and agents, where the model decides which steps to take on its own. The document named the building blocks: prompt chaining (a sequential chain), routing (classify a request and send it down the right branch), parallelization (run calls in parallel and aggregate), orchestrator-workers, and evaluator-optimizer. Anthropic's overall advice is to start with the simplest solution and add complexity only when it actually pays off.

Sequential Chain: a processing conveyor

A pipeline is a conveyor. Each step receives the output of the previous one. If step 3 of 7 fails - the entire chain stops. Sequential chain is the simplest and most common pattern, and also the most common culprit when an AI feature goes dark at 3 AM.

Where sequential chain runs in production right now:

**Content moderation pipeline** - classification → filtering → response generation (order is critical: no point spending `USD 0.01` on a response to a toxic request)
**Translation with quality check** - translate → assess quality → correct
**Code review bot** - parse diff → analyze against rules → generate comments
**Customer support** - classify ticket → extract entities → generate response

In a sequential chain, **step order is critical**. Moderation must precede generation, so no money is spent answering a toxic request. Classification must precede generation, because its result selects the right system prompt. If steps are independent, that is a signal to use the parallel pattern instead.
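A minimal sketch of the conveyor described above, assuming each step is an async function that reads the accumulated context and returns new fields to merge into it. The names `runChain`, `ChainError`, and `supportChain` are hypothetical, and the LLM calls are replaced by stubs. When a step fails, the chain stops, but the error keeps the step name and the partial context so the caller can decide what to do next.

```typescript
// Accumulated data that flows through the chain.
type Context = Record<string, unknown>;
type Step = { name: string; run: (ctx: Context) => Promise<Context> };

// Error that records which step failed and what had been computed before it.
class ChainError extends Error {
  readonly stepName: string;
  readonly partial: Context;
  constructor(stepName: string, partial: Context, cause: unknown) {
    super(`Step "${stepName}" failed: ${String(cause)}`);
    this.stepName = stepName;
    this.partial = partial;
  }
}

// Runs steps strictly in order; each step sees the output of all previous steps.
async function runChain(steps: Step[], input: Context): Promise<Context> {
  let ctx: Context = { ...input };
  for (const step of steps) {
    try {
      ctx = { ...ctx, ...(await step.run(ctx)) };
    } catch (err) {
      // The chain stops here; later steps never run.
      throw new ChainError(step.name, ctx, err);
    }
  }
  return ctx;
}

// Hypothetical customer-support chain; every `run` stands in for an LLM call.
const supportChain: Step[] = [
  {
    name: "classify",
    run: async (ctx) => ({
      category: String(ctx.text).includes("refund") ? "billing" : "general",
    }),
  },
  {
    name: "extract",
    run: async (ctx) => ({ entities: String(ctx.text).match(/#\d+/g) ?? [] }),
  },
  {
    name: "respond",
    run: async (ctx) => ({
      reply: `[${String(ctx.category)}] We are looking into ${(ctx.entities as string[]).join(", ")}`,
    }),
  },
];
```

Calling `runChain(supportChain, { text: "refund for order #42" })` passes the category and the extracted order number to the final step. A production version would typically persist `partial` so a retry can resume from the failed step.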

In a 4-step sequential chain, the third step returns an error. What happens?

Parallel / Fan-Out: concurrent processing

LLM calls are slow - from 500ms to 30s per request. GPT-4o TTFT averages around 800ms, Claude Sonnet around 1s. If two pipeline steps are independent and run sequentially - that's not architecture, it's wasted time. **Fan-out** launches multiple operations simultaneously and collects results at fan-in.

Classic example: analyzing a candidate's resume. Evaluating skills, culture fit, and red flags - three independent LLM calls. Running them sequentially (3.6s) when they can go parallel (1.5s) is architectural debt.

**Promise.allSettled vs Promise.all** - a critical choice. `Promise.all` fails on the first error, losing the results of successful tasks. `Promise.allSettled` returns all results: both successful and failed. For production pipelines with LLM - always use `allSettled`.
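The difference can be sketched with the resume example, assuming three stubbed checks where one of them hits a rate limit. `fanOut` and `resumeTasks` are hypothetical names; only `Promise.allSettled` is a real API here.

```typescript
type Analysis =
  | { aspect: string; ok: true; value: string }
  | { aspect: string; ok: false; error: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Runs independent tasks concurrently and keeps every outcome, success or failure.
async function fanOut(tasks: Record<string, () => Promise<string>>): Promise<Analysis[]> {
  const aspects = Object.keys(tasks);
  // All tasks start immediately; allSettled waits for every one of them to finish.
  const settled = await Promise.allSettled(aspects.map((aspect) => tasks[aspect]()));
  return settled.map((result, i): Analysis =>
    result.status === "fulfilled"
      ? { aspect: aspects[i], ok: true, value: result.value }
      : { aspect: aspects[i], ok: false, error: String(result.reason) },
  );
}

// Stubbed resume checks; in a real pipeline each would be an LLM call.
const resumeTasks: Record<string, () => Promise<string>> = {
  skills: async () => { await sleep(30); return "strong TypeScript"; },
  cultureFit: async () => { await sleep(10); throw new Error("429 rate limited"); },
  redFlags: async () => { await sleep(20); return "none found"; },
};
```

With `Promise.all`, the `cultureFit` rejection would discard the other two results. Here the caller still receives `skills` and `redFlags` and can retry or skip the failed aspect.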

Advanced variant - **Fan-Out with Rate Limiting**. Launching 50 parallel LLM calls without any limit → API returns 429. Solution: concurrency control via `p-limit`.
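To show what concurrency control does, here is a hand-rolled limiter in the spirit of `p-limit`. It is a sketch of the idea, not that library's implementation; `createLimiter` and `mapLimited` are hypothetical names.

```typescript
// Returns a wrapper that lets at most `concurrency` wrapped tasks run at once.
function createLimiter(concurrency: number) {
  let active = 0;
  const waiting: Array<() => void> = [];

  return async function limit<T>(task: () => Promise<T>): Promise<T> {
    if (active >= concurrency) {
      // Queue up until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    } else {
      active++;
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // slot is transferred, so `active` stays unchanged
      } else {
        active--;
      }
    }
  };
}

// Applies `fn` to every item with bounded concurrency and keeps all outcomes.
async function mapLimited<I, O>(
  items: I[],
  concurrency: number,
  fn: (item: I) => Promise<O>,
): Promise<PromiseSettledResult<O>[]> {
  const limit = createLimiter(concurrency);
  return Promise.allSettled(items.map((item) => limit(() => fn(item))));
}
```

For example, `mapLimited(prompts, 5, callModel)` with 50 prompts keeps at most 5 requests in flight. A concurrency cap alone does not guarantee staying under a provider's requests-per-minute quota, so retry with backoff on 429 is usually still needed.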

With the fan-out pattern, 3 tasks run in parallel. Task A takes 200ms, B - 1500ms, C - 800ms. What is the total execution time?

Routing / Branching: conditional execution

Not all requests should go through the same pipeline. The question "What is 2+2?" doesn't need a RAG search across the knowledge base: that's a wasted call and wasted money. A customer complaint needs GPT-4o with an empathetic system prompt, while a FAQ query is fine with gpt-4o-mini at `USD 0.15` per 1M input tokens. **Routing** is the conditional branch operator of AI architecture.

Two approaches to routing: **LLM-based** (a model-powered classifier) and **rule-based** (deterministic logic). The power is in the combination: rules close 30-40% of requests for free in 0ms, LLM kicks in only for ambiguous cases.
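The combination can be sketched as a two-stage router: deterministic rules run first, and the model-based classifier is called only when no rule matches. The route names, the regular expressions, and the injected `llmClassify` function are illustrative assumptions, not a recommended rule set.

```typescript
type Route = "faq" | "complaint" | "technical";
type Classifier = (text: string) => Promise<Route>;
type Decision = { route: Route; via: "rule" | "llm" };

// Deterministic rules: free, instant, and checked first.
const rules: Array<{ pattern: RegExp; route: Route }> = [
  { pattern: /\b(refund|angry|unacceptable|complain)/i, route: "complaint" },
  { pattern: /\b(error|stack trace|exception|crash)/i, route: "technical" },
  { pattern: /\b(price|pricing|opening hours)/i, route: "faq" },
];

async function routeRequest(text: string, llmClassify: Classifier): Promise<Decision> {
  for (const { pattern, route } of rules) {
    if (pattern.test(text)) return { route, via: "rule" };
  }
  // No rule matched: pay for a model call only in the ambiguous case.
  return { route: await llmClassify(text), via: "llm" };
}
```

Logging the `via` field makes it possible to measure what share of traffic the rules actually absorb and to promote frequent LLM decisions into new rules.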

Conditional routing is the LangGraph state machine in its minimal form: the classifier determines the next graph state. LangGraph 0.2 (2024) makes these graphs explicit, with state persisted between steps. It is the same pattern, but with checkpoints and human-in-the-loop at branching points.

Why does the routing pattern often combine rule-based and LLM-based classification?

Map-Reduce: processing long documents

GPT-4o's context window is 128K tokens. Sounds like a lot. Apple's annual report is 200K+ tokens. A codebase at 50K lines is 500K+. A month of logs is millions. **Map-Reduce** is the only way to process data that simply doesn't fit in context. The same idea as Hadoop - except instead of MapReduce jobs, it's LLM calls.

Cursor uses this pattern to analyze large repositories: each file is summarized separately (MAP), then the summaries are combined to understand the architecture (REDUCE). Without map-reduce - a project with 1000 files is impossible to analyze in a single prompt.

Map-Reduce has several variations; the choice depends on the task:

| Variation | How it works | When to use |
|---|---|---|
| Map-Reduce | Process each chunk → final merge | Summarization, extraction from long documents |
| Map-Rerank | Process each chunk → sort by score → pick best | Finding an answer in a long text |
| Refine | Chunk 1 → answer → chunk 2 + previous answer → refinement → ... | When coherence between parts matters |
| Collapse | Recursive reduce: if summaries are too long → reduce again | Very long documents (books, codebases) |

The key decision in Map-Reduce is **which model to use for MAP and REDUCE**. MAP processes dozens of chunks, so cost savings on the model matter here (gpt-4o-mini, `USD 0.15` per 1M input tokens). REDUCE makes a single call with a critical result, so a powerful model is justified here (gpt-4o, Claude Sonnet). With 67 chunks, the cost difference for the MAP phase between gpt-4o and gpt-4o-mini is about 16x.
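A sketch of the two phases with separately injected models, so the cheap model can serve MAP and the stronger one REDUCE. Assumptions: chunk size is measured in characters as a rough stand-in for tokens, `mapModel` and `reduceModel` are caller-supplied functions, and failed chunks are skipped rather than retried.

```typescript
type Llm = (prompt: string) => Promise<string>;

// Splits text on blank lines into chunks of at most `maxChars` characters.
// A single paragraph longer than `maxChars` is kept whole as its own chunk.
function chunkText(text: string, maxChars: number): string[] {
  const chunks: string[] = [];
  let current = "";
  for (const para of text.split(/\n{2,}/)) {
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = "";
    }
    current = current ? `${current}\n\n${para}` : para;
  }
  if (current) chunks.push(current);
  return chunks;
}

async function mapReduceSummarize(
  text: string,
  mapModel: Llm,
  reduceModel: Llm,
  maxChars = 8000,
): Promise<string> {
  const chunks = chunkText(text, maxChars);

  // MAP: one call per chunk, all in parallel, on the cheap model.
  const settled = await Promise.allSettled(
    chunks.map((chunk, i) => mapModel(`Summarize part ${i + 1} of ${chunks.length}:\n${chunk}`)),
  );
  const partials: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled") partials.push(result.value);
  }
  if (partials.length === 0) throw new Error("every MAP call failed");

  // REDUCE: a single call on the stronger model merges the partial summaries.
  return reduceModel(`Merge these partial summaries into one summary:\n${partials.join("\n---\n")}`);
}
```

With many chunks, the MAP calls would normally also go through a concurrency limiter, and if the joined partial summaries exceed the context window the reduce step has to be applied recursively (the Collapse variation).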

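The **Refine** variation is an alternative to Map-Reduce when coherence matters: each chunk is processed together with the answer built so far, so the result is refined step by step instead of merged at the end.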
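A minimal sketch of Refine, assuming the document is already split into chunks and `llm` is a caller-supplied function. The prompt wording is illustrative only.

```typescript
type Llm = (prompt: string) => Promise<string>;

// Walks the chunks in order, feeding the running answer into every next call.
async function refineSummarize(chunks: string[], llm: Llm): Promise<string> {
  let answer = "";
  for (let i = 0; i < chunks.length; i++) {
    const prompt =
      i === 0
        ? `Summarize this text:\n${chunks[i]}`
        : `Current summary:\n${answer}\n\nRefine it using part ${i + 1}:\n${chunks[i]}`;
    // Sequential on purpose: each call depends on the previous answer.
    answer = await llm(prompt);
  }
  return answer;
}
```

The trade-off against Map-Reduce is latency: the calls cannot run in parallel, so total time grows linearly with the number of chunks.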

A 200K-token document needs to be summarized. The model's context window is 128K. Which pattern fits?

Fallback Chains: pipeline resilience

In March 2024, the OpenAI API went down for 4 hours. Every application hardcoded to GPT-4 stopped. Applications with fallback chains switched to Anthropic Claude within seconds - users didn't notice. Orchestration is not about how to call an LLM. It's about how to **degrade gracefully, not crash hard**.

Fallback isn't an optimization. It's a baseline requirement for production. BullMQ with retry strategies, circuit breakers, cascade across multiple providers - all parts of the same idea: the system keeps working even when one component is unavailable.
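The cascade idea can be sketched in-process as follows, assuming each provider is wrapped in a uniform `call` function. `Provider`, `callWithFallback`, and the option names are hypothetical; a durable setup would move the retries into a queue such as BullMQ rather than keep them in memory.

```typescript
type Provider = { name: string; call: (prompt: string) => Promise<string> };
type FallbackOptions = { retries: number; timeoutMs: number; backoffMs: number };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects if `promise` does not settle within `ms`. The underlying request is
// not cancelled; real code would also pass an AbortSignal to the HTTP client.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries each provider in order; each gets `retries` extra attempts with exponential backoff.
async function callWithFallback(
  providers: Provider[],
  prompt: string,
  opts: FallbackOptions = { retries: 1, timeoutMs: 10000, backoffMs: 200 },
): Promise<{ provider: string; text: string }> {
  const errors: string[] = [];
  for (const provider of providers) {
    for (let attempt = 0; attempt <= opts.retries; attempt++) {
      try {
        const text = await withTimeout(provider.call(prompt), opts.timeoutMs);
        return { provider: provider.name, text };
      } catch (err) {
        errors.push(`${provider.name} attempt ${attempt + 1}: ${String(err)}`);
        if (attempt < opts.retries) await sleep(opts.backoffMs * 2 ** attempt);
      }
    }
  }
  throw new Error(`All providers failed:\n${errors.join("\n")}`);
}
```

Returning the provider name alongside the text lets the caller log how often the cascade is actually triggered, which is the signal a circuit breaker would act on.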

Advanced strategy: the **Hedged Request**. Instead of waiting for a timeout, launch both primary and fallback in parallel and return the first response.
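A sketch of a hedged request using `Promise.any`, which resolves with the first call that succeeds. With the default `hedgeDelayMs = 0` both calls start immediately, as described above. The optional delay is an added assumption: it starts the backup only if the primary has not answered within that window, which lowers the extra spend. `hedgedRequest` and its parameters are hypothetical names.

```typescript
type LlmCall = (prompt: string) => Promise<string>;
type HedgedResult = { winner: "primary" | "backup"; text: string };

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function hedgedRequest(
  prompt: string,
  primary: LlmCall,
  backup: LlmCall,
  hedgeDelayMs = 0,
): Promise<HedgedResult> {
  let primaryDone = false;

  const first: Promise<HedgedResult> = primary(prompt).then((text): HedgedResult => {
    primaryDone = true;
    return { winner: "primary", text };
  });

  const second: Promise<HedgedResult> = (async (): Promise<HedgedResult> => {
    if (hedgeDelayMs > 0) await sleep(hedgeDelayMs);
    // Skip the backup call (and its cost) if the primary already answered.
    if (primaryDone) throw new Error("backup skipped: primary already answered");
    return { winner: "backup", text: await backup(prompt) };
  })();

  // Resolves with the first fulfilled promise; rejects only if both fail.
  return Promise.any([first, second]);
}
```

The losing request keeps running and is still billed in this sketch. Cancelling it would require passing an `AbortSignal` to the underlying HTTP client.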

Hedged requests **double the cost** in the worst case. Use them only for critical paths where latency matters more than cost: real-time chatbots, trading signals, live customer support.

Summary table of orchestration patterns and when to use them:

| Pattern | Latency | Cost | Use Case |
|---|---|---|---|
| Sequential | Sum(steps) | Sum(calls) | Dependent steps, processing pipelines |
| Parallel | Max(steps) | Sum(calls) | Independent tasks, multi-aspect analysis |
| Routing | Classify + 1 branch | Classify + 1 call | Different request types, cost optimization |
| Map-Reduce | Map (parallel) + Reduce | N × map + 1 reduce | Documents exceeding the context window |
| Fallback | Primary + retry/cascade | 1-3 calls | Resilience, multi-provider |
| Hedged | Min(providers) | 1-2 calls | Minimal latency for critical paths |

Myth: Orchestration is just sequential LLM calls

Reality: Orchestration is state management, error handling, retry strategies, and partial failure in a distributed system where each node is an LLM call

Calling three LLMs in a row is a script. Orchestration starts when the system has to decide: what happens if step 2 of 5 fails? Save the intermediate result? Retry on a different model? Skip the step? Notify the user of degradation? LangGraph stores explicit state between steps precisely because without state management, agents lose context and loop. BullMQ for async orchestration provides retry, priority queues, and a dead letter queue: everything that bare Promise.all doesn't have. The difference between a script and orchestration is the answer to one question: what happens when something goes wrong?

A production AI chatbot serves 10K users. The OpenAI API occasionally responds in 5+ seconds. Which fallback pattern will reduce P99 latency?

Summary

Sequential chain: a conveyor where each step depends on the previous one. A failure at any step stops the whole chain, which is its main vulnerability
Parallel (fan-out): independent tasks in parallel. Latency = max(tasks), not sum(tasks). Without concurrency control → rate limit 429
Routing: rule-based closes 30-40% of requests for free in 0ms, LLM-based handles the rest. The combination is the gold standard
Map-Reduce: the only way to process a document larger than the context window. MAP in parallel on gpt-4o-mini, REDUCE on gpt-4o
Fallback: cascade + retry + hedged requests. Without fallback, a production incident is only a matter of time
Patterns combine: routing → sequential with parallel steps → fallback on every LLM call. LangGraph makes this an explicit graph

Questions for reflection

If step 4 of 6 in a sequential pipeline periodically fails with 503 from OpenAI - how should the pipeline be redesigned so it doesn't restart from step 1 every time? Hint: idempotency and checkpoint.
Cursor analyzes a repository of 2000 files via map-reduce. Which approach would you choose for the MAP phase when files vary wildly in size, from 10 lines to 2000? How would you avoid wasting money on micro-files?
A hedged request saves P99 latency at the cost of double token spend in 20% of cases. At what ratio of token cost to bad UX cost does it stop being worth it?

What's next

In this lesson, routing directed requests to different pipelines. Routing also works at the model level: simple requests go to a cheap model and complex ones to a top-tier one. That's Model Routing, the next topic.

Model Routing — Automatic model selection (GPT-4o vs Claude vs local) based on complexity, cost, latency
Error Handling in LLM — Retry strategies, circuit breaker, graceful degradation for AI pipelines
Cost Management — Optimizing spend: routing, caching, prompt compression

Related lessons

aie-20-langchain-llamaindex — Frameworks provide primitives these patterns use
aie-22-model-routing — Routing is one orchestration pattern in depth
aie-32-error-handling-llm — Fallback and retry are reliability patterns
aie-29-cost-management — Orchestration choices directly drive cost
alg-19-divide-conquer — Map-reduce over LLM calls is divide and conquer
sd-09-message-queue — Parallel fan-out mirrors queue-based work distribution
net-55-message-queues