AI Engineering
Orchestration Patterns: routing, fallback, chain, map-reduce, branching
Lesson objectives
- Master the sequential chain - a processing conveyor that passes context between steps
- Implement a parallel (fan-out) pipeline with concurrency control and rate limiting
- Build a routing/branching system combining rule-based and LLM-based classification
- Apply map-reduce for processing documents that exceed the context window
- Implement fallback chains with retry, timeout, cascade, and hedged requests
A pipeline is a conveyor. Each step receives the output of the previous one. If step 3 of 7 fails - the entire chain stops. That's how most AI systems work in production right now. Orchestration is about degrading gracefully instead of crashing: fan-out instead of sequential calls, map-reduce for 200K-token documents, hedged requests for P99 latency, BullMQ so async tasks don't vanish into the void.
- Stripe AI - cascade fallback across 3 LLM providers, 99.99% availability - every OpenAI outage is invisible to customers
- Notion AI - parallel fan-out: analyzes a document simultaneously for structure, tone, and key ideas, saving 60% of time
- Linear AI - conditional routing: bug report → technical model with RAG over docs, feature request → product model without RAG
- Cursor (IDE) - map-reduce for repository analysis: each file summarized separately, then summaries merged for 500K+ token architecture understanding
- LangGraph 0.2 (2024) - state machine for agent orchestration: explicit graph instead of magic chains, checkpoint for human-in-the-loop
Prerequisites
Anthropic codifies orchestration patterns
For the first two years of the LLM-app era, orchestration patterns were scattered across blog posts, talks, and framework code with no shared vocabulary. On December 19, 2024, Anthropic published Building Effective Agents, and that piece pulled the practice into a common language. The main distinction is between workflows, where several LLM calls are connected along predefined paths, and agents, where the model decides which steps to take on its own. The document named the building blocks: prompt chaining (a sequential chain), routing (classify a request and send it down the right branch), parallelization (run calls in parallel and aggregate), orchestrator-workers, and evaluator-optimizer. Anthropic's overall advice is to start with the simplest solution and add complexity only when it actually pays off.
Sequential Chain: a processing conveyor
A pipeline is a conveyor. Each step receives the output of the previous one. If step 3 of 7 fails - the entire chain stops. Sequential chain is the simplest and most common pattern, and also the most common culprit when an AI feature goes dark at 3 AM.
Where sequential chain runs in production right now:
- **Content moderation pipeline** - classification → filtering → response generation (order is critical: no point spending `USD 0.01` on a response to a toxic request)
- **Translation with quality check** - translate → assess quality → correct
- **Code review bot** - parse diff → analyze against rules → generate comments
- **Customer support** - classify ticket → extract entities → generate response
Sequential chain is the only pattern where **step order is critical**. Moderation must precede generation (no point spending money on a response to a toxic request). Classification must precede generation (the right system prompt is needed). If steps are independent - that's a signal to use the parallel pattern.
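The customer support chain above (classify → extract entities → generate response) can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: `TicketContext`, `runChain`, and the stub steps are hypothetical names, and the stubs stand in for real LLM calls.

```typescript
// Context accumulated as a ticket moves through the chain (hypothetical shape).
interface TicketContext {
  text: string;
  category?: string;
  entities?: string[];
  reply?: string;
}

interface Step<C> {
  name: string;
  run: (ctx: C) => Promise<C>;
}

type ChainResult<C> =
  | { ok: true; ctx: C }
  | { ok: false; ctx: C; failedStep: string; error: unknown };

// Runs steps strictly in order; each step receives the output of the previous one.
// On the first failure the chain stops and returns the last good context.
async function runChain<C>(steps: Step<C>[], initial: C): Promise<ChainResult<C>> {
  let ctx = initial;
  for (const step of steps) {
    try {
      ctx = await step.run(ctx);
    } catch (error) {
      return { ok: false, ctx, failedStep: step.name, error };
    }
  }
  return { ok: true, ctx };
}

// Stub steps; in a real pipeline each `run` would call a model.
const supportChain: Step<TicketContext>[] = [
  {
    name: "classify",
    run: async (c) => ({ ...c, category: c.text.includes("refund") ? "billing" : "general" }),
  },
  {
    name: "extract",
    run: async (c) => ({ ...c, entities: c.text.match(/#\d+/g) ?? [] }),
  },
  {
    name: "respond",
    run: async (c) => ({
      ...c,
      reply: `[${c.category}] We are looking into ${(c.entities ?? []).join(", ")}`,
    }),
  },
];
```

Returning the last good context together with the name of the failed step, rather than only throwing, leaves the caller room to retry or resume from that step.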
In a 4-step sequential chain, the third step returns an error. What happens?
Parallel / Fan-Out: concurrent processing
LLM calls are slow - from 500ms to 30s per request. GPT-4o TTFT averages around 800ms, Claude Sonnet around 1s. If two pipeline steps are independent and run sequentially - that's not architecture, it's wasted time. **Fan-out** launches multiple operations simultaneously and collects results at fan-in.
Classic example: analyzing a candidate's resume. Evaluating skills, culture fit, and red flags - three independent LLM calls. Running them sequentially (3.6s) when they can go parallel (1.5s) is architectural debt.
**Promise.allSettled vs Promise.all** - a critical choice. `Promise.all` rejects as soon as any task rejects, so the results of the tasks that succeeded are lost. `Promise.allSettled` waits for every task and returns all outcomes, both successful and failed. For production pipelines with LLM calls - always use `allSettled`.
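A minimal sketch of fan-out with `Promise.allSettled`, assuming each aspect of the analysis is wrapped in a function that returns a promise. `fanOut` and `FanOutResult` are hypothetical names used only for illustration.

```typescript
interface FanOutResult<T> {
  succeeded: Record<string, T>;
  failed: Record<string, string>;
}

// Starts every task at once and separates successes from failures at fan-in.
async function fanOut<T>(tasks: Record<string, () => Promise<T>>): Promise<FanOutResult<T>> {
  const entries = Object.entries(tasks);
  // All tasks are started before anything is awaited, so they run concurrently.
  const settled = await Promise.allSettled(entries.map(([, task]) => task()));

  const result: FanOutResult<T> = { succeeded: {}, failed: {} };
  settled.forEach((outcome, i) => {
    const name = entries[i][0];
    if (outcome.status === "fulfilled") {
      result.succeeded[name] = outcome.value;
    } else {
      // One failed aspect does not discard the others.
      result.failed[name] = String(outcome.reason);
    }
  });
  return result;
}
```

For the resume example, the three aspects would be passed as `{ skills, cultureFit, redFlags }`; if one call fails, the other two results are still available and the caller can decide whether a partial report is acceptable.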
Advanced variant - **Fan-Out with Rate Limiting**. Launching 50 parallel LLM calls without any limit → API returns 429. Solution: concurrency control via `p-limit`.
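The idea behind concurrency control can be sketched without a dependency. The limiter below is a hand-rolled stand-in written in the spirit of `p-limit`, not that library's actual code; `createLimiter` is a hypothetical name.

```typescript
// Returns a `limit` function that lets at most `concurrency` tasks run at once.
function createLimiter(concurrency: number) {
  let active = 0;
  const queue: Array<() => void> = [];

  // Starts the next queued task if a slot is free.
  const next = (): void => {
    if (active >= concurrency) return;
    const start = queue.shift();
    if (start) start();
  };

  return function limit<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = (): void => {
        active++;
        // Promise.resolve().then(task) also captures synchronous throws from `task`.
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .finally(() => {
            active--;
            next();
          });
      };
      queue.push(run);
      next();
    });
  };
}
```

With `const limit = createLimiter(5)`, fifty calls wrapped as `limit(() => callModel(prompt))` (where `callModel` is whatever client function the pipeline uses) would be started five at a time instead of all at once.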
With the fan-out pattern, 3 tasks run in parallel. Task A takes 200ms, B - 1500ms, C - 800ms. What is the total execution time?
Routing / Branching: conditional execution
Not all requests should go through the same pipeline. The question "What is 2+2?" doesn't need a RAG search across the knowledge base - that's a wasted call and wasted money. A customer complaint needs GPT-4o with an empathetic system prompt, while a FAQ query is fine with gpt-4o-mini at `USD 0.15/1M`. **Routing** is the conditional branch operator of AI architecture.
Two approaches to routing: **LLM-based** (a model-powered classifier) and **rule-based** (deterministic logic). The power is in the combination: rules close 30-40% of requests for free in 0ms, LLM kicks in only for ambiguous cases.
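A rough sketch of the combined approach: deterministic rules run first, and the LLM classifier is called only when no rule matches. The route names, the regular expressions, and the `classifyWithLlm` callback are all illustrative assumptions.

```typescript
type Route = "faq" | "complaint" | "bug" | "general";

interface RoutingDecision {
  route: Route;
  decidedBy: "rule" | "llm" | "default";
}

// Hypothetical rules: cheap, deterministic, checked before any model call.
const rules: Array<{ pattern: RegExp; route: Route }> = [
  { pattern: /\b(pricing|opening hours|reset password)\b/i, route: "faq" },
  { pattern: /\b(stack trace|exception|crash)\b/i, route: "bug" },
];

// Stand-in signature for a model-powered classifier.
type LlmClassifier = (text: string) => Promise<Route>;

async function routeRequest(text: string, classifyWithLlm: LlmClassifier): Promise<RoutingDecision> {
  // 1. Rule-based pass: no tokens spent, no network round trip.
  for (const rule of rules) {
    if (rule.pattern.test(text)) return { route: rule.route, decidedBy: "rule" };
  }
  // 2. LLM pass: only for requests the rules could not place.
  try {
    return { route: await classifyWithLlm(text), decidedBy: "llm" };
  } catch {
    // 3. If the classifier is unavailable, take the default branch instead of failing.
    return { route: "general", decidedBy: "default" };
  }
}
```

Recording `decidedBy` makes it easy to measure what share of traffic the rules actually absorb.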
Conditional routing is the LangGraph state machine in its minimal form: the classifier determines the next graph state. LangGraph 0.2 (2024) makes these graphs explicit and persists state between steps - the same pattern, but with checkpoints and human-in-the-loop at branching points.
Why does the routing pattern often combine rule-based and LLM-based classification?
Map-Reduce: processing long documents
GPT-4o's context window is 128K tokens. Sounds like a lot. Apple's annual report is 200K+ tokens. A codebase at 50K lines is 500K+. A month of logs is millions. **Map-Reduce** is the standard way to process data that simply doesn't fit in context. It is the same idea as Hadoop, except that LLM calls take the place of MapReduce jobs.
Cursor uses this pattern to analyze large repositories: each file is summarized separately (MAP), then the summaries are combined to understand the architecture (REDUCE). Without map-reduce - a project with 1000 files is impossible to analyze in a single prompt.
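The MAP and REDUCE phases can be sketched as below. This is an illustration of the general pattern, not a description of how Cursor implements it; `chunkText`, `mapReduce`, and the `Summarize` callback are hypothetical, and character counts are used as a crude proxy for tokens.

```typescript
// Stand-in signature for a model call that summarizes a piece of text.
type Summarize = (text: string) => Promise<string>;

interface MapReduceResult {
  summary: string;
  chunks: number;
  failedChunks: number;
}

// Splits text into pieces of at most `maxChars` characters.
function chunkText(text: string, maxChars: number): string[] {
  if (maxChars < 1) throw new Error("maxChars must be at least 1");
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

async function mapReduce(
  text: string,
  maxChars: number,
  mapModel: Summarize,
  reduceModel: Summarize,
): Promise<MapReduceResult> {
  const chunks = chunkText(text, maxChars);

  // MAP: chunks are independent, so they are summarized concurrently.
  const settled = await Promise.allSettled(chunks.map((chunk) => mapModel(chunk)));
  const partials: string[] = [];
  for (const outcome of settled) {
    if (outcome.status === "fulfilled") partials.push(outcome.value);
  }
  if (partials.length === 0) throw new Error("MAP phase produced no summaries");

  // REDUCE: a single call merges the partial summaries.
  const summary = await reduceModel(partials.join("\n"));
  return { summary, chunks: chunks.length, failedChunks: chunks.length - partials.length };
}
```

Reporting `failedChunks` matters because a summary built from a subset of chunks looks complete unless the gap is surfaced explicitly.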
Map-Reduce has several variations; the choice depends on the task:
| Variation | How it works | When to use |
|---|---|---|
| Map-Reduce | Process each chunk → final merge | Summarization, extraction from long documents |
| Map-Rerank | Process each chunk → sort by score → pick best | Finding an answer in a long text |
| Refine | Chunk 1 → answer → chunk 2 + previous answer → refinement → ... | When coherence between parts matters |
| Collapse | Recursive reduce: if summaries are too long → reduce again | Very long documents (books, codebases) |
The key decision in Map-Reduce is **which model to use for MAP and REDUCE**. MAP processes dozens of chunks - cost savings on the model matter here (gpt-4o-mini, `USD 0.15/1M`). REDUCE makes a single call with a critical result - a powerful model is justified here (gpt-4o, Claude Sonnet). With 67 chunks, the MAP phase costs roughly 16x more on gpt-4o than on gpt-4o-mini.
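The arithmetic behind the MAP-phase comparison can be worked through as follows. Only the gpt-4o-mini input price comes from the text; the gpt-4o input price of `USD 2.50/1M` and the chunk size of 3,000 tokens are assumptions chosen because they reproduce the figures quoted here, and output tokens are ignored. Check the current price list before relying on these numbers.

```typescript
// Input prices in USD per 1M tokens.
const INPUT_PRICE_USD_PER_MILLION = {
  "gpt-4o-mini": 0.15, // from the text
  "gpt-4o": 2.5, // assumption, not stated in the text
};

// Cost of sending every chunk through a model once (input tokens only).
function mapPhaseCostUsd(chunks: number, tokensPerChunk: number, pricePerMillionUsd: number): number {
  return ((chunks * tokensPerChunk) / 1e6) * pricePerMillionUsd;
}

const CHUNKS = 67;
const TOKENS_PER_CHUNK = 3000; // assumption: roughly 200K tokens split into 67 chunks

// 67 * 3000 = 201,000 input tokens in total.
const miniCost = mapPhaseCostUsd(CHUNKS, TOKENS_PER_CHUNK, INPUT_PRICE_USD_PER_MILLION["gpt-4o-mini"]); // about USD 0.03
const largeCost = mapPhaseCostUsd(CHUNKS, TOKENS_PER_CHUNK, INPUT_PRICE_USD_PER_MILLION["gpt-4o"]); // about USD 0.50
const costRatio = largeCost / miniCost; // about 16.7
```

The ratio depends only on the two prices, not on the number of chunks; the chunk count determines the absolute amounts.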
The **Refine** variation is an alternative to Map-Reduce when coherence matters:
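A minimal sketch of the Refine loop, assuming two hypothetical model callbacks: `firstPass` produces an initial answer from the first chunk, and `refineStep` updates the running answer with each subsequent chunk.

```typescript
type FirstPass = (chunk: string) => Promise<string>;
type RefineStep = (currentAnswer: string, nextChunk: string) => Promise<string>;

// Processes chunks strictly in order, carrying the running answer forward.
async function refine(chunks: string[], firstPass: FirstPass, refineStep: RefineStep): Promise<string> {
  if (chunks.length === 0) return "";

  // Chunk 1 → initial answer.
  let answer = await firstPass(chunks[0]);

  // Chunk N + previous answer → refined answer.
  for (const chunk of chunks.slice(1)) {
    answer = await refineStep(answer, chunk);
  }
  return answer;
}
```

The trade-off against Map-Reduce is visible in the loop: every call depends on the previous one, so the chunks cannot be processed in parallel and latency grows with the number of chunks.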
A 200K-token document needs to be summarized. The model's context window is 128K. Which pattern fits?
Fallback Chains: pipeline resilience
In March 2024, the OpenAI API went down for 4 hours. Every application hardcoded to GPT-4 stopped. Applications with fallback chains switched to Anthropic Claude within seconds - users didn't notice. Orchestration is not about how to call an LLM. It's about how to **degrade gracefully, not crash hard**.
Fallback isn't an optimization. It's a baseline requirement for production. BullMQ with retry strategies, circuit breakers, cascade across multiple providers - all parts of the same idea: the system keeps working even when one component is unavailable.
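Timeout, retry, and cascade can be combined in one small function. The sketch below is illustrative only: `Provider`, `callWithFallback`, and the option names are hypothetical, and the timeout merely stops waiting for a response (cancelling the underlying request would need something like an `AbortController` wired into the client).

```typescript
interface Provider {
  name: string;
  call: (prompt: string) => Promise<string>;
}

interface FallbackOptions {
  retries: number; // extra attempts per provider
  timeoutMs: number; // per-attempt timeout
  backoffMs: number; // base delay, doubled on each retry
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timeout after ${ms}ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}

// Tries providers in order; each one gets `retries + 1` attempts before the cascade moves on.
async function callWithFallback(
  prompt: string,
  providers: Provider[],
  opts: FallbackOptions,
): Promise<{ provider: string; text: string; attempts: number }> {
  const errors: string[] = [];
  let attempts = 0;
  for (const provider of providers) {
    for (let attempt = 0; attempt <= opts.retries; attempt++) {
      attempts++;
      try {
        const text = await withTimeout(provider.call(prompt), opts.timeoutMs);
        return { provider: provider.name, text, attempts };
      } catch (error) {
        errors.push(`${provider.name}#${attempt + 1}: ${String(error)}`);
        // Exponential backoff before retrying the same provider.
        if (attempt < opts.retries) await sleep(opts.backoffMs * 2 ** attempt);
      }
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```

A circuit breaker would sit one level above this: after repeated failures it would skip a provider entirely for a cooling-off period instead of retrying it on every request.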
Advanced strategy - the **Hedged Request**. Instead of waiting for a timeout, launch both primary and fallback in parallel and return the first response:
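One way to sketch this, assuming both providers are exposed as functions from a prompt to a promise of text. `firstSuccess` and `hedgedRequest` are hypothetical names; `firstSuccess` is written out by hand here, and the built-in `Promise.any` offers comparable behavior in runtimes that support it.

```typescript
type ModelCall = (prompt: string) => Promise<string>;

// Resolves with the first promise to fulfill; rejects only if every promise rejects.
function firstSuccess<T>(promises: Promise<T>[]): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    let failures = 0;
    promises.forEach((promise) => {
      promise.then(resolve, () => {
        failures++;
        if (failures === promises.length) {
          reject(new Error(`all ${failures} requests failed`));
        }
      });
    });
  });
}

// Launches primary and fallback at the same time and returns whichever answers first.
function hedgedRequest(
  prompt: string,
  primary: ModelCall,
  fallback: ModelCall,
): Promise<{ winner: "primary" | "fallback"; text: string }> {
  return firstSuccess([
    primary(prompt).then((text) => ({ winner: "primary" as const, text })),
    fallback(prompt).then((text) => ({ winner: "fallback" as const, text })),
  ]);
}
```

In this sketch the slower request keeps running after the winner returns, so its tokens are still billed; cancelling it would require abort support in the client.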
Hedged requests **double the cost** in the worst case. Use them only for critical paths where latency matters more than cost: real-time chatbots, trading signals, live customer support.
Summary table of orchestration patterns and when to use them:
| Pattern | Latency | Cost | Use Case |
|---|---|---|---|
| Sequential | Sum(steps) | Sum(calls) | Dependent steps, processing pipelines |
| Parallel | Max(steps) | Sum(calls) | Independent tasks, multi-aspect analysis |
| Routing | Classify + 1 branch | Classify + 1 call | Different request types, cost optimization |
| Map-Reduce | Map (parallel) + Reduce | N × map + 1 reduce | Documents exceeding the context window |
| Fallback | Primary + retry/cascade | 1-3 calls | Resilience, multi-provider |
| Hedged | Min(providers) | 1-2 calls | Minimal latency for critical paths |
**Myth:** Orchestration is just sequential LLM calls
**Reality:** Orchestration is state management, error handling, retry strategies, and partial failure in a distributed system where each node is an LLM call
Calling three LLMs in a row is a script. Orchestration starts when the system has to decide: what happens if step 2 of 5 fails? Save the intermediate result? Retry on a different model? Skip the step? Notify the user of degradation? LangGraph stores explicit state between steps precisely because without state management, agents lose context and loop. BullMQ for async orchestration provides retry, priority queues, and dead letter queue - everything that bare Promise.all doesn't have.
A production AI chatbot serves 10K users. The OpenAI API occasionally responds in 5+ seconds. Which fallback pattern will reduce P99 latency?
Summary
- Sequential chain: a conveyor where each step depends on the previous one. A failure at any step stops the whole chain - that is its main vulnerability
- Parallel (fan-out): independent tasks in parallel. Latency = max(tasks), not sum(tasks). Without concurrency control → rate limit 429
- Routing: rule-based closes 30-40% of requests for free in 0ms, LLM-based handles the rest. The combination is the gold standard
- Map-Reduce: the standard way to process a document larger than the context window. MAP in parallel on gpt-4o-mini, REDUCE on gpt-4o
- Fallback: cascade + retry + hedged requests. In production without fallback, an incident is only a matter of time
- Patterns combine: routing → sequential with parallel steps → fallback on every LLM call. LangGraph makes this an explicit graph
Questions for reflection
- If step 4 of 6 in a sequential pipeline periodically fails with 503 from OpenAI - how should the pipeline be redesigned so it doesn't restart from step 1 every time? Hint: idempotency and checkpoint.
- Cursor analyzes a repository of 2000 files via map-reduce. Which pattern should the MAP phase use when files vary wildly in size - from 10 lines to 2000? How can it avoid wasting money on micro-files?
- A hedged request saves P99 latency at the cost of double token spend in 20% of cases. At what ratio of token cost to bad UX cost does it stop being worth it?
What's next
In this lesson, routing directed requests to different pipelines. Routing also works at the model level - sending simple requests to a cheap model and complex ones to a top-tier one. That's Model Routing - the next topic.
- Model Routing — Automatic model selection (GPT-4o vs Claude vs local) based on complexity, cost, latency
- Error Handling in LLM — Retry strategies, circuit breaker, graceful degradation for AI pipelines
- Cost Management — Optimizing spend: routing, caching, prompt compression
Related lessons
- aie-20-langchain-llamaindex — Frameworks provide primitives these patterns use
- aie-22-model-routing — Routing is one orchestration pattern in depth
- aie-32-error-handling-llm — Fallback and retry are reliability patterns
- aie-29-cost-management — Orchestration choices directly drive cost
- alg-19-divide-conquer — Map-reduce over LLM calls is divide and conquer
- sd-09-message-queue — Parallel fan-out mirrors queue-based work distribution
- net-55-message-queues