AI Engineering
Multi-Agent Systems: Orchestration, Communication, Agent Specialization
Lesson Objectives
- Understand when a multi-agent system is justified and when a single agent is sufficient
- Master communication patterns: supervisor, peer-to-peer, hierarchical
- Design specialized agents: researcher, coder, reviewer with different models
- Implement orchestration via pipeline and dynamic routing (supervisor)
- Optimize production metrics: cost, latency, debugging
Prerequisites
- Agent frameworks: LangGraph, Vercel AI SDK
- Agent fundamentals: ReAct, planning, memory
From Generative Agents to AutoGen and CrewAI
The idea that several LLM agents can work together took shape in 2023. In April 2023 Joon Sung Park and colleagues at Stanford and Google published Generative Agents: Interactive Simulacra of Human Behavior: 25 agents lived in a small virtual town, planned their days, talked, remembered events, and formed relationships. The work showed that believable behavior emerges from memory, reflection, and planning, and it was presented at the UIST conference in the fall of 2023. In August 2023 Microsoft Research released AutoGen (arXiv paper 2308.08155), a framework where a task is solved through a conversation among several agents, with hooks for a human in the loop and for code execution. The same year João Moura built CrewAI with a focus on roles: researcher, writer, editor. The shared theme across all three projects is the orchestration of specialized agents instead of one do-it-all agent.
In production, the case for multi-agent design rests on specialization and cost, not on complexity for its own sake. When a pipeline outperforms a single agent that has identical tools, the gap is in architecture, not in model intelligence. Reported examples of the pattern:
- GitHub Copilot Workspace - pipeline of planner, coder, and reviewer: 40% acceptance rate on real issues vs 15% for a single agent with identical tools
- ChatGPT Deep Research - orchestrator-worker pattern: researchers search 50+ sources in parallel, synthesizer assembles the report
- Cursor Composer - shared state via codebase context: one agent reads files, another edits, a third runs tests
- Salesforce Agentforce - supervisor pattern in enterprise: sales agent, support agent, analytics agent on shared CRM data via message passing
Why Multiple Agents: The Limits of a Single Agent
Consider one task priced two ways. A single agent on GPT-4o costs `USD 0.80` per task. Five specialized agents on gpt-4o-mini cost `USD 0.12` for the same task and, because they run in parallel, finish three times faster. Multi-agent design is not about complexity; it is about specialization and cost.
Microsoft Research (2024) verified this empirically: with 20+ tools, single-agent accuracy drops to 64%. Split the same tools across 3 specialized agents with 7 tools each, and accuracy rises to 89%. The gain does not come from three heads being smarter than one. It comes from each agent getting a clean context and a minimal tool surface to navigate.
| Single agent problem | How multi-agent solves it |
|---|---|
| Context window overflows during long tasks | Each agent has its own context - focused on its subtask |
| 20+ tools - the model gets confused choosing | Each agent has 3-7 tools - specialization |
| One perspective - bias in the solution | Different agents - different "opinions" - debate/consensus |
| An error in one step breaks everything | A reviewer agent checks the work of others |
| Cannot use different models for different tasks | Researcher on GPT-4o (accuracy), Writer on Claude (style) |
A multi-agent system is not "just several agents." It is an **interaction architecture**: who communicates with whom, who makes decisions, how data flows between agents, and how errors propagate through the chain. CrewAI, AutoGen, and LangGraph each implement this architecture differently. Without a clear design the result is chaos: agents duplicate work, loop, and contradict each other.
Multi-agent is not always better. Each additional agent means additional API calls (cost), added latency, and more failure points. For the task "answer a customer question," a single agent with 5 tools is a better fit than 3 agents passing context to each other.
Why is splitting into 3 specialized agents with 7 tools each more effective than one agent with 21 tools?
Communication Patterns: Supervisor, Peer-to-Peer, Hierarchical
How agents communicate defines the behavior of the entire system. There are three main patterns. **Supervisor**: one central LLM router delegates to workers; this is the standard structure in production LangGraph pipelines. **Peer-to-peer**: agents transfer control to one another via handoff, as in OpenAI Swarm and the Vercel AI SDK. **Hierarchical**: a tree of managers and sub-teams, suited to enterprise systems with 10+ agents.
| Pattern | Pros | Cons | When to use |
|---|---|---|---|
| Supervisor | Centralized control, easy to track the flow | Supervisor is a bottleneck and single point of failure | Clear task hierarchy, 3-5 agents |
| Peer-to-Peer | Flexibility, no bottleneck, agents adapt | Harder to track, risk of looping | Dynamic tasks, chatbot routing |
| Hierarchical | Scales to large teams | Complex implementation, high cost | 10+ agents, enterprise pipelines |
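The supervisor pattern from the table can be sketched as a loop in which a router picks the next worker until it decides the task is finished. This is a minimal sketch, not LangGraph's API: `TaskState`, `supervisor_decide`, and the three worker functions are hypothetical stand-ins, and the router is a fixed rule where a real system would call an LLM.

```python
from dataclasses import dataclass, field


@dataclass
class TaskState:
    """Shared record the supervisor inspects between worker calls."""
    request: str
    notes: list[str] = field(default_factory=list)
    done: set[str] = field(default_factory=set)


# Hypothetical workers: plain functions standing in for LLM-backed agents.
def researcher(state: TaskState) -> None:
    state.notes.append(f"research notes for: {state.request}")
    state.done.add("researcher")


def coder(state: TaskState) -> None:
    state.notes.append("code written from research notes")
    state.done.add("coder")


def reviewer(state: TaskState) -> None:
    state.notes.append("review passed")
    state.done.add("reviewer")


WORKERS = {"researcher": researcher, "coder": coder, "reviewer": reviewer}


def supervisor_decide(state: TaskState) -> str:
    """Stand-in for the LLM router: name the next worker or 'FINISH'."""
    for name in ("researcher", "coder", "reviewer"):
        if name not in state.done:
            return name
    return "FINISH"


def run_supervisor(request: str, max_steps: int = 10) -> tuple[TaskState, list[str]]:
    """Control returns to the supervisor after every worker call."""
    state = TaskState(request)
    trace: list[str] = []
    for _ in range(max_steps):  # hard cap guards against routing loops
        decision = supervisor_decide(state)
        trace.append(decision)
        if decision == "FINISH":
            break
        WORKERS[decision](state)
    return state, trace


state, trace = run_supervisor("compare handoff protocols")
print(trace)
```

Workers never call each other directly, which is what makes the flow easy to track and also what makes the supervisor the bottleneck listed in the table.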
In the supervisor pattern, after the researcher finishes work, what happens next?
Specialized Agents: Researcher, Coder, Reviewer
A surgical team works not because every surgeon is brilliant, but because the anesthesiologist never picks up the scalpel and the surgeon never monitors blood pressure. Division of responsibility is the source of effectiveness. The same holds for agents: each gets its own system prompt, its own tools, and its own model.
The Researcher runs on GPT-4o (`USD 2.50/1M` input tokens) because it needs strong reasoning to analyze contradictory sources. Routing and summarization run on gpt-4o-mini (`USD 0.15/1M`) because they follow a narrow, checklist-style instruction where powerful reasoning is unnecessary. This is **model routing by strengths**: a price difference of more than 16x while maintaining quality at critical stages.
| Agent role | Model | Tools | System prompt focus |
|---|---|---|---|
| Researcher | GPT-4o (strong reasoning) | webSearch, readUrl, arxivSearch | Accuracy, sources, structure |
| Coder | Claude Sonnet (strong code) | writeFile, runTests, readFile | Typing, production-ready, tests |
| Reviewer | GPT-4o | lintCode, securityScan | Security, performance, edge cases |
| Writer | Claude Sonnet (good style) | no tools (text only) | Clarity, structure, tone of voice |
| Planner | GPT-4o / o1 | no tools (reasoning only) | Decomposition, prioritization, dependencies |
Using different models for different agents is a powerful optimization. Researcher and Reviewer can stay on GPT-4o (`USD 2.50/1M` input tokens), while simple tasks such as routing or summarization move to GPT-4o-mini (`USD 0.15/1M`), which is more than 16x cheaper per token without lowering quality at the critical stages.
The ten-tool rule: if an agent needs more than 10 tools, that is a signal the agent is doing too much. Split it into two specialized ones. The optimal zone is 3-7 tools per agent.
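The role table and the tool-count guideline can be expressed as data plus a small check. This is a sketch under assumptions: `AgentSpec` and `tool_surface_verdict` are hypothetical names, the model strings are the labels from the table rather than real API identifiers, and the "watch" band for 8-10 tools is an interpretation of the gap between the 3-7 zone and the 10-tool limit.

```python
from dataclasses import dataclass

SPLIT_ABOVE = 10   # more tools than this: split the agent
OPTIMAL_MAX = 7    # upper edge of the 3-7 tool zone from the text


@dataclass(frozen=True)
class AgentSpec:
    name: str
    model: str                 # label from the table, not an API model id
    tools: tuple[str, ...]
    prompt_focus: str


SPECS = {
    "researcher": AgentSpec(
        "researcher", "GPT-4o",
        ("webSearch", "readUrl", "arxivSearch"),
        "Accuracy, sources, structure",
    ),
    "coder": AgentSpec(
        "coder", "Claude Sonnet",
        ("writeFile", "runTests", "readFile"),
        "Typing, production-ready, tests",
    ),
    "writer": AgentSpec(
        "writer", "Claude Sonnet", (), "Clarity, structure, tone of voice",
    ),
}


def tool_surface_verdict(spec: AgentSpec) -> str:
    """Classify an agent by how many tools it has to choose between."""
    count = len(spec.tools)
    if count > SPLIT_ABOVE:
        return "split"   # doing too much: divide into two specialists
    if count > OPTIMAL_MAX:
        return "watch"   # still allowed, but outside the optimal zone
    return "ok"


for spec in SPECS.values():
    print(spec.name, spec.model, tool_surface_verdict(spec))
```

Keeping the specs as plain data makes it cheap to run this kind of check in CI before an overloaded agent reaches production.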
For the coder agent in a multi-agent system, the model Claude Sonnet is chosen, and for the researcher agent - GPT-4o. What is the reason for this choice?
Orchestration: Pipeline, Handoff Protocol, Shared State
Agents are defined. Now the question shifts to a different level: how data flows between them, how shared state (for example, in Redis) synchronizes parallel workers, and how an error in one agent propagates through the entire system. There are two main approaches: a **pipeline** (a sequential conveyor with a fixed order) and **dynamic routing**, where a supervisor decides the next step on the fly.
| Aspect | Pipeline | Dynamic Routing (Supervisor) |
|---|---|---|
| Agent order | Fixed: A → B → C | Dynamic: supervisor decides |
| Flexibility | Low - always the same sequence | High - can skip steps, go back |
| Predictability | High - always know what comes next | Medium - depends on supervisor decisions |
| Cost | Fixed (N agent calls) | Variable (supervisor + N agent calls) |
| Complexity | Simple (linear) | Medium (graph with conditions) |
| When to use | Clear workflow: research → writing → editing | Uncertain tasks with branching |
Start with a pipeline - it's simpler, more predictable, and cheaper. Switch to the supervisor pattern only when the pipeline can't handle it: tasks require branching, skipping steps, or dynamic agent ordering.
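A fixed pipeline with an explicit handoff contract might look like the sketch below. It assumes an in-memory dict as the shared state (a production system could back it with Redis, as mentioned earlier), and `Stage`, `needs`, and `produces` are hypothetical names chosen for illustration. The stage bodies are string stubs in place of model calls.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Stage:
    name: str
    needs: tuple[str, ...]                   # keys that must exist before the stage runs
    produces: str                            # key the stage writes to shared state
    run: Callable[[dict[str, str]], str]


RESEARCH = Stage("research", ("topic",), "research",
                 lambda s: f"facts about {s['topic']}")
WRITE = Stage("writing", ("research",), "draft",
              lambda s: f"draft based on {s['research']}")
EDIT = Stage("editing", ("draft",), "final",
             lambda s: s["draft"].capitalize())


def run_pipeline(topic: str, stages: list[Stage]) -> dict[str, str]:
    """Run stages in a fixed order, validating each handoff first."""
    state = {"topic": topic}
    for stage in stages:
        missing = [key for key in stage.needs if key not in state]
        if missing:
            # Fail at the handoff instead of letting a later stage guess.
            raise KeyError(f"{stage.name} is missing inputs: {missing}")
        state[stage.produces] = stage.run(state)
    return state


result = run_pipeline("handoffs", [RESEARCH, WRITE, EDIT])
print(result["final"])
```

Declaring inputs and outputs per stage keeps the pipeline predictable: a misordered or skipped stage fails immediately at the boundary where the data is missing.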
The reviewer found a critical bug in the code. In the supervisor pattern, what happens next?
Production: Cost, Latency, Debugging Multi-Agent Systems
A multi-agent system looks impressive in a demo. In production the numbers surface: a supervisor plus 3 agents, each called 5 times, is (1 + 3) x 5 = 20 API calls per task. At `USD 2.50/1M` input tokens for GPT-4o that is `USD 0.80` per request, which implies about 16,000 input tokens per call, for work that cost `USD 0.15` with a single agent. Each new agent adds not just API cost but also a new failure point and 5-10 seconds of latency. A system designed without these numbers is an expensive toy, not a production tool.
Debugging is even harder. An error in the final output could have entered at any stage: the researcher found incorrect data, the supervisor delegated wrongly, the coder ignored context, or the reviewer missed a bug. Without tracing through LangSmith or OpenTelemetry, locating the source is like finding a needle in a haystack. **Error propagation between agents** is the first problem teams hit in production.
| Metric | Single Agent | Multi-Agent (3) | Multi-Agent (5) |
|---|---|---|---|
| API calls per task | 3-5 | 10-20 | 20-40 |
| Cost (GPT-4o) | USD 0.02-0.05 | USD 0.10-0.30 | USD 0.30-0.80 |
| Latency | 5-15 sec | 20-60 sec | 60-180 sec |
| Failure points | 1 | 4 (supervisor + 3) | 6 (supervisor + 5) |
| Debugging complexity | Low | Medium | High |
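The cost rows in the table follow from simple arithmetic: calls, times tokens per call, times price per token. The sketch below counts input tokens only and uses 16,000 input tokens per call, an assumption back-derived from the lesson's example of 20 calls costing USD 0.80 at USD 2.50 per 1M tokens. Real bills also include output tokens, which this ignores.

```python
def cost_per_task(calls: int, tokens_per_call: int, usd_per_million: float) -> float:
    """Input-token cost of one task in USD (output tokens are ignored)."""
    return calls * tokens_per_call * usd_per_million / 1_000_000


GPT_4O = 2.50        # USD per 1M input tokens, from the lesson
GPT_4O_MINI = 0.15   # USD per 1M input tokens, from the lesson

# Assumed workload: supervisor + 3 agents, 5 steps each, 16,000 tokens per call.
calls = (1 + 3) * 5
tokens_per_call = 16_000

all_gpt_4o = cost_per_task(calls, tokens_per_call, GPT_4O)
all_mini = cost_per_task(calls, tokens_per_call, GPT_4O_MINI)

print(f"all GPT-4o:      USD {all_gpt_4o:.2f}")
print(f"all gpt-4o-mini: USD {all_mini:.3f}")
```

Plugging measured call counts and token sizes into a function like this is a quick way to check whether a proposed extra agent fits the budget before building it.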
- **Tracing** - tools like LangSmith, Arize Phoenix, or OpenTelemetry show the full path of a request through agents: who decided what, which tools were called, how much it cost
- **Replay** - the ability to "replay" a session with a modified prompt for one agent, without restarting the entire system
- **A/B testing** - comparing configurations: 3 agents vs 5 agents, GPT-4o vs Claude, supervisor vs pipeline
- **Graceful degradation** - if one agent goes down (rate limit, timeout), the system continues working at reduced quality rather than crashing entirely
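Graceful degradation from the list above can be sketched as a retry followed by a cheaper fallback. The names here are hypothetical (`call_with_fallback`, `flaky_reviewer`, `checklist_only`), and the exception types are assumptions about how a rate limit or timeout would surface in a given client library.

```python
from typing import Callable


def call_with_fallback(
    primary: Callable[[str], str],
    fallback: Callable[[str], str],
    payload: str,
    retries: int = 1,
) -> tuple[str, str]:
    """Try the primary agent, then degrade to the fallback instead of crashing."""
    last_error: Exception | None = None
    for _ in range(retries + 1):
        try:
            return primary(payload), "primary"
        except (TimeoutError, ConnectionError) as exc:
            last_error = exc  # transient failure: retry, then degrade
    return fallback(payload), f"degraded ({type(last_error).__name__})"


# Hypothetical agents: the reviewer is rate limited, the fallback skips the LLM.
def flaky_reviewer(code: str) -> str:
    raise TimeoutError("rate limited")


def checklist_only(code: str) -> str:
    return "unreviewed: " + code


result, mode = call_with_fallback(flaky_reviewer, checklist_only, "def f(): ...")
print(mode, "->", result)
```

Returning the mode alongside the result lets downstream agents and dashboards see that the answer came from the degraded path.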
Rule for production: start with the **minimum number of agents** and add new ones only when metrics show the need. 2 well-tuned agents will give a better result than 5 hastily configured ones. Each new agent adds USD 0.05-0.10 per task and 5-10 seconds of latency.
Misconception: setting temperature=0 makes a multi-agent pipeline fully deterministic and reproducible.
Reality: deterministic individual agents do not produce a deterministic pipeline. Sources of non-determinism include tool call ordering with parallel agents, race conditions in shared state, non-deterministic retrieval results from vector databases, and model API changes between versions.
Reproducibility in multi-agent systems requires deterministic orchestration logic, not just deterministic model outputs. Pipeline-level determinism is a separate concern from token-level determinism.
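One of these sources, result ordering with parallel agents, is easy to reproduce. In the sketch below the workers are deterministic stubs (the names and delays are made up), yet merging by completion order depends on timing, while merging in a fixed order gives the same result on every run.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def worker(name: str, delay: float) -> str:
    """Deterministic stub agent: same input, same output, variable latency."""
    time.sleep(delay)
    return f"{name}-result"


def merge_in_completion_order(jobs: dict[str, float]) -> list[str]:
    """Order depends on which worker finishes first: not reproducible."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = [pool.submit(worker, name, delay) for name, delay in jobs.items()]
        return [future.result() for future in as_completed(futures)]


def merge_in_fixed_order(jobs: dict[str, float]) -> list[str]:
    """Order is set by the orchestrator (sorted names), not by timing."""
    with ThreadPoolExecutor(max_workers=len(jobs)) as pool:
        futures = {name: pool.submit(worker, name, delay) for name, delay in jobs.items()}
        return [futures[name].result() for name in sorted(futures)]


jobs = {"researcher_b": 0.01, "researcher_a": 0.03}
print(merge_in_completion_order(jobs))  # order may vary between runs
print(merge_in_fixed_order(jobs))       # always sorted by agent name
```

The workers still run in parallel in both versions. Only the merge step changes, which is the point: determinism is enforced in the orchestration layer.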
A multi-agent system uses GPT-4o (USD 2.50/1M tokens) for all 5 agents. The cost per task is USD 0.80. Which optimization will yield the greatest savings?
Misconception: more agents = smarter system.
Reality: more agents = more failure points, higher cost, and harder debugging. Errors multiply as they propagate through the agent chain.
Every agent adds latency (5-10 sec), cost (USD 0.05-0.10), and a probability of error. An error from the researcher gets passed to the coder as fact, and the coder passes it to the reviewer as working code, so the error travels the full chain. 2 well-configured agents consistently outperform 5 hastily added ones. Coordination is expensive in tokens, time, and debugging complexity.
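How quickly errors compound can be estimated with a simplified model. Assume each agent in a linear chain succeeds independently with the same probability and nothing downstream corrects an upstream mistake; both are assumptions made for illustration, and the 95% figure is hypothetical. The chance that the whole chain is correct is then that probability raised to the number of agents.

```python
def chain_success(per_agent_success: float, agents: int) -> float:
    """Probability that every agent in a linear chain succeeds.

    Assumes independent failures and no downstream correction.
    """
    if not 0.0 <= per_agent_success <= 1.0:
        raise ValueError("per_agent_success must be between 0 and 1")
    if agents < 1:
        raise ValueError("agents must be at least 1")
    return per_agent_success ** agents


# Hypothetical per-agent success rate of 95%.
for count in (1, 2, 5):
    print(f"{count} agent(s): {chain_success(0.95, count):.2f}")
```

Under these assumptions, five agents at 95% each deliver a fully correct chain only about 77% of the time, which is why a reviewer stage and tracing matter more as the chain grows.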
Misconception: a multi-agent system always outperforms a single agent.
Reality: multi-agent is only justified with 20+ tools, complex branching tasks, or when one agent needs to verify another's work.
For "answer a customer question," a single agent with 5 tools and a sharp system prompt wins: no context-passing overhead, no supervisor latency, no multiplied cost. Architectural decisions should be driven by metrics, not intuition.
Key Takeaways
- Multi-agent is justified with 20+ tools, complex tasks, or a need for different perspectives, not for simple bots
- Supervisor pattern - centralized control via LangGraph; peer-to-peer via handoff (OpenAI Swarm, Vercel AI SDK); hierarchical - for 10+ agents
- Specialization: each agent - its own system prompt, 3-7 tools, optimal model (GPT-4o for reasoning, Claude for code)
- Pipeline for predictable workflows, dynamic routing for branching tasks - always start with a pipeline
- Errors multiply through the agent chain, so tracing (for example, via LangSmith) is mandatory; routing the supervisor to gpt-4o-mini can cut cost by 40-50%
Questions for Reflection
- A multi-agent pipeline produces inconsistent results across runs despite deterministic agents (temperature=0). What are the three most likely causes, and how would each be diagnosed?
What's Next
Multi-agent systems are the top level of abstraction in AI Engineering. Next - practical skills: how to evaluate AI system quality, how to manage costs in production, how to ensure security.
- AI Evaluation and Testing — How to measure multi-agent system quality: accuracy, coherence, task completion rate
- Cost Management — Cost optimization strategies: model routing, caching, prompt compression
- AI Safety and Security — Prompt injection, tool abuse, data leakage - protecting multi-agent systems
Related Lessons
- aie-18-agent-frameworks — Multi-agent systems are built on agent frameworks
- aie-31-evaluation — Multi-agent quality needs measurable evaluation
- aie-34-prompt-injection-deep — More agents means a larger attack surface
- aie-29-cost-management — Multiple agents multiply token cost
- sd-10-microservices — Specialized agents mirror specialized microservices
- ml-48-rl-intro — Agent coordination relates to multi-agent reinforcement learning
- net-55-message-queues
- net-53-distributed-intro