AI Engineering

Multi-Agent Systems: Orchestration, Communication, Agent Specialization

Lesson Objectives

Understand when a multi-agent system is justified and when a single agent is sufficient
Master communication patterns: supervisor, peer-to-peer, hierarchical
Design specialized agents: researcher, coder, reviewer with different models
Implement orchestration via pipeline and dynamic routing (supervisor)
Optimize production metrics: cost, latency, debugging

Prerequisites

Agent frameworks: LangGraph, Vercel AI SDK
Agent fundamentals: ReAct, planning, memory

From Generative Agents to AutoGen and CrewAI

The idea that several LLM agents can work together took shape in 2023. In April 2023 Joon Sung Park and colleagues at Stanford and Google published Generative Agents: Interactive Simulacra of Human Behavior: 25 agents lived in a small virtual town, planned their days, talked, remembered events, and formed relationships. The work showed that believable behavior emerges from memory, reflection, and planning, and it was presented at the UIST conference in the fall of 2023. In August 2023 Microsoft Research released AutoGen (arXiv paper 2308.08155), a framework where a task is solved through a conversation among several agents, with hooks for a human in the loop and for code execution. The same year João Moura built CrewAI with a focus on roles: researcher, writer, editor. The shared theme across all three projects is the orchestration of specialized agents instead of one do-it-all agent.

One agent with GPT-4o - `USD 0.80` per task. Five specialized agents with gpt-4o-mini - `USD 0.12` for the same task, three times faster (running in parallel). Multi-agent is not about complexity - it is about specialization and cost. GitHub Copilot Workspace proved this: a pipeline of planner, coder, and reviewer achieves 40% acceptance rate on real issues. A single agent with identical tools - 15%. The gap is not in intelligence. It is in architecture.

GitHub Copilot Workspace - pipeline of planner, coder, reviewer: 40% acceptance rate vs 15% for a single agent
ChatGPT Deep Research - orchestrator-worker pattern: researchers search 50+ sources in parallel, synthesizer assembles the report
Cursor Composer - shared state via codebase context: one agent reads files, another edits, a third runs tests
Salesforce Agentforce - supervisor pattern in enterprise: sales agent, support agent, analytics agent on shared CRM data via message passing

Why Multiple Agents: The Limits of a Single Agent

Microsoft Research (2024) verified the limits of a single agent empirically: with 20+ tools, single-agent accuracy drops to 64%. Split the same tools across 3 specialized agents with 7 tools each and accuracy rises to 89%. Not because three heads are smarter than one, but because each agent gets a clean context and a minimal tool surface to navigate.

Single agent problem	How multi-agent solves it
Context window overflows during long tasks	Each agent has its own context - focused on its subtask
20+ tools - the model gets confused choosing	Each agent has 3-7 tools - specialization
One perspective - bias in the solution	Different agents - different "opinions" - debate/consensus
An error in one step breaks everything	A reviewer agent checks the work of others
Cannot use different models for different tasks	Researcher on GPT-4o (accuracy), Writer on Claude (style)

A multi-agent system is not "just several agents." It is an **interaction architecture**: who communicates with whom, who makes decisions, how data flows between agents, how errors propagate through the chain. CrewAI, AutoGen, LangGraph - each framework implements this architecture differently. Without a clear design - chaos: agents duplicate work, loop, contradict each other.

Multi-agent is not always better. Each additional agent means additional API calls (and cost), extra latency, and more failure points. For the task "answer a customer question," a single agent with 5 tools is a better fit than 3 agents passing context to each other.

Why is splitting into 3 specialized agents with 7 tools each more effective than one agent with 21 tools?

Communication Patterns: Supervisor, Peer-to-Peer, Hierarchical

How agents communicate defines the behavior of the entire system. Three patterns. **Supervisor**: one central LLM-router, all others are workers - the standard structure in production LangGraph pipelines. **Peer-to-peer**: agents transfer control via handoff - OpenAI Swarm, Vercel AI SDK. **Hierarchical**: a tree of managers and sub-teams - for enterprise systems with 10+ agents.

Pattern	Pros	Cons	When to use
Supervisor	Centralized control, easy to track the flow	Supervisor is a bottleneck and single point of failure	Clear task hierarchy, 3-5 agents
Peer-to-Peer	Flexibility, no bottleneck, agents adapt	Harder to track, risk of looping	Dynamic tasks, chatbot routing
Hierarchical	Scales to large teams	Complex implementation, high cost	10+ agents, enterprise pipelines
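The supervisor row of the table can be sketched in plain Python. This is a minimal illustration, not LangGraph's API: the worker functions and the `route` function are hypothetical stand-ins, and in a real system `route` would be an LLM call that reads the state and names the next worker.

```python
from typing import Callable

# Each worker is a plain function: it reads the shared state and returns an update.
Worker = Callable[[dict], dict]


def researcher(state: dict) -> dict:
    return {"notes": f"notes on {state['task']}"}


def coder(state: dict) -> dict:
    return {"code": f"# code based on: {state['notes']}"}


def reviewer(state: dict) -> dict:
    return {"approved": "code" in state}


WORKERS: dict[str, Worker] = {
    "researcher": researcher,
    "coder": coder,
    "reviewer": reviewer,
}


def route(state: dict) -> str:
    """Stand-in for the supervisor's LLM call: pick the next worker or finish."""
    if "notes" not in state:
        return "researcher"
    if "code" not in state:
        return "coder"
    if "approved" not in state:
        return "reviewer"
    return "FINISH"


def run_supervisor(task: str, max_steps: int = 10) -> tuple[dict, list[str]]:
    state: dict = {"task": task}
    trace: list[str] = []
    for _ in range(max_steps):  # hard cap guards against routing loops
        choice = route(state)
        if choice == "FINISH":
            break
        trace.append(choice)
        # The worker never calls another worker; control returns to the supervisor.
        state.update(WORKERS[choice](state))
    return state, trace


final_state, trace = run_supervisor("rate limiter")
print(trace)
```

Every worker result flows back through `route`, which is what makes the flow easy to track and also what makes the supervisor the bottleneck named in the table.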

In the supervisor pattern, after the researcher finishes work, what happens next?

Specialized Agents: Researcher, Coder, Reviewer

A surgical team works not because every surgeon is brilliant, but because the anesthesiologist never picks up the scalpel and the surgeon never monitors blood pressure. Division of responsibility is the source of effectiveness. The same holds for agents: each gets its own system prompt, its own tools, its own model.

Researcher on GPT-4o (`USD 2.50/1M` input tokens), because it needs strong reasoning to analyze contradictory sources. A reviewer that only works through a checklist can run on gpt-4o-mini (`USD 0.15/1M`), because powerful reasoning is unnecessary there. This is **model routing by strengths**: a roughly 16x price difference while maintaining quality at critical stages.

Agent role	Model	Tools	System prompt focus
Researcher	GPT-4o (strong reasoning)	webSearch, readUrl, arxivSearch	Accuracy, sources, structure
Coder	Claude Sonnet (strong code)	writeFile, runTests, readFile	Typing, production-ready, tests
Reviewer	GPT-4o	lintCode, securityScan	Security, performance, edge cases
Writer	Claude Sonnet (good style)	no tools (text only)	Clarity, structure, tone of voice
Planner	GPT-4o / o1	no tools (reasoning only)	Decomposition, prioritization, dependencies

Using different models for different agents is a powerful optimization. Researcher and Reviewer can use GPT-4o (`USD 2.50/1M` input tokens), while for simple tasks like routing or summarization GPT-4o-mini (`USD 0.15/1M`) will do - 16x savings while maintaining quality at critical stages.

A rule of thumb for tool count: if an agent needs more than 10 tools, it's a signal that the agent is doing too much. Split it into two specialized ones. The optimal zone: 3-7 tools per agent.
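One way to make the role table and the tool limit concrete is a small spec object that refuses oversized agents. This is a sketch under assumptions: `AgentSpec`, `input_cost`, and the `router` role are hypothetical names, and the prices are the input-token prices quoted in this lesson.

```python
from dataclasses import dataclass

# Input-token prices quoted in this lesson, in USD per 1M tokens
PRICE_PER_1M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}
MAX_TOOLS = 10  # above this, the lesson suggests splitting the agent


@dataclass(frozen=True)
class AgentSpec:
    role: str
    model: str
    tools: tuple[str, ...]
    prompt_focus: str

    def __post_init__(self) -> None:
        if len(self.tools) > MAX_TOOLS:
            raise ValueError(f"{self.role}: {len(self.tools)} tools, split this agent")


def input_cost(spec: AgentSpec, tokens: int) -> float:
    """Input-token cost in USD for one agent; output tokens are ignored here."""
    return PRICE_PER_1M[spec.model] * tokens / 1_000_000


researcher = AgentSpec(
    role="researcher",
    model="gpt-4o",
    tools=("webSearch", "readUrl", "arxivSearch"),
    prompt_focus="Accuracy, sources, structure",
)
# Hypothetical lightweight role for routing, placed on the cheaper model
router = AgentSpec(role="router", model="gpt-4o-mini", tools=(), prompt_focus="Pick the next agent")

ratio = PRICE_PER_1M["gpt-4o"] / PRICE_PER_1M["gpt-4o-mini"]
print(round(ratio, 1))
```

The ratio works out to about 16.7, which is the "16x" figure used in this section.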

For the coder agent in a multi-agent system, the model Claude Sonnet is chosen, and for the researcher agent - GPT-4o. What is the reason for this choice?

Orchestration: Pipeline, Handoff Protocol, Shared State

Agents are defined. Now the question shifts to a different level: how data flows between them, how shared state via Redis synchronizes parallel workers, how an error in one agent propagates through the entire system. Two main approaches: **pipeline** (sequential conveyor with a fixed order) and **dynamic routing** via supervisor, deciding on the fly.

Aspect	Pipeline	Dynamic Routing (Supervisor)
Agent order	Fixed: A → B → C	Dynamic: supervisor decides
Flexibility	Low - always the same sequence	High - can skip steps, go back
Predictability	High - always know what comes next	Medium - depends on supervisor decisions
Cost	Fixed (N agent calls)	Variable (supervisor + N agent calls)
Complexity	Simple (linear)	Medium (graph with conditions)
When to use	Clear workflow: research → writing → editing	Uncertain tasks with branching

Start with a pipeline - it's simpler, more predictable, and cheaper. Switch to the supervisor pattern only when the pipeline can't handle it: tasks require branching, skipping steps, or dynamic agent ordering.
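A fixed-order pipeline with state handed from stage to stage might look like the sketch below. The stage functions are hypothetical placeholders for agent calls, and the state is a plain dict rather than an external store such as Redis.

```python
from typing import Callable

Stage = Callable[[dict], dict]


def research(state: dict) -> dict:
    return {**state, "notes": f"facts about {state['topic']}"}


def write(state: dict) -> dict:
    return {**state, "draft": f"Article using {state['notes']}"}


def edit(state: dict) -> dict:
    return {**state, "final": state["draft"].strip() + "."}


# The order is fixed at definition time: research -> writing -> editing
PIPELINE: list[tuple[str, Stage]] = [
    ("research", research),
    ("write", write),
    ("edit", edit),
]


def run_pipeline(topic: str) -> dict:
    state: dict = {"topic": topic}
    for name, stage in PIPELINE:
        try:
            state = stage(state)  # each stage receives everything produced so far
        except Exception as exc:
            # Name the failing stage so the error does not surface as a vague final-output bug
            raise RuntimeError(f"stage '{name}' failed") from exc
    return state


result = run_pipeline("multi-agent systems")
print(result["final"])
```

Moving to dynamic routing means replacing the `for` loop over a fixed list with a decision step that picks the next stage from the current state, which is where the extra supervisor call in the cost row comes from.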

The reviewer found a critical bug in the code. In the supervisor pattern, what happens next?

Production: Cost, Latency, Debugging Multi-Agent Systems

A multi-agent system looks impressive in a demo. In production, the numbers surface: (1 supervisor + 3 agents) x 5 steps = 20 API calls per task. At `USD 2.50/1M` tokens for GPT-4o that comes to `USD 0.80` for a request that cost `USD 0.15` with a single agent. Each new agent adds not just API cost, but also a new failure point and 5-10 seconds of latency. Ignore this and the system is an expensive toy, not a production tool.
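The arithmetic above can be checked with a small estimator. The call count and the price come from the text; the figure of 16,000 input tokens per call is an inference (it is what `USD 0.80` across 20 calls implies when only input tokens are priced), not a number stated in the lesson.

```python
GPT_4O_PRICE_PER_1M = 2.50  # USD per 1M input tokens, as quoted in the lesson


def api_calls(agents: int, steps: int, supervisor: bool = True) -> int:
    """Calls per task when every participant is invoked once per step."""
    participants = agents + (1 if supervisor else 0)
    return participants * steps


def task_cost(calls: int, tokens_per_call: int, price_per_1m: float) -> float:
    """Input-token cost in USD; output tokens are left out of this sketch."""
    return calls * tokens_per_call * price_per_1m / 1_000_000


calls = api_calls(agents=3, steps=5)
# 16_000 tokens per call is an assumption back-derived from the lesson's totals
cost = task_cost(calls, tokens_per_call=16_000, price_per_1m=GPT_4O_PRICE_PER_1M)
print(calls, round(cost, 2))
```

Under these assumptions the script reproduces 20 calls and `USD 0.80`. Much of the token volume in practice comes from context being re-sent on every call, so trimming what each agent receives is usually the first lever to pull.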

Debugging is even harder. An error in the final output could have entered at any stage: the researcher found incorrect data, the supervisor delegated wrong, the coder ignored context, the reviewer missed a bug. Without tracing through LangSmith or OpenTelemetry - finding a needle in a haystack. **Error propagation between agents** is the first problem teams hit in production.

Metric	Single Agent	Multi-Agent (3)	Multi-Agent (5)
API calls per task	3-5	10-20	20-40
Cost (GPT-4o)	USD 0.02-0.05	USD 0.10-0.30	USD 0.30-0.80
Latency	5-15 sec	20-60 sec	60-180 sec
Failure points	1	4 (supervisor + 3)	6 (supervisor + 5)
Debugging complexity	Low	Medium	High

**Tracing** - tools like LangSmith, Arize Phoenix, or OpenTelemetry show the full path of a request through agents: who decided what, which tools were called, how much it cost
**Replay** - the ability to "replay" a session with a modified prompt for one agent, without restarting the entire system
**A/B testing** - comparing configurations: 3 agents vs 5 agents, GPT-4o vs Claude, supervisor vs pipeline
**Graceful degradation** - if one agent goes down (rate limit, timeout), the system continues working at reduced quality rather than crashing entirely
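Tracing and graceful degradation from the list above can be combined in one wrapper. This is a minimal sketch with hypothetical names (`call_with_fallback`, `flaky_reviewer`, `lint_only`); a production system would send the trace records to a tool such as LangSmith or an OpenTelemetry exporter instead of a list.

```python
import time
from typing import Any, Callable


class AgentError(Exception):
    """Raised by an agent wrapper when the underlying call fails."""


def call_with_fallback(
    name: str,
    primary: Callable[[Any], Any],
    fallback: Callable[[Any], Any],
    payload: Any,
    trace: list[dict],
) -> Any:
    """Run the primary agent; on timeout or agent failure use the cheaper fallback."""
    start = time.perf_counter()
    try:
        result, degraded = primary(payload), False
    except (TimeoutError, AgentError):
        result, degraded = fallback(payload), True
    # One trace record per agent call: who ran, whether quality was reduced, how long it took
    trace.append({"agent": name, "degraded": degraded, "seconds": time.perf_counter() - start})
    return result


def flaky_reviewer(code: str) -> dict:
    raise TimeoutError("rate limited")  # simulates an unavailable agent


def lint_only(code: str) -> dict:
    return {"approved": None, "note": "full review skipped, lint only"}


trace: list[dict] = []
review = call_with_fallback("reviewer", flaky_reviewer, lint_only, "print('hi')", trace)
print(review["note"], trace[0]["degraded"])
```

The `degraded` flag matters as much as the fallback itself: downstream consumers and dashboards can tell a reduced-quality answer from a normal one.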

Rule for production: start with the **minimum number of agents** and only add new ones when metrics show the need. 2 well-tuned agents will give a better result than 5 hastily configured ones. Each new agent is +USD 0.05-0.10 per task and +5-10 seconds of latency.

Myth: Setting temperature=0 makes a multi-agent pipeline fully deterministic and reproducible.

Reality: Deterministic individual agents do not produce a deterministic pipeline. Sources of non-determinism: tool call ordering with parallel agents, race conditions in shared state, non-deterministic retrieval results from vector databases, and model API changes between versions.

Why: Reproducibility in multi-agent systems requires deterministic orchestration logic, not just deterministic model outputs. Pipeline-level determinism is a separate concern from token-level determinism.
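One of the sources named above, result ordering with parallel agents, can be shown with `asyncio`. In this sketch the agents are hypothetical coroutines that just sleep; the point is the difference between merging results as they finish and merging them in declared order.

```python
import asyncio


async def agent(name: str, delay: float) -> tuple[str, str]:
    await asyncio.sleep(delay)  # stands in for a model or tool call of variable duration
    return name, f"{name} result"


async def merge_by_completion(jobs: list[tuple[str, float]]) -> list[str]:
    """Order depends on which agent happens to finish first."""
    order = []
    for finished in asyncio.as_completed([agent(n, d) for n, d in jobs]):
        name, _ = await finished
        order.append(name)
    return order


async def merge_by_declared_order(jobs: list[tuple[str, float]]) -> list[str]:
    """asyncio.gather returns results in input order, regardless of finish time."""
    results = await asyncio.gather(*(agent(n, d) for n, d in jobs))
    return [name for name, _ in results]


jobs = [("researcher", 0.03), ("coder", 0.01), ("reviewer", 0.02)]
completion_order = asyncio.run(merge_by_completion(jobs))
declared_order = asyncio.run(merge_by_declared_order(jobs))
print(completion_order)  # varies with timing
print(declared_order)    # always matches the order of jobs
```

If the merged results are concatenated into the next agent's prompt, the first variant can feed a different prompt on every run even when each model call is deterministic.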

A multi-agent system uses GPT-4o (USD 2.50/1M tokens) for all 5 agents. The cost per task is USD 0.80. Which optimization will yield the greatest savings?

Myth: More agents = smarter system.

Reality: More agents = more failure points, higher cost, harder debugging. Errors multiply as they propagate through the agent chain.

Why: Every agent adds latency (5-10 sec), cost (USD 0.05-0.10), and a probability of error. An error from the researcher gets passed to the coder as fact, the coder passes it to the reviewer as working code, and the error travels the full chain. 2 well-configured agents consistently outperform 5 hastily added ones. Coordination is expensive in tokens, time, and debugging complexity.
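A back-of-the-envelope model shows why errors multiply along a chain. It assumes each agent succeeds independently with the same probability and that nothing downstream catches an upstream error; the 95% figure is purely illustrative, not a measurement from the lesson.

```python
def chain_success(per_agent_success: float, agents: int) -> float:
    """Probability that every agent in a sequential chain succeeds.

    Assumes independent errors and no recovery step between agents.
    """
    return per_agent_success ** agents


for n in (1, 2, 5):
    print(n, round(chain_success(0.95, n), 2))
```

Under these assumptions, five agents at 95% each succeed end to end only about 77% of the time, which is why a reviewer step that can catch and return errors is worth more than another agent that passes them along.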

Myth: A multi-agent system always outperforms a single agent.

Reality: Multi-agent is only justified with 20+ tools, complex branching tasks, or when one agent needs to verify another's work.

Why: For "answer a customer question," a single agent with 5 tools and a sharp system prompt wins: no context-passing overhead, no supervisor latency, no multiplied cost. Architectural decisions should be driven by metrics, not intuition.

Key Takeaways

Multi-agent is justified with 20+ tools, complex tasks, need for different perspectives - not for simple bots
Supervisor pattern - centralized control via LangGraph; peer-to-peer via handoff (OpenAI Swarm, Vercel AI SDK); hierarchical - for 10+ agents
Specialization: each agent - its own system prompt, 3-7 tools, optimal model (GPT-4o for reasoning, Claude for code)
Pipeline for predictable workflows, dynamic routing for branching tasks - always start with a pipeline
Errors multiply through the agent chain - tracing via LangSmith is mandatory; model routing to gpt-4o-mini for supervisor cuts cost by 40-50%

Reflection Questions

A multi-agent pipeline produces inconsistent results across runs despite deterministic agents (temperature=0). What are the three most likely causes, and how would each be diagnosed?

What's Next

Multi-agent systems are the top level of abstraction in AI Engineering. Next - practical skills: how to evaluate AI system quality, how to manage costs in production, how to ensure security.

AI Evaluation and Testing — How to measure multi-agent system quality: accuracy, coherence, task completion rate
Cost Management — Cost optimization strategies: model routing, caching, prompt compression
AI Safety and Security — Prompt injection, tool abuse, data leakage - protecting multi-agent systems

Related Lessons

aie-18-agent-frameworks — Multi-agent systems are built on agent frameworks
aie-31-evaluation — Multi-agent quality needs measurable evaluation
aie-34-prompt-injection-deep — More agents means a larger attack surface
aie-29-cost-management — Multiple agents multiply token cost
sd-10-microservices — Specialized agents mirror specialized microservices
ml-48-rl-intro — Agent coordination relates to multi-agent reinforcement learning
net-55-message-queues
net-53-distributed-intro