AI Engineering

Multi-Agent Systems: Orchestration, Communication, Agent Specialization

Lesson Objectives

  • Understand when a multi-agent system is justified and when a single agent is sufficient
  • Master communication patterns: supervisor, peer-to-peer, hierarchical
  • Design specialized agents: researcher, coder, reviewer with different models
  • Implement orchestration via pipeline and dynamic routing (supervisor)
  • Optimize production metrics: cost, latency, debugging

Prerequisites

  • Agent frameworks: LangGraph, Vercel AI SDK
  • Agent fundamentals: ReAct, planning, memory

From Generative Agents to AutoGen and CrewAI

The idea that several LLM agents can work together took shape in 2023. In April 2023 Joon Sung Park and colleagues at Stanford and Google published Generative Agents: Interactive Simulacra of Human Behavior: 25 agents lived in a small virtual town, planned their days, talked, remembered events, and formed relationships. The work showed that believable behavior emerges from memory, reflection, and planning, and it was presented at the UIST conference in the fall of 2023. In August 2023 Microsoft Research released AutoGen (arXiv paper 2308.08155), a framework where a task is solved through a conversation among several agents, with hooks for a human in the loop and for code execution. The same year João Moura built CrewAI with a focus on roles: researcher, writer, editor. The shared theme across all three projects is the orchestration of specialized agents instead of one do-it-all agent.

One agent on GPT-4o costs `USD 0.80` per task. Five specialized agents on gpt-4o-mini cost `USD 0.12` for the same task and finish three times faster because they run in parallel. Multi-agent is not about complexity; it is about specialization and cost. GitHub Copilot Workspace illustrates this: a pipeline of planner, coder, and reviewer reaches a 40% acceptance rate on real issues, while a single agent with identical tools reaches 15%. The gap is not in intelligence. It is in architecture.

  • GitHub Copilot Workspace - pipeline of planner, coder, reviewer: 40% acceptance rate vs 15% for a single agent
  • ChatGPT Deep Research - orchestrator-worker pattern: researchers search 50+ sources in parallel, synthesizer assembles the report
  • Cursor Composer - shared state via codebase context: one agent reads files, another edits, a third runs tests
  • Salesforce Agentforce - supervisor pattern in enterprise: sales agent, support agent, analytics agent on shared CRM data via message passing

Why Multiple Agents: The Limits of a Single Agent

Microsoft Research (2024) measured the effect of tool count empirically: with 20+ tools, single-agent accuracy drops to 64%. Split the same work across 3 specialized agents with 7 tools each and accuracy rises to 89%. This is not because three heads are smarter than one. It is because each agent gets a clean context and a minimal tool surface to navigate.

| Single agent problem | How multi-agent solves it |
|---|---|
| Context window overflows during long tasks | Each agent has its own context - focused on its subtask |
| 20+ tools - the model gets confused choosing | Each agent has 3-7 tools - specialization |
| One perspective - bias in the solution | Different agents - different "opinions" - debate/consensus |
| An error in one step breaks everything | A reviewer agent checks the work of others |
| Cannot use different models for different tasks | Researcher on GPT-4o (accuracy), Writer on Claude (style) |

A multi-agent system is not "just several agents." It is an **interaction architecture**: who communicates with whom, who makes decisions, how data flows between agents, how errors propagate through the chain. CrewAI, AutoGen, LangGraph - each framework implements this architecture differently. Without a clear design - chaos: agents duplicate work, loop, contradict each other.

Multi-agent is not always better. Each additional agent means additional API calls (cost), latency, and failure points. For the task "answer a customer question," a single agent with 5 tools is a better fit than 3 agents passing context to each other.

Why is splitting into 3 specialized agents with 7 tools each more effective than one agent with 21 tools?

Communication Patterns: Supervisor, Peer-to-Peer, Hierarchical

How agents communicate defines the behavior of the entire system. Three patterns. **Supervisor**: one central LLM-router, all others are workers - the standard structure in production LangGraph pipelines. **Peer-to-peer**: agents transfer control via handoff - OpenAI Swarm, Vercel AI SDK. **Hierarchical**: a tree of managers and sub-teams - for enterprise systems with 10+ agents.

| Pattern | Pros | Cons | When to use |
|---|---|---|---|
| Supervisor | Centralized control, easy to track the flow | Supervisor is a bottleneck and single point of failure | Clear task hierarchy, 3-5 agents |
| Peer-to-Peer | Flexibility, no bottleneck, agents adapt | Harder to track, risk of looping | Dynamic tasks, chatbot routing |
| Hierarchical | Scales to large teams | Complex implementation, high cost | 10+ agents, enterprise pipelines |
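The supervisor pattern can be sketched in a few lines of plain Python. This is a minimal illustration, not LangGraph's API: the worker functions, the `route` function (a rule-based stand-in for the supervisor's LLM call), and the `max_steps` cap are all assumptions made for the example.

```python
from typing import Callable

# Hypothetical workers: plain functions standing in for LLM-backed agents.
def researcher(state: dict) -> dict:
    return {**state, "notes": f"facts about {state['task']}"}

def writer(state: dict) -> dict:
    return {**state, "draft": f"article based on {state['notes']}"}

WORKERS: dict[str, Callable[[dict], dict]] = {
    "researcher": researcher,
    "writer": writer,
}

def route(state: dict) -> str:
    """Stand-in for the supervisor's LLM call: pick the next worker or finish."""
    if "notes" not in state:
        return "researcher"
    if "draft" not in state:
        return "writer"
    return "FINISH"

def run_supervisor(task: str, max_steps: int = 10) -> dict:
    state: dict = {"task": task, "trace": []}
    for _ in range(max_steps):  # hard cap guards against routing loops
        choice = route(state)
        if choice == "FINISH":
            return state
        state = WORKERS[choice](state)  # control always returns to the supervisor
        state["trace"].append(choice)
    raise RuntimeError("supervisor exceeded max_steps")

result = run_supervisor("vector databases")
print(result["trace"])
```

Workers never call each other: every step goes back through `route`, which is what makes the flow easy to trace and also what makes the supervisor a single point of failure.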

In the supervisor pattern, after the researcher finishes work, what happens next?

Specialized Agents: Researcher, Coder, Reviewer

A surgical team works not because every surgeon is brilliant, but because the anesthesiologist never picks up the scalpel and the surgeon never monitors blood pressure. Division of responsibility is the source of effectiveness. The same holds for agents: each gets its own system prompt, its own tools, its own model.

The researcher runs on GPT-4o (`USD 2.50/1M` tokens) because it needs strong reasoning to analyze contradictory sources. A reviewer that only works through a checklist can run on gpt-4o-mini (`USD 0.15/1M`), since powerful reasoning is unnecessary there; a reviewer that hunts for security and edge-case bugs may still justify GPT-4o. This is **model routing by strengths**: a roughly 16x price difference while maintaining quality at critical stages.

| Agent role | Model | Tools | System prompt focus |
|---|---|---|---|
| Researcher | GPT-4o (strong reasoning) | webSearch, readUrl, arxivSearch | Accuracy, sources, structure |
| Coder | Claude Sonnet (strong code) | writeFile, runTests, readFile | Typing, production-ready, tests |
| Reviewer | GPT-4o | lintCode, securityScan | Security, performance, edge cases |
| Writer | Claude Sonnet (good style) | no tools (text only) | Clarity, structure, tone of voice |
| Planner | GPT-4o / o1 | no tools (reasoning only) | Decomposition, prioritization, dependencies |

Using different models for different agents is a powerful optimization. Researcher and Reviewer can use GPT-4o (`USD 2.50/1M` input tokens), while for simple tasks like routing or summarization GPT-4o-mini (`USD 0.15/1M`) will do - 16x savings while maintaining quality at critical stages.

The ten-tool rule: if an agent needs more than 10 tools, that is a signal the agent is doing too much. Split it into two specialized ones. The optimal zone is 3-7 tools per agent.
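One way to make role, model, and tool surface explicit is a small declarative spec per agent. The sketch below is illustrative only: `AgentSpec`, `MAX_TOOLS`, and `needs_split` are hypothetical names, and the prices are the input prices quoted in this lesson, not values read from any API.

```python
from dataclasses import dataclass

MAX_TOOLS = 10  # assumed threshold: more than 10 tools suggests splitting the agent

@dataclass(frozen=True)
class AgentSpec:
    role: str
    model: str
    system_prompt: str
    tools: tuple[str, ...] = ()

    def needs_split(self) -> bool:
        """True when the agent's tool surface exceeds the threshold."""
        return len(self.tools) > MAX_TOOLS

# Input prices in USD per 1M tokens, as quoted in the lesson text.
PRICE_PER_1M = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

researcher = AgentSpec(
    role="researcher",
    model="gpt-4o",
    system_prompt="Be accurate and cite sources.",
    tools=("webSearch", "readUrl", "arxivSearch"),
)
router = AgentSpec(
    role="router",
    model="gpt-4o-mini",
    system_prompt="Pick the next agent.",
)

ratio = PRICE_PER_1M[researcher.model] / PRICE_PER_1M[router.model]
print(f"price ratio: {ratio:.1f}x")
```

Keeping the spec separate from the runtime makes it cheap to A/B test a different model for a single role without touching the orchestration code.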

For the coder agent in a multi-agent system, the model Claude Sonnet is chosen, and for the researcher agent - GPT-4o. What is the reason for this choice?

Orchestration: Pipeline, Handoff Protocol, Shared State

Agents are defined. Now the question shifts to a different level: how data flows between them, how shared state via Redis synchronizes parallel workers, how an error in one agent propagates through the entire system. Two main approaches: **pipeline** (sequential conveyor with a fixed order) and **dynamic routing** via supervisor, deciding on the fly.

| Aspect | Pipeline | Dynamic Routing (Supervisor) |
|---|---|---|
| Agent order | Fixed: A → B → C | Dynamic: supervisor decides |
| Flexibility | Low - always the same sequence | High - can skip steps, go back |
| Predictability | High - always know what comes next | Medium - depends on supervisor decisions |
| Cost | Fixed (N agent calls) | Variable (supervisor + N agent calls) |
| Complexity | Simple (linear) | Medium (graph with conditions) |
| When to use | Clear workflow: research → writing → editing | Uncertain tasks with branching |

Start with a pipeline - it's simpler, more predictable, and cheaper. Switch to the supervisor pattern only when the pipeline can't handle it: tasks require branching, skipping steps, or dynamic agent ordering.
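A fixed pipeline is little more than a list of stages applied in order to a shared state. The sketch below assumes stub stage functions (`research`, `write`, `edit`) in place of real agent calls; the names and the dictionary-based state are illustrative choices.

```python
from typing import Callable

State = dict[str, str]
Stage = Callable[[State], State]

# Hypothetical stages: each reads what earlier stages wrote and adds its own key.
def research(state: State) -> State:
    return {**state, "notes": f"notes on {state['task']}"}

def write(state: State) -> State:
    return {**state, "draft": f"draft from {state['notes']}"}

def edit(state: State) -> State:
    return {**state, "final": state["draft"].replace("draft", "edited text")}

PIPELINE: list[Stage] = [research, write, edit]  # fixed order: A -> B -> C

def run_pipeline(task: str, stages: list[Stage] = PIPELINE) -> State:
    state: State = {"task": task}
    for stage in stages:  # exactly len(stages) agent calls, no routing decisions
        state = stage(state)
    return state

final_state = run_pipeline("multi-agent systems")
print(final_state["final"])
```

The cost is fixed at one call per stage, and a failure can be attributed to the stage that raised. Moving to a supervisor means replacing the `for` loop with a routing decision, which is where the extra cost and unpredictability come from.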

The reviewer found a critical bug in the code. In the supervisor pattern, what happens next?

Production: Cost, Latency, Debugging Multi-Agent Systems

A multi-agent system looks impressive in a demo. In production, the numbers surface: (supervisor + 3 agents) × 5 steps = 20 API calls per task. At `USD 2.50/1M` tokens for GPT-4o that is `USD 0.80` per request, versus `USD 0.15` with a single agent. Each new agent adds not just API cost but also a new failure point and 5-10 seconds of latency. Ignore this and the result is an expensive toy, not a production tool.
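The arithmetic behind these figures can be checked with a short calculation. The token count per call is an assumption: roughly 16,000 tokens per call is what `USD 0.80` over 20 calls implies at the quoted input price, and the sketch ignores output-token pricing entirely.

```python
PRICE_PER_1M_TOKENS = 2.50  # USD, GPT-4o input price quoted in the lesson

def calls_per_task(workers: int, steps: int, supervisors: int = 1) -> int:
    """Every step involves one call per participating agent, supervisor included."""
    return (supervisors + workers) * steps

def cost_per_task(calls: int, tokens_per_call: int) -> float:
    """Input-token cost only; output tokens are deliberately left out."""
    return calls * tokens_per_call * PRICE_PER_1M_TOKENS / 1_000_000

calls = calls_per_task(workers=3, steps=5)
# Assumed average context size per call, back-solved from the USD 0.80 figure.
cost = cost_per_task(calls, tokens_per_call=16_000)
print(f"{calls} calls, USD {cost:.2f} per task")
```

Plugging in measured token counts from tracing, rather than an assumed average, turns this into a quick estimate of what adding or removing an agent will do to the bill.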

Debugging is even harder. An error in the final output could have entered at any stage: the researcher found incorrect data, the supervisor delegated wrong, the coder ignored context, the reviewer missed a bug. Without tracing through LangSmith or OpenTelemetry - finding a needle in a haystack. **Error propagation between agents** is the first problem teams hit in production.

| Metric | Single Agent | Multi-Agent (3) | Multi-Agent (5) |
|---|---|---|---|
| API calls per task | 3-5 | 10-20 | 20-40 |
| Cost (GPT-4o) | USD 0.02-0.05 | USD 0.10-0.30 | USD 0.30-0.80 |
| Latency | 5-15 sec | 20-60 sec | 60-180 sec |
| Failure points | 1 | 4 (supervisor + 3) | 6 (supervisor + 5) |
| Debugging complexity | Low | Medium | High |

  • **Tracing** - tools like LangSmith, Arize Phoenix, or OpenTelemetry show the full path of a request through agents: who decided what, which tools were called, how much it cost
  • **Replay** - the ability to "replay" a session with a modified prompt for one agent, without restarting the entire system
  • **A/B testing** - comparing configurations: 3 agents vs 5 agents, GPT-4o vs Claude, supervisor vs pipeline
  • **Graceful degradation** - if one agent goes down (rate limit, timeout), the system continues working at reduced quality rather than crashing entirely
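Graceful degradation usually comes down to catching an agent failure at the orchestration layer and substituting a cheaper fallback. The sketch below is one possible shape: `AgentUnavailable`, `review_with_llm`, and `review_with_linter` are hypothetical names, and the outage is simulated rather than produced by a real API.

```python
class AgentUnavailable(Exception):
    """Raised by a hypothetical agent call on rate limit or timeout."""

def review_with_llm(code: str) -> str:
    # Simulated outage; a real implementation would call the reviewer model here.
    raise AgentUnavailable("rate limited")

def review_with_linter(code: str) -> str:
    # Cheap deterministic fallback with reduced coverage.
    return "lint-only review: no syntax issues found"

def review(code: str) -> dict:
    try:
        return {"review": review_with_llm(code), "degraded": False}
    except AgentUnavailable as exc:
        # Keep the pipeline alive, but record that quality was reduced and why.
        return {
            "review": review_with_linter(code),
            "degraded": True,
            "reason": str(exc),
        }

outcome = review("print('hello')")
print(outcome)
```

Recording the `degraded` flag in the result matters as much as the fallback itself: without it, tracing cannot distinguish a full review from a reduced one.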

Rule for production: start with the **minimum number of agents** and only add new ones when metrics show the need. 2 well-tuned agents will give a better result than 5 hastily configured ones. Each new agent adds `USD 0.05-0.10` per task and 5-10 seconds of latency.

Misconception: Setting temperature=0 makes a multi-agent pipeline fully deterministic and reproducible.

Reality: Deterministic individual agents do not produce a deterministic pipeline. Sources of non-determinism: tool call ordering with parallel agents, race conditions in shared state, non-deterministic retrieval results from vector databases, and model API changes between versions.

Reproducibility in multi-agent systems requires deterministic orchestration logic, not just deterministic model outputs. Pipeline-level determinism is a separate concern from token-level determinism.
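The ordering problem with parallel agents is easy to reproduce without any model at all. In the sketch below, `run_agent` is a stub with random latency; collecting results as they complete yields a run-dependent order, while collecting them by a stable key does not. The agent names and the sorting key are assumptions for the example.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

AGENTS = ["researcher_a", "researcher_b", "researcher_c"]

def run_agent(name: str) -> tuple[str, str]:
    time.sleep(random.uniform(0, 0.02))  # simulated variable latency
    return name, f"{name} findings"

def gather_in_arrival_order() -> list[str]:
    """Order depends on which agent finishes first, so it can differ between runs."""
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_agent, name) for name in AGENTS]
        return [future.result()[1] for future in as_completed(futures)]

def gather_deterministic() -> list[str]:
    """Same parallel work, but results are merged by a stable key (agent name)."""
    with ThreadPoolExecutor() as pool:
        results = dict(pool.map(run_agent, AGENTS))
    return [results[name] for name in sorted(results)]

print(gather_in_arrival_order())
print(gather_deterministic())
```

If the merged list is then pasted into a synthesizer prompt, the arrival-order version feeds the model a different prompt on different runs even at temperature=0, which is one way deterministic agents still produce a non-deterministic pipeline.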

A multi-agent system uses GPT-4o (USD 2.50/1M tokens) for all 5 agents. The cost per task is USD 0.80. Which optimization will yield the greatest savings?

Misconception: More agents = smarter system.

Reality: More agents = more failure points, higher cost, harder debugging. Errors multiply as they propagate through the agent chain.

Every agent adds latency (5-10 sec), cost (`USD 0.05-0.10`), and a probability of error. An error from the researcher gets passed to the coder as fact, and the coder passes it to the reviewer as working code, so the error travels the full chain. 2 well-configured agents consistently outperform 5 hastily added ones. Coordination is expensive in tokens, time, and debugging complexity.
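A rough way to see why errors multiply: if each agent in a sequential chain is correct with some probability and no downstream agent catches upstream mistakes, the chain is only correct when every link is. The independence assumption and the 95% per-agent figure below are illustrative, not measurements.

```python
def chain_success(per_agent_success: float, agents: int) -> float:
    """Probability that every agent in a sequential chain is correct,
    assuming independent errors and no downstream correction."""
    return per_agent_success ** agents

for n in (1, 2, 5):
    print(f"{n} agents: {chain_success(0.95, n):.1%} end-to-end success")
```

Under these assumptions, five agents at 95% each land at about 77% end to end. A reviewer agent that actually catches upstream errors breaks the independence assumption in a good way, which is the case where adding an agent raises reliability instead of lowering it.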

Misconception: A multi-agent system always outperforms a single agent.

Reality: Multi-agent is only justified with 20+ tools, complex branching tasks, or when one agent needs to verify another's work.

For "answer a customer question," a single agent with 5 tools and a sharp system prompt wins: no context-passing overhead, no supervisor latency, no multiplied cost. Architectural decisions should be driven by metrics, not intuition.

Key Takeaways

  • Multi-agent is justified with 20+ tools, complex tasks, need for different perspectives - not for simple bots
  • Supervisor pattern - centralized control via LangGraph; peer-to-peer via handoff (OpenAI Swarm, Vercel AI SDK); hierarchical - for 10+ agents
  • Specialization: each agent - its own system prompt, 3-7 tools, optimal model (GPT-4o for reasoning, Claude for code)
  • Pipeline for predictable workflows, dynamic routing for branching tasks - always start with a pipeline
  • Errors multiply through the agent chain - tracing via LangSmith is mandatory; model routing to gpt-4o-mini for supervisor cuts cost by 40-50%

Reflection Questions

  • A multi-agent pipeline produces inconsistent results across runs despite deterministic agents (temperature=0). What are the three most likely causes, and how would each be diagnosed?

What's Next

Multi-agent systems are the top level of abstraction in AI Engineering. Next - practical skills: how to evaluate AI system quality, how to manage costs in production, how to ensure security.

  • AI Evaluation and Testing — How to measure multi-agent system quality: accuracy, coherence, task completion rate
  • Cost Management — Cost optimization strategies: model routing, caching, prompt compression
  • AI Safety and Security — Prompt injection, tool abuse, data leakage - protecting multi-agent systems

Related Lessons

  • aie-18-agent-frameworks — Multi-agent systems are built on agent frameworks
  • aie-31-evaluation — Multi-agent quality needs measurable evaluation
  • aie-34-prompt-injection-deep — More agents means a larger attack surface
  • aie-29-cost-management — Multiple agents multiply token cost
  • sd-10-microservices — Specialized agents mirror specialized microservices
  • ml-48-rl-intro — Agent coordination relates to multi-agent reinforcement learning
  • net-55-message-queues
  • net-53-distributed-intro