AI Engineering

Autonomous Agents: Devin, SWE-Agent, OpenHands - AI That Writes Code on Its Own

Lesson Objectives

Break down the autonomy spectrum: from autocomplete to fully autonomous agents
Understand task decomposition: Plan-and-Execute, replanning on errors
Implement the code generation loop: write → run → analyze → fix → repeat
Study self-correction (4 levels) and the SWE-bench benchmark for evaluating agents

March 2024. Cognition Labs unveils Devin: the AI receives a GitHub issue, opens a browser, writes code, runs tests, and opens a PR. Cost per task: USD 5. A junior developer takes an hour, charges USD 50, and still does the better job about half the time. But the other half, the half Devin handles without a single human prompt, is quietly reshaping how AI engineering teams are built.

Devin (Cognition Labs) - the first fully autonomous commercial agent: receives a GitHub issue, opens a browser, writes code, runs tests, creates a PR - without messaging anyone
SWE-Agent (Princeton) - open-source agent built on Claude/GPT-4, 26% SWE-bench Lite. Its shell interface (ACI) became the template for dozens of production systems
OpenHands (formerly OpenDevin) - 53% SWE-bench Verified at an average USD 1.80 per task. Teams use it as a stand-in for junior-level (L3) work on well-defined bugs
GitHub Copilot Workspace - Microsoft embedded autonomous planning directly into GitHub Issues: the agent drafts a change plan, the developer approves, the agent implements

From AutoGPT to Devin and SWE-agent

On March 30, 2023 Toran Bruce Richards released AutoGPT, one of the first widely known attempts to make GPT-4 autonomous: the model set its own subgoals, called tools, and ran in a loop without a human approving every step. AutoGPT often got stuck and looped, but it demonstrated the idea of an agent that drives itself toward a goal. In October 2023 a team at Princeton (Carlos Jimenez, John Yang, and colleagues) published SWE-bench, a benchmark of real GitHub tasks: a model is given an issue and asked to submit a patch that passes the repository's tests. That gave the industry an honest metric for code-writing agents. In March 2024 Cognition Labs introduced Devin as an autonomous AI engineer that takes an issue, writes code, runs tests, and opens a PR; at launch Devin resolved about 13.9 percent of SWE-bench tasks, well above prior results. In April 2024 Princeton released SWE-agent, an open agent with an agent-computer interface that set a new state of the art on the full SWE-bench and became a base for research. The work was later presented at NeurIPS 2024.

Prerequisites

Multi-Agent Systems: Orchestration, Communication, Agent Specialization

Autonomous vs Assisted: the autonomy spectrum of AI agents

AI coding assistants live on an **autonomy spectrum** - from inline autocomplete suggesting the next line, to a fully autonomous agent that receives a GitHub issue and delivers a ready PR with tests an hour later. Cursor Agent, Claude Code, and Devin are not just different products - they sit at fundamentally different architectural points on this spectrum.

**The defining trait of Level 4 (Fully Autonomous):** the agent operates inside a **sandbox** - a Docker container with a terminal, filesystem, and browser. Devin, for example, spins up a fresh Linux environment per task: clones the repo, installs deps, iterates until tests go green, then opens a PR. Production credentials never enter the picture by design.

Characteristic	Copilot (L0-1)	Cursor Agent (L2)	Devin / SWE-Agent (L4)
Context	Current file	Entire project	Project + docs + web
Actions	Suggest code	Edit + run terminal	Edit + run + browse + git
Cycle	Single step	Multi-step (5-20)	Long-horizon (50-200 steps)
Self-correction	None	Retry with error	Analyze → fix → retest
Sandbox	None	IDE terminal	Docker container
Cost per task (USD)	0.01	0.10-1.00	1-50
SWE-bench Lite	N/A	~20%	26-49%

**SWE-bench** is a benchmark from Princeton for evaluating autonomous agents: 2,294 real GitHub issues from Python open-source projects. The agent receives an issue description and must create a patch that passes the repository's tests. SWE-bench Verified is a human-validated subset of 500 of those tasks. Best result (2026): Claude Code - 72.0% on SWE-bench Verified.
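The pass criterion can be made concrete with a small sketch. SWE-bench task instances list two groups of tests: `FAIL_TO_PASS` (tests that fail before the fix and must pass after it) and `PASS_TO_PASS` (tests that must keep passing). The helper names `is_resolved` and `resolved_rate` and the test names below are hypothetical, and this is not the official evaluation harness, only an illustration of how a "resolved" percentage could be computed.

```python
from typing import Dict, Iterable, List


def is_resolved(fail_to_pass: Iterable[str], pass_to_pass: Iterable[str],
                results: Dict[str, bool]) -> bool:
    """True if every required test passed after the patch was applied."""
    required = list(fail_to_pass) + list(pass_to_pass)
    # A test missing from the results counts as a failure.
    return all(results.get(name, False) for name in required)


def resolved_rate(outcomes: List[bool]) -> float:
    """Percentage of tasks resolved; 0.0 for an empty run."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


# Hypothetical test names and results for two tasks.
task_a = is_resolved(["test_parse_empty"], ["test_parse_basic"],
                     {"test_parse_empty": True, "test_parse_basic": True})
task_b = is_resolved(["test_merge_keys"], ["test_merge_basic"],
                     {"test_merge_keys": True, "test_merge_basic": False})
rate = resolved_rate([task_a, task_b])
```

In this toy run the second patch fixes the target test but breaks an existing one, so it does not count as resolved.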

What fundamentally distinguishes a fully autonomous agent (Devin, SWE-Agent) from an agentic IDE (Cursor Agent)?

Task Decomposition: how the agent breaks a task into steps

Solving a real GitHub issue in a single LLM call is rarely feasible - the relevant codebase context alone often won't fit. The first step is always **task decomposition**: turning a vague issue description into a sequence of atomic actions. SWE-Agent uses ReAct (reason, act, observe the result, repeat), Devin uses Plan-and-Execute (full plan first, then execution with replanning on failure), and Claude Code uses a hybrid with a dynamic step tree.

**Replanning is the key to quality.** SWE-Agent without replanning solves 15% of SWE-bench. With replanning (analyze error → new plan) - 26%. With iterative replanning (up to 5 attempts) - up to 40%. Each iteration costs USD 0.50-2.00 in tokens.
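Plan-and-Execute with replanning can be sketched in a few lines. This is a minimal illustration, not how Devin or SWE-Agent is actually implemented: `plan_and_execute`, `toy_planner`, and `toy_executor` are hypothetical names, the planner stands in for an LLM call, and the executor stands in for a sandbox. The default of 5 replans mirrors the "up to 5 attempts" figure above.

```python
from typing import Callable, List, Optional, Tuple

Step = str
# A planner maps (task, last error or None) to an ordered list of steps.
Planner = Callable[[str, Optional[str]], List[Step]]
# An executor runs one step and returns (succeeded, observation).
Executor = Callable[[Step], Tuple[bool, str]]


def plan_and_execute(task: str, make_plan: Planner, run_step: Executor,
                     max_replans: int = 5) -> Tuple[bool, List[str]]:
    """Run a plan step by step; on the first failing step, ask for a new plan."""
    log: List[str] = []
    error: Optional[str] = None
    for attempt in range(max_replans + 1):
        plan = make_plan(task, error)
        log.append(f"plan {attempt}: {plan}")
        for step in plan:
            ok, observation = run_step(step)
            log.append(f"{step} -> {observation}")
            if not ok:
                error = observation  # feed the failure into the next plan
                break
        else:
            return True, log  # every step of this plan succeeded
    return False, log


# Toy stand-ins for the LLM planner and the sandbox (both hypothetical).
def toy_planner(task: str, error: Optional[str]) -> List[Step]:
    if error is None:
        return ["locate bug", "patch utils.py", "run tests"]
    return ["locate bug", "patch utils.py", "patch callers", "run tests"]


patched: List[str] = []


def toy_executor(step: Step) -> Tuple[bool, str]:
    patched.append(step)
    if step == "run tests" and "patch callers" not in patched:
        return False, "test_api failed: caller still uses old signature"
    return True, "ok"


solved, history = plan_and_execute("fix issue", toy_planner, toy_executor)
```

The first plan fails at the test step, the error text is passed back to the planner, and the second plan adds the missing step. Setting `max_replans=0` gives the static-plan behaviour that the replanning numbers are compared against.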

Why is replanning (recomposing the plan after an error) critically important for autonomous agents?

Code Generation Loop: write → run → fix → repeat

The heart of an autonomous agent is the **generation loop**: write code, run it in a sandbox, parse the errors, fix, repeat. OpenHands averages 60 iterations per SWE-bench task, Claude Code around 80. Each step sees the full history of prior actions and their outputs - that accumulated context is what creates the impression of coherent understanding across a long task.
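The write → run → analyze → fix cycle can be sketched as follows. This is a simplified illustration under stated assumptions: `run_code`, `generation_loop`, and `toy_fixer` are hypothetical names, the fixer stands in for an LLM call, and a plain subprocess is used where a real agent would use an isolated sandbox such as a Docker container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, List, Tuple

# A "fixer" stands in for the LLM: it receives the current code plus the
# error output and returns a new version of the code.
Fixer = Callable[[str, str], str]


def run_code(code: str, timeout: float = 10.0) -> Tuple[bool, str]:
    """Write code to a temp file, run it in a subprocess, return (ok, output)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(code)
        try:
            proc = subprocess.run([sys.executable, str(path)],
                                  capture_output=True, text=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return proc.returncode == 0, proc.stdout + proc.stderr


def generation_loop(code: str, fix: Fixer,
                    max_iters: int = 5) -> Tuple[bool, str, List[str]]:
    """Run, inspect the output, fix, and repeat until success or the cap."""
    history: List[str] = []
    for _ in range(max_iters):
        ok, output = run_code(code)
        history.append(output)  # every observation stays in the history
        if ok:
            return True, code, history
        code = fix(code, output)
    return False, code, history


def toy_fixer(code: str, error: str) -> str:
    # Hypothetical stand-in for an LLM call: repairs one known typo.
    if "NameError" in error:
        return code.replace("print(totl)", "print(total)")
    return code


buggy = "total = sum([1, 2, 3])\nprint(totl)\n"
ok, final_code, history = generation_loop(buggy, toy_fixer)
```

The `max_iters` cap matters as much as the loop itself: without it, a fixer that keeps returning the same broken code would run indefinitely.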

**Context Window Management** is the main technical challenge. History from 100 iterations x 2000 chars = 200K chars. Strategies: summarize old history, keep only the last N actions + initial plan, compression via LLM.
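One of the strategies above, keeping the initial plan plus the last N actions and condensing everything older, might look like this. The function name `build_context` and its parameters are hypothetical, and the "summary" here is a crude truncation to the first line of each old entry; a real agent would more likely call an LLM to compress that part.

```python
from typing import List


def build_context(initial_plan: str, history: List[str], keep_last: int = 5,
                  summary_chars: int = 80) -> str:
    """Keep the plan and the last N entries verbatim; shrink everything older."""
    cut = max(len(history) - keep_last, 0)
    old, recent = history[:cut], history[cut:]
    parts = [f"PLAN:\n{initial_plan}"]
    if old:
        # Crude stand-in for LLM summarization: first line of each old
        # entry, truncated to summary_chars characters.
        digest = [entry.splitlines()[0][:summary_chars] if entry else ""
                  for entry in old]
        parts.append("EARLIER STEPS (condensed):\n" + "\n".join(digest))
    parts.append("RECENT STEPS:\n" + "\n".join(recent))
    return "\n\n".join(parts)


# 100 iterations of roughly 2000 characters each, as in the estimate above.
history = [f"step {i}: " + "x" * 2000 for i in range(100)]
raw_chars = sum(len(entry) for entry in history)
context = build_context("fix the bug", history)
```

With these toy inputs the raw history is a little over 200K characters, while the trimmed context stays under 20K. The trade-off is that details from early steps are lost unless the digest preserves them.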

An autonomous agent solves a GitHub issue in 50 iterations. Each iteration is 1 LLM call with ~3000 input tokens and ~500 output tokens. At USD 3 per 1M input and USD 15 per 1M output, what is the approximate cost in dollars?
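The arithmetic behind this question can be worked through in code. The helper name `task_cost_usd` is hypothetical, and the sketch assumes, as the question does, that every iteration has the same input size; in a real agent the input usually grows as history accumulates, so the true cost tends to be higher.

```python
def task_cost_usd(iterations: int, input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Total cost of a task when every iteration uses the same token counts."""
    input_cost = iterations * input_tokens * usd_per_m_input / 1_000_000
    output_cost = iterations * output_tokens * usd_per_m_output / 1_000_000
    return input_cost + output_cost


# 50 iterations x 3000 input tokens  = 150K tokens at USD 3 per 1M  -> USD 0.45
# 50 iterations x 500 output tokens  =  25K tokens at USD 15 per 1M -> USD 0.375
cost = task_cost_usd(50, 3000, 500, 3.0, 15.0)
print(f"{cost:.3f}")  # prints 0.825
```

Output tokens are six times fewer here but contribute almost half the bill, because they are priced five times higher.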

Self-Correction and SWE-bench: how to evaluate autonomous agents

Self-correction is the agent's ability to **detect and fix its own errors** without external prompting. This is where the gap between L2 and L4 becomes a chasm: Cursor Agent stops and asks a human, Devin reads the stack trace, traces the dependency, and fixes related files autonomously. Four levels of correction difficulty have dramatically different success rates.

**SWE-bench** is the primary benchmark for autonomous coding agents:

Agent	SWE-bench Verified (%)	Avg Cost/Task (USD)	Avg Steps	Year
Claude Code (Anthropic)	72.0%	~2.50	~80	2025
OpenHands + Claude	53.0%	~1.80	~60	2025
Devin (Cognition)	48.4%	~5.00	~120	2025
SWE-Agent + GPT-4o	26.0%	~1.20	~40	2024
RAG baseline	4.8%	~0.10	1	2024

**SWE-bench does not equal real development.** The benchmark tests bug fixes in Python open-source. Real tasks are harder: greenfield development, UI, infra, multi-repo. SWE-bench is a useful proxy, but not an absolute metric.

**How to safely deploy autonomous agents in production:**

**Start with well-defined tasks:** bug fixes with clear reproduction steps, not vague feature requests
**Mandatory code review:** the autonomous agent creates a PR, a human reviews. Never auto-merge
**Sandbox with restrictions:** network access whitelist, no production credentials, time limit
**Cost budget per task:** USD 5-10 maximum. If the agent loops - kill and escalate to a human
**Gradually increase autonomy:** start with L2 (Cursor Agent), then L3 (Claude Code), then L4 - as trust grows
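The cost-budget and time-limit items from the checklist above can be enforced with a small guard that the agent loop calls after every step. This is a sketch only: `TaskBudget` and `BudgetExceeded` are hypothetical names, and the time and step limits are assumed values rather than figures from this lesson.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Signals that the task should be killed and escalated to a human."""


@dataclass
class TaskBudget:
    max_usd: float = 5.0          # lower end of the USD 5-10 range above
    max_seconds: float = 1800.0   # assumed time limit, not from the lesson
    max_steps: int = 200          # assumed step cap to catch loops
    spent_usd: float = 0.0
    steps: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, step_cost_usd: float) -> None:
        """Record one agent step and raise if any limit is exceeded."""
        self.spent_usd += step_cost_usd
        self.steps += 1
        elapsed = time.monotonic() - self.started_at
        if self.spent_usd > self.max_usd:
            raise BudgetExceeded(
                f"cost {self.spent_usd:.2f} USD over {self.max_usd:.2f}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"{self.steps} steps over limit {self.max_steps}")
        if elapsed > self.max_seconds:
            raise BudgetExceeded(
                f"{elapsed:.0f}s over limit {self.max_seconds:.0f}s")


budget = TaskBudget(max_usd=1.0)
escalated = False
try:
    for _ in range(10):
        budget.charge(0.40)  # pretend each step cost 40 cents
except BudgetExceeded:
    escalated = True  # here a real system would stop the sandbox and notify a human
```

Raising an exception, rather than returning a flag, makes it hard for the agent loop to ignore the limit by accident.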

Which level of self-correction is the hardest for an autonomous agent?

Key Takeaways

Don't jump to L4 (Devin, OpenHands) without a solid code review process in place - autonomy without review is technical debt squared
Replanning is not a bonus feature, it's the core mechanism: a static plan breaks against real codebases. Without it: 15% SWE-bench score. With iterative replanning: up to 40%
Count agent cost per task, not per API call: 50 iterations x (3K input + 500 output tokens) at USD 3 / USD 15 per 1M tokens = about USD 0.83 per task on Claude Sonnet
Self-correction climbs from the bottom up: compile-time (95%) every agent handles; cross-file reasoning (20-30%) only top agents with dependency graph awareness
SWE-bench is a useful signal, not a guarantee: it tests Python bug fixes only. Greenfield, UI, multi-repo work is harder than the numbers suggest

What's Next

Autonomous agents generate code. But for that code to reach the user, proper UX is needed. The next lesson covers AI UX patterns: streaming UI, confidence indicators, human-in-the-loop.

AI UX Patterns — How to design interfaces for AI products - streaming, confidence, human-in-the-loop
Multi-Agent Systems — The multi-agent architecture on which autonomous agents are built

Related Lessons

aie-19-multi-agent — Autonomous agents build on multi-agent orchestration
aie-16-tool-calling — Agents act through tool calls in a sandbox
aie-66-agent-sandboxes — Sandboxing isolates autonomous code execution safely
aie-31-evaluation — SWE-bench is the evaluation harness for agents
ml-48-rl-intro — Plan-act-observe loop mirrors RL agent-environment cycle
alg-21-dp — Task decomposition reuses subproblem-solving structure
alg-37-a-star
net-55-message-queues