AI Engineering
Autonomous Agents: Devin, SWE-Agent, OpenHands - AI That Writes Code on Its Own
Lesson objectives
- Break down the autonomy spectrum: from autocomplete to fully autonomous agents
- Understand task decomposition: Plan-and-Execute, replanning on errors
- Implement the code generation loop: write → run → analyze → fix → repeat
- Study self-correction (4 levels) and the SWE-bench benchmark for evaluating agents
March 2024. Cognition Labs unveils Devin: an AI that receives a GitHub issue, opens a browser, writes code, runs tests, and opens a PR. Cost per task: USD 5. A junior developer takes an hour, charges USD 50, and still does the job better about half the time. But the other half, the half Devin handles without a single human prompt, is quietly reshaping how AI engineering teams are built.
- Devin (Cognition Labs) - the first fully autonomous commercial agent: receives a GitHub issue, opens a browser, writes code, runs tests, creates a PR - without messaging anyone
- SWE-Agent (Princeton) - open-source agent built on Claude/GPT-4, 26% SWE-bench Lite. Its shell interface (ACI) became the template for dozens of production systems
- OpenHands (formerly OpenDevin) - 53% SWE-bench Verified at an average USD 1.80 per task. Teams use it as a drop-in for L3 junior tasks on well-defined bugs
- GitHub Copilot Workspace - Microsoft embedded autonomous planning directly into GitHub Issues: the agent drafts a change plan, the developer approves, the agent implements
From AutoGPT to Devin and SWE-agent
On March 30, 2023, Toran Bruce Richards released AutoGPT, one of the first widely known attempts to make GPT-4 autonomous: the model set its own subgoals, called tools, and ran in a loop without a human at every step. AutoGPT often got stuck and looped, but it demonstrated the idea of an agent that drives itself toward a goal. In October 2023 a team at Princeton (Carlos Jimenez, John Yang, and colleagues) published SWE-bench, a benchmark of real GitHub tasks: a model is given an issue and asked to submit a patch that passes the repository's tests. That gave the industry an honest metric for code-writing agents. In March 2024 Cognition Labs introduced Devin as an autonomous AI engineer that takes an issue, writes code, runs tests, and opens a PR; at launch Devin resolved about 13.9 percent of SWE-bench tasks, well above prior results. In April 2024 Princeton released SWE-agent, an open-source agent with an agent-computer interface (ACI) that set a new state of the art on the full SWE-bench and became a base for further research. The work was later presented at NeurIPS 2024.
Prerequisites
Autonomous vs Assisted: the autonomy spectrum of AI agents
AI coding assistants live on an **autonomy spectrum** - from inline autocomplete suggesting the next line, to a fully autonomous agent that receives a GitHub issue and delivers a ready PR with tests an hour later. Cursor Agent, Claude Code, Devin are not just different products - they are fundamentally different architectural points on this spectrum.
**The defining trait of Level 4 (Fully Autonomous):** the agent operates inside a **sandbox** - a Docker container with a terminal, filesystem, and browser. Devin, for example, spins up a fresh Linux environment per task: clones the repo, installs deps, iterates until tests go green, then opens a PR. Production credentials never enter the picture by design.
| Characteristic | Copilot (L0-1) | Cursor Agent (L2) | Devin / SWE-Agent (L4) |
|---|---|---|---|
| Context | Current file | Entire project | Project + docs + web |
| Actions | Suggest code | Edit + run terminal | Edit + run + browse + git |
| Cycle | Single step | Multi-step (5-20) | Long-horizon (50-200 steps) |
| Self-correction | None | Retry with error | Analyze → fix → retest |
| Sandbox | None | IDE terminal | Docker container |
| Cost per task (USD) | 0.01 | 0.10-1.00 | 1-50 |
| SWE-bench Lite | N/A | ~20% | 26-49% |
**SWE-bench** is a benchmark from Princeton for evaluating autonomous agents. It contains 2,294 real GitHub issues from open-source Python projects. The agent receives an issue description and must produce a patch that passes the repository's tests. Best result (2026): Claude Code, 72.0% on SWE-bench Verified (a human-validated subset of the benchmark).
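The pass criterion can be sketched in a few lines. In the SWE-bench setup, each task lists tests that the fix should turn from failing to passing, plus tests that should keep passing. The `TaskResult` class, its field names, and the sample data below are illustrative assumptions, not the official evaluation harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes after applying the agent's patch (hypothetical structure)."""
    task_id: str
    fail_to_pass: dict[str, bool]  # tests the fix must make pass -> did it pass?
    pass_to_pass: dict[str, bool]  # tests that must not regress -> did it pass?


def is_resolved(result: TaskResult) -> bool:
    """A task counts as resolved only if every listed test passes."""
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Share of resolved tasks, as a percentage."""
    if not results:
        return 0.0
    return 100.0 * sum(is_resolved(r) for r in results) / len(results)


# Made-up outcomes for four tasks.
results = [
    TaskResult("task-1", {"test_bug": True}, {"test_old": True}),
    TaskResult("task-2", {"test_bug": True}, {"test_old": False}),  # regression
    TaskResult("task-3", {"test_bug": False}, {"test_old": True}),  # bug not fixed
    TaskResult("task-4", {"test_bug": True}, {}),
]
print(f"resolved: {resolve_rate(results):.1f}%")
```

A patch that fixes the reported bug but breaks an existing test (task-2) does not count, which is why headline percentages are stricter than "the agent produced a plausible diff".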
What fundamentally distinguishes a fully autonomous agent (Devin, SWE-Agent) from an agentic IDE (Cursor Agent)?
Task Decomposition: how the agent breaks a task into steps
Solving a real GitHub issue in a single LLM call is rarely practical: the codebase context alone often won't fit in one prompt. The first step is always **task decomposition**: turning a vague issue description into a sequence of atomic actions. SWE-Agent uses ReAct (reason, act, observe the result), Devin uses Plan-and-Execute (full plan first, then execution with replanning on failure), and Claude Code uses a hybrid with a dynamic step tree.
**Replanning is the key to quality.** SWE-Agent without replanning solves 15% of SWE-bench. With replanning (analyze error → new plan) - 26%. With iterative replanning (up to 5 attempts) - up to 40%. Each iteration costs USD 0.50-2.00 in tokens.
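A minimal sketch of Plan-and-Execute with replanning might look like the following. The `planner` and `executor` callables stand in for LLM and sandbox calls; their signatures, the toy implementations, and the default of 5 replans are assumptions made for illustration, not any specific agent's API.

```python
from typing import Callable

Step = str
Planner = Callable[[str, list[str]], list[Step]]  # (task, error history) -> plan
Executor = Callable[[Step], tuple[bool, str]]     # step -> (ok, output or error)


def plan_and_execute(task: str, planner: Planner, executor: Executor,
                     max_replans: int = 5) -> tuple[bool, list[str]]:
    """Run a plan step by step; when a step fails, request a new plan.

    Returns (success, accumulated error history)."""
    errors: list[str] = []
    for _attempt in range(max_replans + 1):  # initial plan + up to max_replans
        plan = planner(task, errors)
        for step in plan:
            ok, output = executor(step)
            if not ok:
                errors.append(f"{step}: {output}")
                break  # abandon this plan; the planner sees the new error
        else:
            return True, errors  # no step failed
    return False, errors


# Toy stand-ins: the first plan misses a file, the second plan covers it.
def toy_planner(task: str, errors: list[str]) -> list[Step]:
    if not errors:
        return ["locate bug", "patch utils.py", "run tests"]
    return ["locate bug", "patch utils.py", "patch callers", "run tests"]


state = {"callers_patched": False}


def toy_executor(step: Step) -> tuple[bool, str]:
    if step == "patch callers":
        state["callers_patched"] = True
    if step == "run tests" and not state["callers_patched"]:
        return False, "ImportError in a module that calls utils"
    return True, "ok"


success, errors = plan_and_execute("fix issue", toy_planner, toy_executor)
print(success, errors)
```

The key design point is that the error history is fed back into the planner, so the second plan is informed by why the first one failed rather than being a blind retry.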
Why is replanning (recomposing the plan after an error) critically important for autonomous agents?
Code Generation Loop: write → run → fix → repeat
The heart of an autonomous agent is the **generation loop**: write code, run it in a sandbox, parse the errors, fix, repeat. OpenHands averages 60 iterations per SWE-bench task, Claude Code around 80. Each step sees the full history of prior actions and their outputs - that accumulated context is what creates the impression of coherent understanding across a long task.
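The loop itself can be sketched as below. This is a simplified illustration: `run_code` uses a plain child process, which is not a real sandbox (an actual agent would run inside a container), and `toy_fix` is a hard-coded stand-in for the LLM call that proposes a fix. All function names here are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable, Optional

History = list[tuple[str, str]]  # (source that was run, output it produced)


def run_code(source: str, timeout_s: float = 10.0) -> tuple[bool, str]:
    """Run Python source in a child process; return (ok, stdout + stderr).

    NOTE: a child process is NOT a sandbox; it only illustrates the run step."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(source, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timed out after {timeout_s}s"
    return proc.returncode == 0, proc.stdout + proc.stderr


def generation_loop(source: str, fix: Callable[[str, str, History], str],
                    max_iters: int = 5) -> tuple[Optional[str], History]:
    """write -> run -> analyze -> fix -> repeat, with a hard iteration cap."""
    history: History = []
    for _ in range(max_iters):
        ok, output = run_code(source)
        history.append((source, output))
        if ok:
            return source, history
        source = fix(source, output, history)  # the fixer sees the full history
    return None, history


def toy_fix(source: str, error_output: str, history: History) -> str:
    """Stand-in for an LLM: handles exactly one kind of error."""
    if "NameError" in error_output:
        return "total = 0\n" + source
    return source


buggy = "total += 41 + 1\nprint(total)\n"
fixed, history = generation_loop(buggy, toy_fix)
print(len(history), "iterations")
```

The iteration cap matters as much as the loop: without `max_iters`, a fixer that keeps returning the same broken code would run forever.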
**Context Window Management** is the main technical challenge. History from 100 iterations × 2,000 chars each = 200K chars. Strategies: summarize old history, keep only the last N actions plus the initial plan, or compress the history with an LLM.
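One way to combine these strategies is sketched below: keep the initial plan, keep the last N history entries verbatim, and replace everything older with a summary. The `truncate_summary` function is a crude stand-in for an LLM summarization call, and all names and defaults are assumptions for illustration.

```python
from typing import Callable


def truncate_summary(entries: list[str], width: int = 40) -> str:
    """Stand-in for an LLM summary: keep the first `width` chars of each entry."""
    return " | ".join(entry[:width] for entry in entries)


def build_context(initial_plan: str, history: list[str], keep_last: int = 5,
                  summarize: Callable[[list[str]], str] = truncate_summary,
                  ) -> list[str]:
    """Return the prompt context: plan, summary of old steps, recent steps."""
    if keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    old, recent = history[:-keep_last], history[-keep_last:]
    context = [f"PLAN: {initial_plan}"]
    if old:
        context.append(f"SUMMARY OF {len(old)} EARLIER STEPS: {summarize(old)}")
    context.extend(recent)
    return context


# 100 iterations of 2,000 characters each, as in the estimate above.
history = [f"step {i}: ".ljust(2000, "x") for i in range(100)]
context = build_context("reproduce bug, patch, run tests", history, keep_last=5)

raw_chars = sum(len(entry) for entry in history)
kept_chars = sum(len(entry) for entry in context)
print(f"raw history: {raw_chars} chars, context sent to the model: {kept_chars} chars")
```

The trade-off is that summarized steps lose detail, so an agent may repeat a dead end it has already tried; keeping the initial plan verbatim is a cheap guard against drifting off-task.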
An autonomous agent solves a GitHub issue in 50 iterations. Each iteration is 1 LLM call with ~3000 input tokens and ~500 output tokens. At USD 3 per 1M input and USD 15 per 1M output, what is the approximate cost in dollars?
Self-Correction and SWE-bench: how to evaluate autonomous agents
Self-correction is the agent's ability to **detect and fix its own errors** without external prompting. This is where the gap between L2 and L4 becomes a chasm: Cursor Agent stops and asks a human, Devin reads the stack trace, traces the dependency, and fixes related files autonomously. Four levels of correction difficulty have dramatically different success rates.
**SWE-bench** is the primary benchmark for autonomous coding agents:
| Agent | SWE-bench Verified (%) | Avg Cost/Task (USD) | Avg Steps | Year |
|---|---|---|---|---|
| Claude Code (Anthropic) | 72.0% | ~2.50 | ~80 | 2025 |
| OpenHands + Claude | 53.0% | ~1.80 | ~60 | 2025 |
| Devin (Cognition) | 48.4% | ~5.00 | ~120 | 2025 |
| SWE-Agent + GPT-4o | 26.0% | ~1.20 | ~40 | 2024 |
| RAG baseline | 4.8% | ~0.10 | 1 | 2024 |
**SWE-bench does not equal real development.** The benchmark tests bug fixes in Python open-source. Real tasks are harder: greenfield development, UI, infra, multi-repo. SWE-bench is a useful proxy, but not an absolute metric.
**How to safely deploy autonomous agents in production:**
- **Start with well-defined tasks:** bug fixes with clear reproduction steps, not vague feature requests
- **Mandatory code review:** the autonomous agent creates a PR, a human reviews. Never auto-merge
- **Sandbox with restrictions:** network access whitelist, no production credentials, time limit
- **Cost budget per task:** USD 5-10 maximum. If the agent loops - kill and escalate to a human
- **Gradually increase autonomy:** start with L2 (Cursor Agent), then L3 (Claude Code), then L4 - as trust grows
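The cost and time limits from the checklist above can be enforced with a small guard that the agent loop calls after every step. This is a sketch under stated assumptions: the `TaskBudget` class, the `BudgetExceeded` exception, the one-hour time limit, and the 200-step cap are hypothetical choices, not part of any agent framework.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Signals that the task should be killed and escalated to a human."""


@dataclass
class TaskBudget:
    max_cost_usd: float = 5.0   # lower end of the USD 5-10 range above
    max_seconds: float = 3600.0  # assumed time limit
    max_steps: int = 200         # assumed step cap
    spent_usd: float = 0.0
    steps: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def charge(self, cost_usd: float) -> None:
        """Record one agent step and raise if any limit is exceeded."""
        self.spent_usd += cost_usd
        self.steps += 1
        if self.spent_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost USD {self.spent_usd:.2f} over limit")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"{self.steps} steps over limit")
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("time limit exceeded")


# Simulated agent stuck in a loop at USD 0.75 per step.
budget = TaskBudget(max_cost_usd=5.0)
escalated_at = None
for _ in range(100):
    try:
        budget.charge(0.75)
    except BudgetExceeded as exc:
        escalated_at = budget.steps
        print(f"escalating to a human after step {escalated_at}: {exc}")
        break
```

Raising an exception rather than returning a flag makes it hard for the calling loop to ignore the limit by accident.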
Which level of self-correction is the hardest for an autonomous agent?
Key Takeaways
- Don't jump to L4 (Devin, OpenHands) without a solid code review process in place - autonomy without review is technical debt squared
- Replanning is not a bonus feature, it's the core mechanism: a static plan breaks against real codebases. Without it: 15% SWE-bench score. With iterative replanning: 40%+
- Count agent cost in iterations, not API calls: 50 iterations x (3K input + 500 output tokens) = USD 0.83 per task on Claude Sonnet
- Self-correction climbs from the bottom up: compile-time (95%) every agent handles; cross-file reasoning (20-30%) only top agents with dependency graph awareness
- SWE-bench is a useful signal, not a guarantee: it tests Python bug fixes only. Greenfield, UI, multi-repo work is harder than the numbers suggest
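The per-task cost estimate from the takeaways above can be reproduced with a short calculation. The function name and parameters are illustrative; the prices are the USD 3 per 1M input tokens and USD 15 per 1M output tokens used earlier in this lesson.

```python
def task_cost_usd(iterations: int, input_tokens: int, output_tokens: int,
                  usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimate cost assuming every iteration uses the same token counts."""
    input_cost = iterations * input_tokens * usd_per_m_input / 1_000_000
    output_cost = iterations * output_tokens * usd_per_m_output / 1_000_000
    return input_cost + output_cost


# 50 iterations, ~3,000 input and ~500 output tokens per iteration.
cost = task_cost_usd(50, 3000, 500, usd_per_m_input=3.0, usd_per_m_output=15.0)
print(f"USD {cost:.3f}")  # input 0.45 + output 0.375
```

This flat-rate model is a lower bound: if each step re-sends the full history, input tokens grow with every iteration and the real bill is higher.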
What's Next
Autonomous agents generate code. But for that code to reach the user - proper UX is needed. The next lesson covers AI UX patterns: streaming UI, confidence indicators, human-in-the-loop.
- AI UX Patterns — How to design interfaces for AI products - streaming, confidence, human-in-the-loop
- Multi-Agent Systems — The multi-agent architecture on which autonomous agents are built
Related lessons
- aie-19-multi-agent — Autonomous agents build on multi-agent orchestration
- aie-16-tool-calling — Agents act through tool calls in a sandbox
- aie-66-agent-sandboxes — Sandboxing isolates autonomous code execution safely
- aie-31-evaluation — SWE-bench is the evaluation harness for agents
- ml-48-rl-intro — Plan-act-observe loop mirrors RL agent-environment cycle
- alg-21-dp — Task decomposition reuses subproblem-solving structure
- alg-37-a-star
- net-55-message-queues