AI Engineering
Batch API: Process Thousands of Documents at Half the Cost
Цели урока
- Understand the difference between real-time and batch modes and when each is appropriate
- Implement a batch pipeline via OpenAI Batch API: JSONL, polling, results
- Evaluate the Anthropic Batch API honestly: capabilities and known limitations in 2026
- Build a production-ready pipeline with retry, monitoring, and partial failure handling
100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.
- OpenAI Batch API: 50% discount, 24h SLA (most batches in 1-6h), supports gpt-4o, gpt-4o-mini, embeddings
- Anthropic Batch API: 50% discount, up to 10,000 requests/batch - documented reliability issues April 2026
- Typical batch tasks: nightly document indexing, bulk review classification, SEO optimization, bulk embeddings generation
- 35-48% reduction in monthly LLM costs when moving 40-60% of traffic to batch (measured in production)
Предварительные знания
- LLM API integration: requests, responses, message format, error handling
- Understanding tokens and cost: price per input/output token, what makes up the bill
- Working with async code: polling, queues, background jobs
Batch API: Asynchronous Processing at Half the Price
Before 2024, bulk data processing through an LLM was costly: each of hundreds of thousands of documents went out as a synchronous request at full price, even when the result was only needed by morning. In 2024 OpenAI launched the Batch API, an asynchronous mode where requests are collected into one JSONL file, processed by the provider within a 24-hour window, and returned as a results file with a 50% discount on input and output tokens. The economics are simple: a 24-hour SLA lets the provider group tasks and use spare GPU capacity during off-peak hours, the same model and the same weights, just better hardware utilization. This turned a whole class of jobs (nightly indexing, bulk classification, embedding generation, reports) from expensive into nearly free compared to real-time. A similar mode with a 50% discount and a 10,000-request-per-batch limit later appeared at Anthropic (the Message Batches API). One engineering detail matters: every request carries a custom_id, because result order is not guaranteed and it is the only way to match a response back to its source document.
Synchronous vs Asynchronous AI: Two Operating Modes
100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.
Two modes of working with LLM APIs define different architectures and different costs:
| Criterion | Synchronous | Batch |
|---|---|---|
| Latency | 1-10 seconds | 1-24 hours |
| Price (GPT-4o) | USD 2.50/1M input tokens | USD 1.25/1M input tokens |
| User experience | Interactive | Background process |
| Failure handling | Retry immediately | Retry on schedule |
| Scale | Tens/hundreds per minute | Millions per day |
| When to use | User-facing, <1s latency matters | >1000 requests, latency not critical |
Real ROI of batch processing in production:
Measured results show 35-48% reduction in monthly LLM API costs when moving 40-60% of traffic to batch mode. Typical batch-eligible tasks: nightly document indexing, bulk review classification, report generation, bulk embeddings generation.
The Batch API costs 50% of the real-time price because:
OpenAI Batch API: Mechanics and Practice
OpenAI Batch API: JSONL file with requests -> upload -> batch creation -> polling -> download results. 24-hour SLA, most batches complete in 1-6 hours.
| Parameter | OpenAI Batch API | Value |
|---|---|---|
| Discount | 50% | GPT-4o: USD 2.50 -> USD 1.25 per 1M input |
| SLA | 24 hours | Most complete in 1-6h |
| Max file size | 200 MB / 50K requests | One JSONL file |
| Supported formats | chat completions, embeddings | Not all endpoints |
| Cancellation | Supported | Before processing begins |
The OpenAI Batch API is the most mature and reliable on the market. Documentation is thorough, statuses are clear, errors are transparent. For new batch processing projects, starting with OpenAI is the right call.
In the OpenAI Batch API, what is the custom_id in each request for?
Anthropic Batch API: Capabilities and Honest Limitations
Anthropic Message Batches API: the same 50% discount, up to 10,000 requests per batch. The API works, but has documented reliability issues (April 2026).
Documented issues with Anthropic Batch API (April 2026): opaque errors with no details (errored_request_count grows without explanation), no per-item progress during processing, inability to cancel a batch already in progress, rare cases where an entire batch completes silently with 0 results. For critical tasks: monitoring + fallback to real-time.
| Parameter | OpenAI Batch | Anthropic Batch |
|---|---|---|
| Discount | 50% | 50% |
| Max requests | 50K / file (200MB) | 10K / batch |
| Input format | JSONL file | JSON array in body |
| Progress | completed/failed counters | Only processing_status |
| Cancellation | Yes | No |
| Reliability (2026) | High | Medium (known issues) |
| When to choose | Large volumes, reliability critical | Claude-specific tasks, < 10K requests |
Despite the limitations, Anthropic Batch API is economically justified for tasks that specifically need Claude: long-context analysis (200K context window), complex instruction following, code review. The 50% discount makes claude-sonnet-4-5 batch comparable in price to gpt-4o-mini real-time.
When should OpenAI Batch API be preferred over Anthropic Batch API?
Production Batch Pipeline: Retry, Monitoring, Partial Failures
A production batch pipeline must handle partial failures, track progress, and recover gracefully. The naive approach: lose 5% of results silently.
| Problem | Symptom | Solution |
|---|---|---|
| Partial failure | 5-10% of requests return errors | Filter failed, retry via a separate batch |
| Batch expired | Batch didn't complete within 24h | Split into smaller chunks, verify JSONL validity |
| No progress (Anthropic) | processing_status unchanged for hours | 26h timeout + fallback to real-time |
| Lost job state | Server restarted - where's the batch? | Save batchId to DB before submission |
| Cost spike | Batch larger than expected | max_tokens limit + cost alerts in monitoring |
Why save the batchId to the database immediately after submission?
Batch API is just a request queue with lower response quality
The same model with the same weights is used. The only difference is scheduling: the provider processes at a convenient time
GPT-4o in batch is the same GPT-4o. Anthropic batch is the same claude-sonnet. There is no quality degradation. The discount is for flexibility in delivery time.
One large batch is better than several small ones
Large batches are riskier: a single failure mode can block everything. Optimal - 1K-5K requests, parallel batches
Partial failures in a batch make it impossible to retry just the failing parts without reprocessing everything. Smaller batches enable faster retry and better progress monitoring.
Key Takeaways
- Batch API: 50% discount for a 24-hour SLA - one of the simplest ways to cut AI costs
- OpenAI Batch: JSONL file, status polling, transparent counters - mature and reliable
- Anthropic Batch: same discount, but max 10K/batch and known reliability issues (2026)
- Required: save batchId to DB, handle partial failures, implement retry strategy
- Pricing: GPT-4o USD 2.50 -> USD 1.25 per 1M input tokens in batch mode
Вопросы для размышления
- Which tasks in the current system run in real-time but could be deferred by several hours without any user impact?
- How to debug Anthropic Batch API when errors are opaque - what to log and how to build a fallback?
- At what request volume does the savings from batch justify the pipeline complexity (retry, monitoring, DB for job state)?
What's Next
Thousands of text documents processed. The next frontier: voice agents in production - latency, VAD, platforms.
- Voice Agents in Production — Next lesson: Vapi, LiveKit, Retell - voice that doesn't drop
- Cost Management — Batch API as part of an overall cost optimization strategy
- Caching — Caching complements batch: identical requests aren't paid twice