AI Engineering

Batch API: Process Thousands of Documents at Half the Cost

Цели урока

Understand the difference between real-time and batch modes and when each is appropriate
Implement a batch pipeline via OpenAI Batch API: JSONL, polling, results
Evaluate the Anthropic Batch API honestly: capabilities and known limitations in 2026
Build a production-ready pipeline with retry, monitoring, and partial failure handling

100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.

OpenAI Batch API: 50% discount, 24h SLA (most batches in 1-6h), supports gpt-4o, gpt-4o-mini, embeddings
Anthropic Batch API: 50% discount, up to 10,000 requests/batch - documented reliability issues April 2026
Typical batch tasks: nightly document indexing, bulk review classification, SEO optimization, bulk embeddings generation
35-48% reduction in monthly LLM costs when moving 40-60% of traffic to batch (measured in production)

Предварительные знания

LLM API integration: requests, responses, message format, error handling
Understanding tokens and cost: price per input/output token, what makes up the bill
Working with async code: polling, queues, background jobs

Batch API: Asynchronous Processing at Half the Price

Before 2024, bulk data processing through an LLM was costly: each of hundreds of thousands of documents went out as a synchronous request at full price, even when the result was only needed by morning. In 2024 OpenAI launched the Batch API, an asynchronous mode where requests are collected into one JSONL file, processed by the provider within a 24-hour window, and returned as a results file with a 50% discount on input and output tokens. The economics are simple: a 24-hour SLA lets the provider group tasks and use spare GPU capacity during off-peak hours, the same model and the same weights, just better hardware utilization. This turned a whole class of jobs (nightly indexing, bulk classification, embedding generation, reports) from expensive into nearly free compared to real-time. A similar mode with a 50% discount and a 10,000-request-per-batch limit later appeared at Anthropic (the Message Batches API). One engineering detail matters: every request carries a custom_id, because result order is not guaranteed and it is the only way to match a response back to its source document.

Synchronous vs Asynchronous AI: Two Operating Modes

100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.

Two modes of working with LLM APIs define different architectures and different costs:

Criterion	Synchronous	Batch
Latency	1-10 seconds	1-24 hours
Price (GPT-4o)	USD 2.50/1M input tokens	USD 1.25/1M input tokens
User experience	Interactive	Background process
Failure handling	Retry immediately	Retry on schedule
Scale	Tens/hundreds per minute	Millions per day
When to use	User-facing, <1s latency matters	>1000 requests, latency not critical

Real ROI of batch processing in production:

Measured results show 35-48% reduction in monthly LLM API costs when moving 40-60% of traffic to batch mode. Typical batch-eligible tasks: nightly document indexing, bulk review classification, report generation, bulk embeddings generation.

The Batch API costs 50% of the real-time price because:

OpenAI Batch API: Mechanics and Practice

OpenAI Batch API: JSONL file with requests -> upload -> batch creation -> polling -> download results. 24-hour SLA, most batches complete in 1-6 hours.

Parameter	OpenAI Batch API	Value
Discount	50%	GPT-4o: USD 2.50 -> USD 1.25 per 1M input
SLA	24 hours	Most complete in 1-6h
Max file size	200 MB / 50K requests	One JSONL file
Supported formats	chat completions, embeddings	Not all endpoints
Cancellation	Supported	Before processing begins

The OpenAI Batch API is the most mature and reliable on the market. Documentation is thorough, statuses are clear, errors are transparent. For new batch processing projects, starting with OpenAI is the right call.

In the OpenAI Batch API, what is the custom_id in each request for?

Anthropic Batch API: Capabilities and Honest Limitations

Anthropic Message Batches API: the same 50% discount, up to 10,000 requests per batch. The API works, but has documented reliability issues (April 2026).

Documented issues with Anthropic Batch API (April 2026): opaque errors with no details (errored_request_count grows without explanation), no per-item progress during processing, inability to cancel a batch already in progress, rare cases where an entire batch completes silently with 0 results. For critical tasks: monitoring + fallback to real-time.

Parameter	OpenAI Batch	Anthropic Batch
Discount	50%	50%
Max requests	50K / file (200MB)	10K / batch
Input format	JSONL file	JSON array in body
Progress	completed/failed counters	Only processing_status
Cancellation	Yes	No
Reliability (2026)	High	Medium (known issues)
When to choose	Large volumes, reliability critical	Claude-specific tasks, < 10K requests

Despite the limitations, Anthropic Batch API is economically justified for tasks that specifically need Claude: long-context analysis (200K context window), complex instruction following, code review. The 50% discount makes claude-sonnet-4-5 batch comparable in price to gpt-4o-mini real-time.

When should OpenAI Batch API be preferred over Anthropic Batch API?

Production Batch Pipeline: Retry, Monitoring, Partial Failures

A production batch pipeline must handle partial failures, track progress, and recover gracefully. The naive approach: lose 5% of results silently.

Problem	Symptom	Solution
Partial failure	5-10% of requests return errors	Filter failed, retry via a separate batch
Batch expired	Batch didn't complete within 24h	Split into smaller chunks, verify JSONL validity
No progress (Anthropic)	processing_status unchanged for hours	26h timeout + fallback to real-time
Lost job state	Server restarted - where's the batch?	Save batchId to DB before submission
Cost spike	Batch larger than expected	max_tokens limit + cost alerts in monitoring

Why save the batchId to the database immediately after submission?

Batch API is just a request queue with lower response quality

The same model with the same weights is used. The only difference is scheduling: the provider processes at a convenient time

GPT-4o in batch is the same GPT-4o. Anthropic batch is the same claude-sonnet. There is no quality degradation. The discount is for flexibility in delivery time.

One large batch is better than several small ones

Large batches are riskier: a single failure mode can block everything. Optimal - 1K-5K requests, parallel batches

Partial failures in a batch make it impossible to retry just the failing parts without reprocessing everything. Smaller batches enable faster retry and better progress monitoring.

Key Takeaways

Batch API: 50% discount for a 24-hour SLA - one of the simplest ways to cut AI costs
OpenAI Batch: JSONL file, status polling, transparent counters - mature and reliable
Anthropic Batch: same discount, but max 10K/batch and known reliability issues (2026)
Required: save batchId to DB, handle partial failures, implement retry strategy
Pricing: GPT-4o USD 2.50 -> USD 1.25 per 1M input tokens in batch mode

Вопросы для размышления

Which tasks in the current system run in real-time but could be deferred by several hours without any user impact?
How to debug Anthropic Batch API when errors are opaque - what to log and how to build a fallback?
At what request volume does the savings from batch justify the pipeline complexity (retry, monitoring, DB for job state)?

What's Next

Thousands of text documents processed. The next frontier: voice agents in production - latency, VAD, platforms.

Voice Agents in Production — Next lesson: Vapi, LiveKit, Retell - voice that doesn't drop
Cost Management — Batch API as part of an overall cost optimization strategy
Caching — Caching complements batch: identical requests aren't paid twice