AI Engineering

Batch API: Process Thousands of Documents at Half the Cost

Цели урока

  • Understand the difference between real-time and batch modes and when each is appropriate
  • Implement a batch pipeline via OpenAI Batch API: JSONL, polling, results
  • Evaluate the Anthropic Batch API honestly: capabilities and known limitations in 2026
  • Build a production-ready pipeline with retry, monitoring, and partial failure handling

100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.

  • OpenAI Batch API: 50% discount, 24h SLA (most batches in 1-6h), supports gpt-4o, gpt-4o-mini, embeddings
  • Anthropic Batch API: 50% discount, up to 10,000 requests/batch - documented reliability issues April 2026
  • Typical batch tasks: nightly document indexing, bulk review classification, SEO optimization, bulk embeddings generation
  • 35-48% reduction in monthly LLM costs when moving 40-60% of traffic to batch (measured in production)

Предварительные знания

  • LLM API integration: requests, responses, message format, error handling
  • Understanding tokens and cost: price per input/output token, what makes up the bill
  • Working with async code: polling, queues, background jobs
  • Cost Management: Controlling LLM Spend
  • LLM API Integration

Batch API: Asynchronous Processing at Half the Price

Before 2024, bulk data processing through an LLM was costly: each of hundreds of thousands of documents went out as a synchronous request at full price, even when the result was only needed by morning. In 2024 OpenAI launched the Batch API, an asynchronous mode where requests are collected into one JSONL file, processed by the provider within a 24-hour window, and returned as a results file with a 50% discount on input and output tokens. The economics are simple: a 24-hour SLA lets the provider group tasks and use spare GPU capacity during off-peak hours, the same model and the same weights, just better hardware utilization. This turned a whole class of jobs (nightly indexing, bulk classification, embedding generation, reports) from expensive into nearly free compared to real-time. A similar mode with a 50% discount and a 10,000-request-per-batch limit later appeared at Anthropic (the Message Batches API). One engineering detail matters: every request carries a custom_id, because result order is not guaranteed and it is the only way to match a response back to its source document.

Synchronous vs Asynchronous AI: Two Operating Modes

100,000 documents need to be classified. Real-time is impossible and expensive. Batch API does it overnight at half the price - a pattern every production AI engineer knows.

Two modes of working with LLM APIs define different architectures and different costs:

CriterionSynchronousBatch
Latency1-10 seconds1-24 hours
Price (GPT-4o)USD 2.50/1M input tokensUSD 1.25/1M input tokens
User experienceInteractiveBackground process
Failure handlingRetry immediatelyRetry on schedule
ScaleTens/hundreds per minuteMillions per day
When to useUser-facing, <1s latency matters>1000 requests, latency not critical

Real ROI of batch processing in production:

Measured results show 35-48% reduction in monthly LLM API costs when moving 40-60% of traffic to batch mode. Typical batch-eligible tasks: nightly document indexing, bulk review classification, report generation, bulk embeddings generation.

The Batch API costs 50% of the real-time price because:

OpenAI Batch API: Mechanics and Practice

OpenAI Batch API: JSONL file with requests -> upload -> batch creation -> polling -> download results. 24-hour SLA, most batches complete in 1-6 hours.

ParameterOpenAI Batch APIValue
Discount50%GPT-4o: USD 2.50 -> USD 1.25 per 1M input
SLA24 hoursMost complete in 1-6h
Max file size200 MB / 50K requestsOne JSONL file
Supported formatschat completions, embeddingsNot all endpoints
CancellationSupportedBefore processing begins

The OpenAI Batch API is the most mature and reliable on the market. Documentation is thorough, statuses are clear, errors are transparent. For new batch processing projects, starting with OpenAI is the right call.

In the OpenAI Batch API, what is the custom_id in each request for?

Anthropic Batch API: Capabilities and Honest Limitations

Anthropic Message Batches API: the same 50% discount, up to 10,000 requests per batch. The API works, but has documented reliability issues (April 2026).

Documented issues with Anthropic Batch API (April 2026): opaque errors with no details (errored_request_count grows without explanation), no per-item progress during processing, inability to cancel a batch already in progress, rare cases where an entire batch completes silently with 0 results. For critical tasks: monitoring + fallback to real-time.

ParameterOpenAI BatchAnthropic Batch
Discount50%50%
Max requests50K / file (200MB)10K / batch
Input formatJSONL fileJSON array in body
Progresscompleted/failed countersOnly processing_status
CancellationYesNo
Reliability (2026)HighMedium (known issues)
When to chooseLarge volumes, reliability criticalClaude-specific tasks, < 10K requests

Despite the limitations, Anthropic Batch API is economically justified for tasks that specifically need Claude: long-context analysis (200K context window), complex instruction following, code review. The 50% discount makes claude-sonnet-4-5 batch comparable in price to gpt-4o-mini real-time.

When should OpenAI Batch API be preferred over Anthropic Batch API?

Production Batch Pipeline: Retry, Monitoring, Partial Failures

A production batch pipeline must handle partial failures, track progress, and recover gracefully. The naive approach: lose 5% of results silently.

ProblemSymptomSolution
Partial failure5-10% of requests return errorsFilter failed, retry via a separate batch
Batch expiredBatch didn't complete within 24hSplit into smaller chunks, verify JSONL validity
No progress (Anthropic)processing_status unchanged for hours26h timeout + fallback to real-time
Lost job stateServer restarted - where's the batch?Save batchId to DB before submission
Cost spikeBatch larger than expectedmax_tokens limit + cost alerts in monitoring

Why save the batchId to the database immediately after submission?

Batch API is just a request queue with lower response quality

The same model with the same weights is used. The only difference is scheduling: the provider processes at a convenient time

GPT-4o in batch is the same GPT-4o. Anthropic batch is the same claude-sonnet. There is no quality degradation. The discount is for flexibility in delivery time.

One large batch is better than several small ones

Large batches are riskier: a single failure mode can block everything. Optimal - 1K-5K requests, parallel batches

Partial failures in a batch make it impossible to retry just the failing parts without reprocessing everything. Smaller batches enable faster retry and better progress monitoring.

Key Takeaways

  • Batch API: 50% discount for a 24-hour SLA - one of the simplest ways to cut AI costs
  • OpenAI Batch: JSONL file, status polling, transparent counters - mature and reliable
  • Anthropic Batch: same discount, but max 10K/batch and known reliability issues (2026)
  • Required: save batchId to DB, handle partial failures, implement retry strategy
  • Pricing: GPT-4o USD 2.50 -> USD 1.25 per 1M input tokens in batch mode

Вопросы для размышления

  • Which tasks in the current system run in real-time but could be deferred by several hours without any user impact?
  • How to debug Anthropic Batch API when errors are opaque - what to log and how to build a fallback?
  • At what request volume does the savings from batch justify the pipeline complexity (retry, monitoring, DB for job state)?

What's Next

Thousands of text documents processed. The next frontier: voice agents in production - latency, VAD, platforms.

  • Voice Agents in Production — Next lesson: Vapi, LiveKit, Retell - voice that doesn't drop
  • Cost Management — Batch API as part of an overall cost optimization strategy
  • Caching — Caching complements batch: identical requests aren't paid twice

Связанные уроки

  • aie-29-cost-management
  • aie-05-api-integration
  • aie-08-streaming
  • aie-28-caching-optimization
  • aie-22-model-routing
Batch API: Process Thousands of Documents at Half the Cost

0

1

Sign In