AI Engineering

Cost Management: Counting Tokens, Optimizing Prompts, Choosing Models by Budget

Lesson Objectives

  • Calculate LLM request costs accounting for input/output tokens across different models
  • Apply prompt compression techniques to reduce input tokens by 40-70%
  • Implement model tiering - routing tasks to a model of appropriate power
  • Build a budget alert system with per-request, daily, and monthly limits
  • Create a cost dashboard with unit economics for informed business decisions

A startup launched an AI feature. One endpoint, one bug in a prompt - 10,000 tokens per request instead of 500. A month later: a USD 50K bill instead of the expected USD 5K. One user with unusual behavior was generating 80% of all spending. No per-user quotas, no spending alerts, no cost-per-request tracking. Cost management is not about saving pennies. It is the difference between a live product and a closed company.

  • GitHub Copilot routes autocomplete to gpt-4o-mini and chat to gpt-4o - a 16x price difference with comparable quality for each task type
  • Jasper AI (copywriting) said publicly: LLM costs = 60%+ of revenue before optimization. Survived by introducing model tiering and per-user quotas
  • OpenAI introduced prompt caching at a 50% discount - an acknowledgment that cost became the main barrier for enterprise adoption
  • Replit cut LLM costs by 30% through intelligent routing - no quality loss for 95% of users
  • Helicone and Langfuse - LLM observability platforms - grew 5x in 2024 on the back of this exact problem

Why Cost Management Became a Discipline

Cost management for LLMs grew alongside the API economy. When **OpenAI opened its API (2020)**, pricing was per token: every input and output token has a price, and usage scales directly with traffic. That billing model made spend unpredictable for teams used to flat server costs. As products moved from prototypes to real traffic, a single unbounded prompt or one heavy user could blow up a monthly bill. Token counting, per-user quotas, model selection by budget, and spend alerting emerged as a practical discipline rather than an afterthought.

Prerequisites

  • LLM API Integration: OpenAI, Anthropic, Open-Source Models

Anatomy of Cost: Input, Output, and Embedding Tokens

Latitude built AI-powered games. March 2024 - OpenAI invoice: USD 150,000. February was USD 2,000. A 75x spike in 60 days. No hackers. No DDoS. **The developers didn't know that output tokens cost 4 times as much as input tokens.** Long game narratives were generating 2,000-4,000 output tokens per request. No alert fired. The bill just arrived.

| Model | Input (USD per 1M tokens) | Output (USD per 1M tokens) | Ratio (input:output) | Context |
|---|---|---|---|---|
| GPT-4o | 2.50 | 10.00 | 1:4 | 128K |
| GPT-4o-mini | 0.15 | 0.60 | 1:4 | 128K |
| Claude Sonnet 4 | 3.00 | 15.00 | 1:5 | 200K |
| Claude Haiku 3.5 | 0.80 | 4.00 | 1:5 | 200K |
| Gemini 1.5 Flash | 0.075 | 0.30 | 1:4 | 1M |
| text-embedding-3-small | 0.02 | - | - | 8K |

Output tokens drain budgets quietly. A RAG chatbot with 3,500 input and 800 output tokens on GPT-4o costs USD 0.017 per request. Sounds like nothing. 100,000 requests a day - that's USD 1,675 per day, USD 50,250 per month. Same traffic on gpt-4o-mini: USD 3,015 per month. The difference is 94%. Model selection is not a matter of taste.
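The arithmetic in the paragraph above can be reproduced with a few lines of Python. This is a minimal sketch: the `PRICES` dictionary simply copies the per-1M-token prices from the table in this section (providers change them often), and the function names `request_cost` and `monthly_cost` are illustrative, not part of any SDK.

```python
# Prices in USD per 1M tokens, copied from the table in this section.
# Assumption: these are snapshot values and must be kept up to date by hand.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one request in USD: each direction is tokens * price per token."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """Projected monthly spend in USD for a fixed request profile."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


if __name__ == "__main__":
    # The RAG chatbot profile from the text: 3,500 input and 800 output tokens.
    for model in ("gpt-4o", "gpt-4o-mini"):
        per_request = request_cost(model, 3_500, 800)
        per_month = monthly_cost(model, 3_500, 800, 100_000)
        print(f"{model}: USD {per_request:.5f}/request, USD {per_month:,.0f}/month")
```

Running the projection before launch, with realistic token counts for each endpoint, makes it easy to see which side of the request (input or output) dominates the bill.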

Counting tokens before sending a request is not paranoia. It's engineering hygiene. `tiktoken` counts a typical prompt in well under a millisecond - far cheaper than discovering a surprise at month end. Langfuse and Helicone track cost per request automatically and break down spending by endpoint, model, and user ID with very little integration code.
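A pre-flight token check might look like the sketch below. It assumes the third-party `tiktoken` package and its `o200k_base` encoding (the one used by the GPT-4o family); if the package or the encoding is not available, it falls back to the rough rule of thumb of about 4 characters per token for English text. The helper names `count_tokens` and `check_prompt` are hypothetical.

```python
def count_tokens(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens with tiktoken; fall back to a rough estimate if unavailable."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return len(encoding.encode(text))
    except Exception:
        # Assumption: ~4 characters per token, rounded up. Only a rough estimate.
        return (len(text) + 3) // 4


def check_prompt(text: str, max_input_tokens: int) -> int:
    """Raise before the API call if the prompt exceeds the input token limit."""
    n_tokens = count_tokens(text)
    if n_tokens > max_input_tokens:
        raise ValueError(f"prompt has {n_tokens} tokens, limit is {max_input_tokens}")
    return n_tokens
```

A check like this would have caught the 10,000-token prompt from the opening example on the first request rather than on the first invoice.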

When using GPT-4o, which cost component typically dominates in applications that generate long responses?

Prompt Compression: Reducing Input Without Losing Quality

Microsoft researchers studied typical prompts from production systems. Conclusion: **50-70% of tokens can often be removed with little or no loss in response quality.** This is not magic. It is how transformers work: they attend to the entire context, but many words in standard prompts just duplicate the meaning of others. LLMLingua automates exactly this. Manual compression of a system prompt takes about 10 minutes - and saves money on every single request.

RAG systems suffer from a different disease: all chunks land in the context indiscriminately, including irrelevant ones. A low-relevance chunk does not help - it hurts. It adds noise, bloats the context, burns the budget. Filtering by `relevanceScore >= 0.7` and enforcing a hard token budget usually fixes it with little effect on quality, provided the threshold is tuned against real queries.
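The filtering step described above could be sketched as follows. The `Chunk` type, the `select_chunks` function, and the default budget of 2,000 tokens are assumptions for illustration; `relevance_score` plays the role of `relevanceScore` from the text, and `token_count` is assumed to be precomputed with a tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    text: str
    relevance_score: float  # similarity score from the retriever, 0..1
    token_count: int        # precomputed token length of `text`


def select_chunks(chunks, min_score=0.7, token_budget=2000):
    """Keep the most relevant chunks that fit inside a hard token budget."""
    relevant = [c for c in chunks if c.relevance_score >= min_score]
    relevant.sort(key=lambda c: c.relevance_score, reverse=True)

    selected, used = [], 0
    for chunk in relevant:
        if used + chunk.token_count > token_budget:
            continue  # would overflow the budget; a smaller chunk may still fit
        selected.append(chunk)
        used += chunk.token_count
    return selected, used
```

Sorting by score before applying the budget means that when something has to be dropped, it is the least relevant material rather than whatever happened to be retrieved last.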

| Technique | Token Savings | Quality Loss Risk | Implementation Complexity |
|---|---|---|---|
| System prompt compression | 40-70% | Low | Low |
| RAG chunk filtering | 30-60% | Medium | Medium |
| Compact few-shot | 50-70% | Low | Low |
| LLMLingua (auto-compression) | 50-70% | Medium | High |
| Limiting max_tokens output | Varies | Depends on task | Low |

Which prompt compression technique delivers the greatest savings with the least risk of quality loss?

Model Tiering: Choosing the Right Model for the Task

GitHub Copilot serves millions of requests a day. Autocomplete goes to gpt-4o-mini. Chat with code explanation goes to gpt-4o. Complex refactoring goes to claude-sonnet. Three models, three price points, one product. This is not exotic multi-model architecture - it is standard practice. **Using GPT-4o to classify a support ticket into three categories is like driving a semi-truck to buy groceries.**

Real-world model tiering numbers from a typical B2B SaaS with AI features: **60% of requests are simple** (classification, extraction), **30% are medium** (summarization, QA), **10% are complex** (reasoning, analysis). Routing simple tasks to gpt-4o-mini cuts the total bill by 40-55%. At 100K requests per day, that is the difference between roughly USD 1,600 and USD 900 per day on inference alone.
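A tier router can be as small as two lookup tables. The sketch below is illustrative only: the task-type names, the fallback to the medium tier for unknown tasks, and the helper `blended_cost` are assumptions, and the per-request costs reuse the 3,500-input / 800-output token profile from earlier in this lesson with Claude Sonnet 4 prices standing in for "claude-sonnet".

```python
# Hypothetical mapping from task type to tier; the tier split follows the text.
TIER_BY_TASK = {
    "classification": "simple",
    "extraction": "simple",
    "summarization": "medium",
    "qa": "medium",
    "reasoning": "complex",
    "analysis": "complex",
}

MODEL_BY_TIER = {
    "simple": "gpt-4o-mini",
    "medium": "gpt-4o",
    "complex": "claude-sonnet",
}

# USD per request for 3,500 input + 800 output tokens at the listed prices.
# Assumption: every tier has the same token profile, which real traffic will not.
COST_PER_REQUEST = {
    "gpt-4o-mini": 0.001005,
    "gpt-4o": 0.01675,
    "claude-sonnet": 0.0225,
}


def route(task_type: str) -> str:
    """Pick a model for a task; unknown task types fall back to the medium tier."""
    tier = TIER_BY_TASK.get(task_type, "medium")
    return MODEL_BY_TIER[tier]


def blended_cost(traffic_share: dict) -> float:
    """Average USD per request for a traffic mix given as {tier: share}."""
    return sum(
        COST_PER_REQUEST[MODEL_BY_TIER[tier]] * share
        for tier, share in traffic_share.items()
    )


if __name__ == "__main__":
    mix = {"simple": 0.6, "medium": 0.3, "complex": 0.1}
    print(route("classification"), f"{blended_cost(mix):.6f}")
```

Under these simplified assumptions the 60/30/10 mix averages about USD 0.0079 per request against USD 0.0168 for GPT-4o alone, roughly 53% less, which sits inside the 40-55% range quoted above.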

Which type of task is NOT suitable for a cheap model (GPT-4o-mini)?

Budget Alerts and Quotas: Protecting Against Overspending

August 2023. A developer accidentally launched an infinite loop calling the GPT-4 API. It was discovered 6 hours later, with USD 12,000 gone. The incident review found no budget alerts and no per-request quotas. OpenAI's own spending limits are a blunt kill switch, not a control tool. **One buggy prompt, one missing break statement - and a month's budget disappears overnight.**

**OpenAI Usage Limits are not a substitute for application-level budgeting.** They work as a crude kill switch but don't provide granular control: per-user quotas, per-endpoint limits, anomaly alerting. An application-level budget service is an essential component of any production system.
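An application-level budget check could be sketched like this. It is an in-memory illustration only: the class and method names are hypothetical, the daily limit is assumed to be per user and the monthly limit global, and a production version would keep the counters in a shared store such as Redis so that all application instances see the same totals.

```python
from collections import defaultdict
from datetime import date


class BudgetExceeded(Exception):
    """Raised when a request would push spending over a configured limit."""


class BudgetGuard:
    """Pre-flight budget check: per-request, per-user daily, global monthly."""

    def __init__(self, per_request_usd, daily_user_usd, monthly_total_usd):
        self.per_request_usd = per_request_usd
        self.daily_user_usd = daily_user_usd
        self.monthly_total_usd = monthly_total_usd
        self.daily_spend = defaultdict(float)    # (user_id, "YYYY-MM-DD") -> USD
        self.monthly_spend = defaultdict(float)  # "YYYY-MM" -> USD

    def check(self, user_id, estimated_cost_usd, today=None):
        """Call BEFORE the LLM request; raises if any limit would be exceeded."""
        today = today or date.today()
        day_key = (user_id, today.isoformat())
        month_key = today.strftime("%Y-%m")
        if estimated_cost_usd > self.per_request_usd:
            raise BudgetExceeded("per-request limit")
        if self.daily_spend[day_key] + estimated_cost_usd > self.daily_user_usd:
            raise BudgetExceeded("daily user limit")
        if self.monthly_spend[month_key] + estimated_cost_usd > self.monthly_total_usd:
            raise BudgetExceeded("monthly limit")

    def record(self, user_id, actual_cost_usd, today=None):
        """Call AFTER the response with the cost computed from actual usage."""
        today = today or date.today()
        self.daily_spend[(user_id, today.isoformat())] += actual_cost_usd
        self.monthly_spend[today.strftime("%Y-%m")] += actual_cost_usd
```

The estimate passed to `check` would typically come from the counted input tokens plus the `max_tokens` cap on output, so the check is pessimistic; `record` then stores what the request really cost.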

What level of budget checking should be performed BEFORE sending a request to the LLM API?

Cost Dashboard: Visualization and Unit Economics

Jasper AI - an LLM-powered copywriting service. They said it publicly: LLM costs were 60%+ of revenue before optimization. They survived. But to discover that number - one metric was needed: **cost per user vs ARPU**. If the subscription is USD 10 per month and LLM per user costs USD 6 - that is not a business. That is a subsidized service.

Langfuse and Helicone are observability platforms built specifically for LLMs. They track cost per request, break it down by model, endpoint, and user ID. Helicone ships spending alerts out of the box. Langfuse supports attaching a userId to every trace and computing unit economics automatically. For early stages - a Redis-based solution is enough to start.

P95 cost per query deserves special attention. Suppose the average cost is USD 0.005 and P95 is USD 0.04. That means 5% of requests cost at least 8 times the average. Who are they? One endpoint? One user segment? One prompt pattern? A dashboard broken down by endpoint + model + userId answers this in seconds.

| Metric | Formula | Healthy Range |
|---|---|---|
| Cost per query | Total LLM cost / Total queries | USD 0.001 - USD 0.05 |
| Cost per user / month | Total LLM cost / MAU | < 30% of ARPU |
| LLM cost ratio | LLM cost / Revenue | < 15-20% |
| Cache hit savings | Cached queries × avg cost | Should be growing |
| P95 query cost | 95th percentile cost | < 10x average |
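Several of the metrics in the table can be computed from a plain list of per-request cost records. The sketch below assumes a hypothetical record shape (`user_id`, `cost_usd`) covering one month, and uses the nearest-rank method for the 95th percentile; observability platforms may use a different percentile definition.

```python
from collections import defaultdict


def p95(values):
    """95th percentile by the nearest-rank method (integer math, no rounding drift)."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def unit_economics(requests, monthly_active_users, arpu_usd):
    """requests: list of dicts with 'user_id' and 'cost_usd' for one month."""
    costs = [r["cost_usd"] for r in requests]
    total = sum(costs)
    cost_per_user = total / monthly_active_users
    return {
        "cost_per_query": total / len(costs),
        "cost_per_user": cost_per_user,
        "cost_to_arpu": cost_per_user / arpu_usd,  # compare against the 30% line
        "p95_query_cost": p95(costs),
    }


def top_spenders(requests, n=3):
    """Users ranked by total spend, to see who sits behind an expensive P95."""
    per_user = defaultdict(float)
    for r in requests:
        per_user[r["user_id"]] += r["cost_usd"]
    return sorted(per_user.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Grouping the same records by endpoint or model instead of `user_id` gives the other breakdowns discussed in this section.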

Which metric is critical for assessing the viability of an AI product's business model?

Myth: AI cost is a fixed budget line item, like server hosting

Reality: LLM cost is a variable that depends on user behavior, prompt length, and model selection

Hosting costs the same regardless of what users do. LLM does not. One user with long requests can cost 100x more than the average. One endpoint with a bad prompt can double the monthly bill. LLM cost is a function of code, data, and behavior - and it changes with every deploy. That is exactly why per-user quotas, per-request budgets, and real-time observability via Langfuse or Helicone are not optional.

LLM Cost Management

  • Output tokens cost 4-5x as much as input tokens - first optimization: max_tokens and concise instructions in the prompt
  • Prompt compression gives 40-70% savings with little quality loss: compress system prompts, filter RAG chunks by relevanceScore, use compact few-shot examples
  • Model tiering: 60% of tasks are simple (gpt-4o-mini, USD 0.15 per 1M), 30% medium (gpt-4o), 10% complex (claude-sonnet). Total savings 40-55%
  • Three layers of protection BEFORE the request: per-request limit + daily limit + monthly limit. Post-hoc monitoring does not stop spending
  • Key business metric: LLM cost / ARPU per user. Above 20-30% - a signal for urgent optimization

What's Next

Cost management sets the budget framework. The next step is rate limiting, which protects against both overspending and exceeding API provider quotas.

  • Rate Limiting for AI APIs — Budget alerts limit money, rate limiting restricts the number of requests and tokens per unit of time
  • LLM Caching — Caching is the most effective way to reduce costs: zero cost on a cache hit
  • Observability — The cost dashboard is part of the observability pipeline. Real-time expense monitoring

Related Lessons

  • aie-05-api-integration — Cost tracking wraps every API call
  • aie-28-caching-optimization — Caching is the primary cost reduction lever
  • aie-30-rate-limiting-ai — Rate limits cap runaway spend
  • aie-35-observability — Per-request metrics expose cost drivers
  • ml-08-regularization — Penalize expensive paths to keep within budget
  • alg-20-greedy