AI Engineering

Cost Management: Counting Tokens, Optimizing Prompts, Choosing Models by Budget

Lesson Objectives

Calculate LLM request costs accounting for input/output tokens across different models
Apply prompt compression techniques to reduce input tokens by 40-70%
Implement model tiering - routing tasks to a model of appropriate power
Build a budget alert system with per-request, daily, and monthly limits
Create a cost dashboard with unit economics for informed business decisions

A startup launched an AI feature. One endpoint, one bug in a prompt - 10,000 tokens per request instead of 500. A month later: a USD 50K bill instead of the expected USD 5K. One user with unusual behavior was generating 80% of all spending. No per-user quotas, no spending alerts, no cost-per-request tracking. Cost management is not about saving pennies. It is the difference between a live product and a closed company.

GitHub Copilot routes autocomplete to gpt-4o-mini and chat to gpt-4o - roughly a 17x price difference at list prices, with comparable quality for each task type
Jasper AI (copywriting) said publicly: LLM costs = 60%+ of revenue before optimization. Survived by introducing model tiering and per-user quotas
OpenAI introduced prompt caching at a 50% discount - an acknowledgment that cost became the main barrier for enterprise adoption
Replit cut LLM costs by 30% through intelligent routing - no quality loss for 95% of users
Helicone and Langfuse - LLM observability platforms - grew 5x in 2024 on the back of this exact problem

Why Cost Management Became a Discipline

Cost management for LLMs grew alongside the API economy. When **OpenAI opened its API (2020)**, pricing was per token: every input and output token has a price, and usage scales directly with traffic. That billing model made spend unpredictable for teams used to flat server costs. As products moved from prototypes to real traffic, a single unbounded prompt or one heavy user could blow up a monthly bill. Token counting, per-user quotas, model selection by budget, and spend alerting emerged as a practical discipline rather than an afterthought.

Prerequisites

LLM API Integration: OpenAI, Anthropic, Open-Source Models

Anatomy of Cost: Input, Output, and Embedding Tokens

Latitude built AI-powered games. March 2024 - OpenAI invoice: USD 150,000. February was USD 2,000. A 75x spike in 60 days. No hackers. No DDoS. **The developers didn't know that output tokens cost 4 times more than input.** Long game narratives were generating 2,000-4,000 output tokens per request. No alert fired. The bill just arrived.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Input:output price ratio	Context window (tokens)
GPT-4o	USD 2.50	USD 10.00	1:4	128K
GPT-4o-mini	USD 0.15	USD 0.60	1:4	128K
Claude Sonnet 4	USD 3.00	USD 15.00	1:5	200K
Claude Haiku 3.5	USD 0.80	USD 4.00	1:5	200K
Gemini 1.5 Flash	USD 0.075	USD 0.30	1:4	1M
text-embedding-3-small	USD 0.02	-	-	8K

Output tokens drain budgets quietly. A RAG chatbot with 3,500 input and 800 output tokens on GPT-4o costs USD 0.017 per request. Sounds like nothing. 100,000 requests a day - that's USD 1,675 per day, USD 50,250 per month. Same traffic on gpt-4o-mini: USD 3,015 per month. The difference is 94%. Model selection is not a matter of taste.
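The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch: the prices are copied from the table, while the function names `request_cost` and `monthly_cost` and the 30-day month are assumptions made for illustration.

```python
# Prices in USD per 1M tokens, copied from the table above.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request for the given token counts."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def monthly_cost(model: str, input_tokens: int, output_tokens: int,
                 requests_per_day: int, days: int = 30) -> float:
    """Project one request's cost to a month of traffic (30 days assumed)."""
    return request_cost(model, input_tokens, output_tokens) * requests_per_day * days


if __name__ == "__main__":
    # The RAG chatbot from the text: 3,500 input and 800 output tokens.
    for model in ("gpt-4o", "gpt-4o-mini"):
        per_request = request_cost(model, 3_500, 800)
        per_month = monthly_cost(model, 3_500, 800, 100_000)
        print(f"{model}: USD {per_request:.6f} per request, "
              f"USD {per_month:,.0f} per month")
```

Keeping the price table in one place makes it easy to rerun the projection when a provider changes its pricing.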

Counting tokens before sending a request is not paranoia. It's engineering hygiene. `tiktoken` counts tokens locally and almost instantly - far cheaper than discovering a surprise at month end. Once integrated, Langfuse and Helicone track cost per request automatically and break down spending by endpoint, model, and user ID with very little extra code.
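A pre-flight token check might look like the sketch below. It uses `tiktoken.encoding_for_model` when the library is installed and otherwise falls back to a rough estimate of about 4 characters per token. The fallback ratio, the helper names (`load_encoder`, `count_tokens`, `check_prompt`), and the limit in the usage example are assumptions, not part of any library.

```python
def load_encoder(model: str = "gpt-4o"):
    """Return a tiktoken encoder for the model, or None if unavailable."""
    try:
        import tiktoken  # third-party: pip install tiktoken
        return tiktoken.encoding_for_model(model)
    except Exception:
        # tiktoken not installed, model name unknown to this version,
        # or the vocabulary file could not be downloaded.
        return None


def count_tokens(text: str, encoder=None) -> int:
    """Count tokens exactly with an encoder, else estimate from length."""
    if encoder is not None:
        return len(encoder.encode(text))
    # Assumption: roughly 4 characters per token for English text.
    return (len(text) + 3) // 4


def check_prompt(text: str, max_input_tokens: int, encoder=None) -> int:
    """Raise before sending if the prompt exceeds the input token limit."""
    n = count_tokens(text, encoder)
    if n > max_input_tokens:
        raise ValueError(f"prompt is {n} tokens, limit is {max_input_tokens}")
    return n


if __name__ == "__main__":
    encoder = load_encoder()  # may be None; the estimate is used then
    prompt = "Summarize the following support ticket in two sentences."
    print(check_prompt(prompt, max_input_tokens=500, encoder=encoder))
```

The estimate is only a safety net. For billing-accurate numbers, use the real encoder or the usage figures returned by the API.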

When using GPT-4o, which cost component typically dominates in applications that generate long responses?

Prompt Compression: Reducing Input Without Losing Quality

Microsoft studied typical prompts from production systems. Conclusion: **50-70% of tokens can often be removed with little or no change in the model's response.** This is not magic. It is how transformers work: they attend to the entire context, but many words in standard prompts just duplicate the meaning of others. LLMLingua automates exactly this. Manual compression of a system prompt takes 10 minutes - and saves money on every single request.

RAG systems suffer from a different disease: all chunks land in the context indiscriminately, including irrelevant ones. A low-relevance chunk does not help - it hurts. It adds noise, bloats the context, burns the budget. Filtering by `relevanceScore >= 0.7` and enforcing a hard token budget fixes most of it, usually with little effect on answer quality. Treat the threshold as a starting point and tune it against your own evaluation set.
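The filter-then-budget idea can be sketched as follows. The chunk shape (a dict with `text` and `relevanceScore`), the function name `select_chunks`, the default budget of 1,500 tokens, and the character-based token estimate are all assumptions for illustration. In a real pipeline, pass an exact token counter.

```python
def select_chunks(chunks, min_score=0.7, token_budget=1_500, count_tokens=None):
    """Keep the most relevant chunks that fit within a hard token budget.

    Returns (selected_chunks, tokens_used).
    """
    if count_tokens is None:
        # Assumption: roughly 4 characters per token.
        count_tokens = lambda text: (len(text) + 3) // 4

    # Step 1: drop chunks below the relevance threshold.
    relevant = [c for c in chunks if c["relevanceScore"] >= min_score]
    # Step 2: spend the budget on the most relevant chunks first.
    relevant.sort(key=lambda c: c["relevanceScore"], reverse=True)

    selected, used = [], 0
    for chunk in relevant:
        cost = count_tokens(chunk["text"])
        if used + cost > token_budget:
            continue  # does not fit; a smaller chunk later may still fit
        selected.append(chunk)
        used += cost
    return selected, used


if __name__ == "__main__":
    sample = [
        {"text": "Refund policy: 30 days from purchase.", "relevanceScore": 0.91},
        {"text": "Company history since 1998.", "relevanceScore": 0.35},
        {"text": "Refunds go back to the original card.", "relevanceScore": 0.78},
    ]
    kept, tokens = select_chunks(sample, token_budget=50)
    print(len(kept), tokens)
```

Skipping an oversized chunk instead of stopping at it is a design choice: it fills the budget more completely, at the cost of sometimes including a lower-ranked chunk.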

Technique	Token Savings	Quality Loss Risk	Implementation Complexity
System prompt compression	40-70%	Low	Low
RAG chunk filtering	30-60%	Medium	Medium
Compact few-shot	50-70%	Low	Low
LLMLingua (auto-compression)	50-70%	Medium	High
Limiting max_tokens output	Varies	Depends on task	Low

Which prompt compression technique delivers the greatest savings with the least risk of quality loss?

Model Tiering: Choosing the Right Model for the Task

GitHub Copilot serves millions of requests a day. Autocomplete goes to gpt-4o-mini. Chat with code explanation goes to gpt-4o. Complex refactoring goes to claude-sonnet. Three models, three price points, one product. This is not exotic multi-model architecture - it is standard practice. **Using GPT-4o to classify a support ticket into three categories is like driving a semi-truck to buy groceries.**

Real-world model tiering numbers from a typical B2B SaaS with AI features: **60% of requests are simple** (classification, extraction), **30% are medium** (summarization, QA), **10% are complex** (reasoning, analysis). Routing simple tasks to gpt-4o-mini cuts the total bill by 40-55%. At 100K requests per day, that is the difference between roughly USD 1,600 and USD 900 per day on inference alone.
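A static router is often enough to start with. The sketch below maps task types to tiers and estimates the blended cost for the 60/30/10 mix. The task labels, the `route` and `blended_cost` helpers, the `claude-sonnet` key (priced as Claude Sonnet 4 from the pricing table), and the assumption that every request uses 3,500 input and 800 output tokens are illustrative only. Real traffic will differ per tier.

```python
# USD per 1M tokens, taken from the pricing table earlier in the lesson.
PRICES = {
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

# Hypothetical task labels mapped to a model tier.
TIER_BY_TASK = {
    "classification": "gpt-4o-mini",
    "extraction": "gpt-4o-mini",
    "summarization": "gpt-4o",
    "qa": "gpt-4o",
    "reasoning": "claude-sonnet",
    "analysis": "claude-sonnet",
}
DEFAULT_MODEL = "gpt-4o"  # unknown task types fall back to the middle tier


def route(task_type: str) -> str:
    """Pick a model for a task type."""
    return TIER_BY_TASK.get(task_type, DEFAULT_MODEL)


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


def blended_cost(mix: dict, input_tokens: int, output_tokens: int) -> float:
    """Average cost per request for a traffic mix {model: share}."""
    return sum(share * request_cost(model, input_tokens, output_tokens)
               for model, share in mix.items())


if __name__ == "__main__":
    mix = {"gpt-4o-mini": 0.6, "gpt-4o": 0.3, "claude-sonnet": 0.1}
    baseline = request_cost("gpt-4o", 3_500, 800)
    tiered = blended_cost(mix, 3_500, 800)
    print(f"savings vs all-gpt-4o: {1 - tiered / baseline:.0%}")
```

Under these assumptions the blended cost comes out about 53% below the all-gpt-4o baseline, which sits inside the 40-55% range quoted above.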

Which type of task is NOT suitable for a cheap model (GPT-4o-mini)?

Budget Alerts and Quotas: Protecting Against Overspending

August 2023. A developer accidentally launched an infinite loop calling the GPT-4 API. It was discovered 6 hours later - USD 12,000 gone. Incident review: no budget alerts, no per-request quotas. OpenAI's own spending limits are a blunt kill switch, not a control tool. **One buggy prompt, one missing break statement - and a month's budget disappears overnight.**

**OpenAI Usage Limits are not a substitute for application-level budgeting.** They work as a crude kill switch but don't provide granular control: per-user quotas, per-endpoint limits, anomaly alerting. An application-level budget service is an essential component of any production system.
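An application-level budget check can be sketched as a small guard that runs before each LLM call. The version below keeps counters in memory to stay self-contained. The class name `BudgetGuard`, its limits, and the split into `check` and `record` are assumptions, and a production service would keep the counters in a shared store such as Redis so that all workers see the same totals.

```python
from collections import defaultdict
from datetime import date


class BudgetExceeded(Exception):
    """Raised when a request would break a spending limit."""


class BudgetGuard:
    """Per-user spending limits checked BEFORE the LLM call is made."""

    def __init__(self, per_request_usd: float, daily_usd: float, monthly_usd: float):
        self.per_request_usd = per_request_usd
        self.daily_usd = daily_usd
        self.monthly_usd = monthly_usd
        self._daily = defaultdict(float)    # (user_id, "YYYY-MM-DD") -> USD
        self._monthly = defaultdict(float)  # (user_id, "YYYY-MM") -> USD

    @staticmethod
    def _keys(user_id: str, today: date):
        return (user_id, today.isoformat()), (user_id, today.strftime("%Y-%m"))

    def check(self, user_id: str, estimated_cost_usd: float, today: date = None) -> None:
        """Raise BudgetExceeded if the estimated cost breaks any limit."""
        day_key, month_key = self._keys(user_id, today or date.today())
        if estimated_cost_usd > self.per_request_usd:
            raise BudgetExceeded("per-request limit")
        if self._daily.get(day_key, 0.0) + estimated_cost_usd > self.daily_usd:
            raise BudgetExceeded("daily limit")
        if self._monthly.get(month_key, 0.0) + estimated_cost_usd > self.monthly_usd:
            raise BudgetExceeded("monthly limit")

    def record(self, user_id: str, actual_cost_usd: float, today: date = None) -> None:
        """Add the actual cost after the response arrives."""
        day_key, month_key = self._keys(user_id, today or date.today())
        self._daily[day_key] += actual_cost_usd
        self._monthly[month_key] += actual_cost_usd


if __name__ == "__main__":
    guard = BudgetGuard(per_request_usd=0.05, daily_usd=1.00, monthly_usd=20.00)
    guard.check("user-42", estimated_cost_usd=0.02)  # passes
    guard.record("user-42", actual_cost_usd=0.018)
```

Call `check` with an estimate built from the counted input tokens plus `max_tokens` priced at the output rate, then call `record` with the usage the API reports. Note that this in-memory sketch is not safe across processes.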

What level of budget checking should be performed BEFORE sending a request to the LLM API?

Cost Dashboard: Visualization and Unit Economics

Jasper AI - an LLM-powered copywriting service. They said it publicly: LLM costs were 60%+ of revenue before optimization. They survived. But finding that number takes one metric: **cost per user vs ARPU**. If the subscription is USD 10 per month and the LLM costs USD 6 per user, that is not a business. That is a subsidized service.

Langfuse and Helicone are observability platforms built specifically for LLMs. They track cost per request, break it down by model, endpoint, and user ID. Helicone ships spending alerts out of the box. Langfuse supports attaching a userId to every trace and computing unit economics automatically. For early stages - a Redis-based solution is enough to start.

P95 cost per query is a special metric. Say the average cost is USD 0.005 and P95 is USD 0.04. That means the most expensive 5% of requests each cost at least 8 times the average. Who are they? One endpoint? One user segment? One prompt pattern? A dashboard broken down by endpoint + model + userId answers this in seconds.

Metric	Formula	Healthy Range
Cost per query	Total LLM cost / Total queries	USD 0.001 - USD 0.05
Cost per user / month	Total LLM cost / MAU	< 30% of ARPU
LLM cost ratio	LLM cost / Revenue	< 15-20%
Cache hit savings	Cached queries × avg cost	Should be growing
P95 query cost	95th percentile cost	< 10x average
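The metrics in the table can be computed from a plain list of per-request cost records. This is a minimal sketch: the record shape (`user_id`, `cost_usd`), the function names, and the nearest-rank definition of P95 are assumptions. An observability platform may define its percentiles differently.

```python
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def cost_metrics(records, revenue_usd: float, monthly_active_users: int) -> dict:
    """Compute dashboard metrics from per-request cost records.

    Each record is a dict with hypothetical keys "user_id" and "cost_usd".
    """
    costs = [r["cost_usd"] for r in records]
    total = sum(costs)

    by_user = defaultdict(float)
    for r in records:
        by_user[r["user_id"]] += r["cost_usd"]

    return {
        "cost_per_query": total / len(costs),
        "p95_query_cost": p95(costs),
        "cost_per_user": total / monthly_active_users,
        "llm_cost_ratio": total / revenue_usd,
        "top_user": max(by_user, key=by_user.get),  # biggest spender
    }


if __name__ == "__main__":
    # 18 cheap requests and 2 expensive ones from a single user.
    records = [{"user_id": f"u{i}", "cost_usd": 0.003} for i in range(18)]
    records += [{"user_id": "heavy", "cost_usd": 0.04}] * 2
    for name, value in cost_metrics(records, revenue_usd=10.0,
                                    monthly_active_users=19).items():
        print(name, value)
```

Grouping the same records by endpoint or model instead of `user_id` gives the other breakdowns mentioned above.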

Which metric is critical for assessing the viability of an AI product's business model?

Myth: AI cost is a fixed budget line item, like server hosting

Reality: LLM cost is a variable that depends on user behavior, prompt length, and model selection

Hosting costs the same regardless of what users do. LLM does not. One user with long requests can cost 100x more than the average. One endpoint with a bad prompt can double the monthly bill. LLM cost is a function of code, data, and behavior - and it changes with every deploy. That is exactly why per-user quotas, per-request budgets, and real-time observability via Langfuse or Helicone are not optional.

LLM Cost Management

Output tokens cost 4-5x more than input - first optimization: max_tokens and concise instructions in the prompt
Prompt compression gives 40-70% savings with little or no quality loss: compress system prompts, filter RAG chunks by relevanceScore, use compact few-shot examples
Model tiering: 60% of tasks are simple (gpt-4o-mini, USD 0.15 per 1M input tokens), 30% medium (gpt-4o), 10% complex (claude-sonnet). Total savings 40-55%
Three layers of protection BEFORE the request: per-request limit + daily limit + monthly limit. Post-hoc monitoring does not stop spending
Key business metric: LLM cost per user as a share of ARPU. Above 20-30% - a signal for urgent optimization

What's Next

Cost management sets the budget framework. The next step is rate limiting, which protects against both overspending and exceeding API provider quotas.

Rate Limiting for AI APIs — Budget alerts limit money, rate limiting restricts the number of requests and tokens per unit of time
LLM Caching — Caching is the most effective way to reduce costs: zero cost on a cache hit
Observability — The cost dashboard is part of the observability pipeline. Real-time expense monitoring

Related Lessons

aie-05-api-integration — Cost tracking wraps every API call
aie-28-caching-optimization — Caching is the primary cost reduction lever
aie-30-rate-limiting-ai — Rate limits cap runaway spend
aie-35-observability — Per-request metrics expose cost drivers
ml-08-regularization — Penalize expensive paths to keep within budget
alg-20-greedy