AI Engineering
Tokens and Context Window: Why LLMs Forget and How to Handle It
Цели урока
- Understand the BPE algorithm and why different languages require different numbers of tokens
- Count tokens using tiktoken and estimate request costs
- Understand context window constraints and the 'Lost in the Middle' problem
- Master strategies for long context: truncation, sliding window, summarization, RAG
- Plan a token budget for production systems
Cursor IDE - an AI-assisted IDE for USD 10/month. With the naive approach (sending the entire project with each request) the cost basis would be USD 100+/month. Cursor solved token management: sliding window, intelligent context selection, aggressive caching. The difference between a profitable and an unprofitable product is understanding tokens.
- Stripe optimized prompts and cut API costs by 40% - just by removing redundant tokens
- Notion AI handles knowledge bases of millions of words via RAG, not a giant context
- Cursor IDE uses sliding window + summarization to maintain context in large projects - otherwise the product would be unprofitable at USD 10/month
- Non-English prompts are 30-50% more expensive due to BPE characteristics - not a typo, just math
Historical context
In 2016 Rico Sennrich at the University of Edinburgh adapted the BPE algorithm (originally proposed in 1994 for data compression) to neural machine translation. The paper 'Neural Machine Translation of Rare Words with Subword Units' allowed translation systems to handle rare and unknown words - replacing the '<UNK>' token. GPT, BERT, Claude, LLaMA - all use BPE or its variants. A 1994 data-compression algorithm became the foundation of LLM tokenization.
Предварительные знания
BPE: how the model splits text into tokens
**The difference between USD 50/month and USD 5,000/month on the same product** is understanding exactly how text becomes numbers. 1 token is roughly 4 characters or 0.75 words. GPT-4o context window is 128K tokens - that is ~96K words, the entire Lord of the Rings fits with room to spare. But it costs USD 0.32 if filled to the brim. Stripe reportedly cut AI costs by 40% simply by starting to analyze usage.
LLMs do not read characters or words. They read **tokens** - fragments produced by **Byte Pair Encoding (BPE)**. This algorithm powers the tokenizers behind GPT, Claude, and most modern LLMs. It determines the cost of every request.
| Tokenizer | Model | Vocabulary size |
|---|---|---|
| cl100k_base | GPT-4, GPT-4o | ~100,000 |
| o200k_base | GPT-4o-mini | ~200,000 |
| Claude tokenizer | Claude 3/3.5/4 | ~100,000 |
| SentencePiece | LLaMA, Mistral | 32,000 - 128,000 |
Practical consequence of BPE: **different languages have different token density**. English, on which most models were trained, tokenizes compactly. Russian, Chinese, Arabic require more tokens for the same information volume. Non-English prompts cost more - not a bug, just math.
Why does 'unhappiness' split into 3 tokens while 'cat' is just 1?
Token counting: tiktoken and practical tools
KV-cache is why context costs money. Every token in history is a matrix stored in GPU memory for the entire generation duration. Sending 100K tokens = occupying 8-16 GB VRAM just for the conversation's working memory. Input tokens are not free, and counting them is an engineering obligation, not an option.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Approximate cost per request |
|---|---|---|---|
| GPT-4o | USD 2.50 | USD 10.00 | USD 0.01 - USD 0.05 |
| GPT-4o-mini | USD 0.15 | USD 0.60 | USD 0.0005 - USD 0.003 |
| Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | USD 0.01 - USD 0.08 |
| Claude 3.5 Haiku | USD 0.80 | USD 4.00 | USD 0.003 - USD 0.02 |
**Output tokens are always more expensive than input tokens** (3-5x). So `max_tokens` in the API is not just about response length - it is a budget lever. A request asking for a 2,000-word essay costs 10x more than one asking for a single word.
An API request to GPT-4o used 500 prompt_tokens and 1,000 completion_tokens. What share of cost comes from output?
Context Window: the memory boundary of LLMs
Context window is the **maximum number of tokens in one call**. Everything counts: system prompt, conversation history, user message, **and** the model's response. Not input and output tracked separately - all together, in one bucket.
| Model | Context Window | Approx. pages | Approx. lines of code |
|---|---|---|---|
| GPT-3.5 | 4K / 16K | 6 - 24 | 200 - 800 |
| GPT-4o | 128K | 190 | 6,400 |
| Claude 3.5 Sonnet | 200K | 300 | 10,000 |
| Gemini 1.5 Pro | 1M / 2M | 1,500 - 3,000 | 50,000 - 100,000 |
128K tokens sounds enormous. In practice the context window fills faster than expected - especially when users start pasting files and code. One 500-line file is roughly 3,000 tokens - 2.3% of GPT-4o's context.
**'Lost in the Middle' problem.** Research (Liu et al., 2023) shows LLMs perform worse on information from the **middle** of a long context. Beginning and end are processed better. Simply increasing the context window is not a silver bullet. Placing critical information strategically matters more.
Another trap: **cost grows linearly** with context size. Sending 100K tokens to GPT-4o costs USD 0.25 per request. If a user asks 50 questions per hour - USD 12.50/hour for a single user. Cursor IDE and Copilot solve this through aggressive token management.
Larger context window = smarter model and better answers
Context window is working memory capacity, not an intelligence metric. Gemini with 2M tokens is not smarter than GPT-4o with 128K - it just holds more text at once
The 'Lost in the Middle' problem proves the opposite: overly long context degrades quality for information in the middle. A smart RAG architecture with 8K tokens often beats a naive 'dump everything' approach with 200K.
System prompt occupies 1,000 tokens, max_tokens = 4,096, context window = 128K. The user sends a document of 130,000 tokens. What happens?
Strategies for long context
Sooner or later data stops fitting. Netflix processes descriptions for thousands of movies. Stripe analyzes long transaction chains. Notion AI works with knowledge bases containing millions of words. None of them 'dumps everything into context'. It is expensive, slow, and performs worse than it looks.
**Production typically combines approaches.** For example: sliding window for chat history + RAG for the knowledge base + summarization for long conversations. Cursor IDE does exactly this - which is how it remains profitable at USD 10/month.
A customer support chatbot uses a knowledge base of 10,000 articles. Which context strategy is the best fit?
Token Budgeting: controlling costs in production
Token budgeting is the practice of allocating the context window across request components. Without a clear budget the system consumes itself: the system prompt grew to 5,000 tokens, RAG context took another 20,000, 200 tokens remain for the response - the model generates truncated nonsense. Not because it is bad. The space just ran out.
Monitoring token usage is a mandatory part of any production system. Without it, controlling costs and forecasting scaling is impossible. Stripe reportedly cut AI spending by 40% simply by starting to log and analyze usage - without changing a single prompt.
**Three key metrics to monitor:** 1. Average cost per request by model and endpoint 2. Percentage of requests hitting the token limit 3. Input-to-output token ratio. Anomalies in these metrics are the first signal of trouble.
In a production RAG chatbot, conversation history grows with each message. Which context management strategy is most correct?
Key concepts
- BPE (Byte Pair Encoding) iteratively merges frequent character pairs. Frequent words = 1 token, rare ones split. tiktoken reproduces this exactly
- Non-English text is 30-50% more expensive - KV-cache stores every token in GPU memory, making this real money
- Context window = system prompt + history + query + response. All together. Exceeding it = API error, not 'trimming'
- Output tokens are 3-5x more expensive than input tokens. max_tokens is a budget lever, not just a length control
- 'Lost in the Middle' - more context does not mean better results. Information in the middle is processed worse
- 4 strategies: truncation (simple), sliding window (smarter), summarization (costly), RAG (scalable). Production uses combinations
Вопросы для размышления
- Cursor IDE sells for USD 10/month. Without solving the token problem, the cost basis per user would be USD 100+. How exactly does token budgeting and RAG change this math? What would happen to Cursor's business model without these techniques?
What's Next
Tokens and the context window are the base for every topic that follows. The next step is applying this knowledge in real API calls.
- LLM API Integration — Applying token budgeting in real OpenAI/Anthropic API calls
- RAG: Retrieval-Augmented Generation — The main strategy for data that doesn't fit in the context window
Связанные уроки
- aie-03-llm-fundamentals — Transformer architecture and KV-cache - the physical reason context costs memory
- aie-05-api-integration — Applying token budgeting in real OpenAI/Anthropic API calls
- aie-12-rag-fundamentals — RAG - the primary strategy for data that exceeds the context window
- aie-15-conversation-memory — Conversation memory management - practical sliding window and summarization
- nlp-01 — NLP tokenization - BPE roots in classical text processing
- prob-25-info-theory