AI Engineering
How LLMs Work: Tokens, Embeddings, Attention
Lesson Objectives
- Understand how LLMs generate text (autoregressive generation) and why this explains hallucinations
- Learn tokenization (BPE, tiktoken) and why it directly affects cost
- Grasp embeddings: how meaning becomes a vector and why king - man + woman = queen
- Understand attention: Query-Key-Value, multi-head, KV-cache, FlashAttention
- Learn to choose temperature and sampling strategy for different tasks
Prerequisites
- The AI industry map: models, providers, and their differences
- Understanding the AI Backend Engineer role
2017. Google Brain. Eight authors publish a 15-page paper introducing the Transformer - the architecture where attention alone replaces recurrence. Three years later GPT-3 arrives with 175 billion parameters, and few-shot learning emerges on its own - nobody programmed it. One rule: predict the next token from 50,257 options. That mechanism now writes code in every editor on the planet. Understanding it from the inside is the difference between treating AI as a black box and actually engineering with it.
- **GitHub Copilot** generates one token at a time - autoregressive in real time, directly in the editor
- A 500-word Russian prompt is roughly 1,500 tokens; the same text in English is roughly 600. Price difference: 2.5x
- **RAG at Perplexity**: documents as embeddings in Qdrant, cosine similarity at every query across 100M+ users
- AI Engineer interview, question one across every company: explain the attention mechanism
The Transformer Paper
2017. Google Brain. Eight authors - Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin. One paper introducing the Transformer architecture, with the famous claim that attention alone is enough. Before it: RNN and LSTM processed text sequentially, word by word. Long-range dependencies faded with distance. The Transformer scrapped the whole paradigm: all tokens processed in parallel via self-attention, which made training far faster on parallel hardware. One year later - BERT. Another year - GPT-2. 2020: Brown et al. published GPT-3 with 175B parameters - and few-shot learning emerged on its own, without any explicit training for it. All eight original authors eventually left Google, most of them to found startups: Aidan Gomez built Cohere, Noam Shazeer founded Character.AI - and in 2024 Google brought him back through a licensing deal reported at 2.7 billion dollars.
LLM - A Next-Token Prediction Machine
An LLM does not understand language. It distributes probability mass across 50,257 possible next tokens. Every time. Every step. And it works.
**Large Language Model** - a neural network trained on terabytes of internet text. One job: **predict the next token**. Not understand. Not recall. Predict - off statistical patterns seen billions of times during training.
Take a prompt such as "The capital of France is". The model assigns the highest probability to "Paris". "Paris" gets appended to the context - and the loop kicks off again. Next token. Another full forward pass through every layer of the network. This is **autoregressive generation**: each step builds on the previous one, token by token.
**Hallucinations are not a bug.** They fall straight out of how the model works. It generates a plausible continuation - not a true one. If training data contained "Einstein discovered relativity in 1905" thousands of times, the model reproduces that pattern. Whether it fits the current context never enters the equation.
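The autoregressive loop described above can be sketched in a few lines. This is a toy under stated assumptions: `TOY_MODEL` is a hypothetical hand-written probability table standing in for a trained network, and it looks only at the last token, whereas a real LLM conditions on the whole context.

```python
import random

# Hypothetical next-token table standing in for a trained network.
TOY_MODEL = {
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"France": 1.0},
    "France": {"is": 1.0},
    "is": {"Paris": 0.9, "Lyon": 0.1},
    "Paris": {"<eos>": 1.0},
    "Lyon": {"<eos>": 1.0},
}

def next_token_distribution(context):
    """Toy 'forward pass': returns probabilities for the next token."""
    return TOY_MODEL[context[-1]]

def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        if greedy:
            token = max(dist, key=dist.get)  # most probable token
        else:
            token = rng.choices(list(dist), weights=list(dist.values()))[0]
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes part of the next input
    return context

print(generate(["The", "capital", "of", "France", "is"]))
```

Nothing in the loop checks whether "Paris" is true. The table says it is likely, so it is emitted. With `greedy=False` the same loop will sometimes emit "Lyon", which is the toy version of a plausible but wrong continuation.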
Tokenization: How Text Becomes Numbers
Neural networks do not chew on letters. They need numbers. Between raw text and the network sits a **tokenizer** - and it sets the dollar amount on every request.
A token is not a word, not a character. It is a **chunk of text** optimized by frequency. GPT-2 and GPT-3 use a **vocabulary of 50,257 tokens**; GPT-4 uses the `cl100k_base` tokenizer with roughly 100,000. Both are built by BPE (Byte Pair Encoding): an algorithm that repeatedly merges the most frequent pair of adjacent symbols until the target vocabulary size is hit. Common words become single tokens. Rare words shatter into fragments.
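The merge procedure can be illustrated with a minimal character-level sketch. It is not the tiktoken implementation: production tokenizers work on bytes and apply a pre-tokenization step first. The function names and the tiny corpus here are hypothetical.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, words = train_bpe("low low low low lower lowest newest newest", 2)
print(merges)
print(words)
```

After two merges the frequent word "low" is a single symbol, while the rarer "newest" is still split into characters. That is the common-versus-rare behavior described above, in miniature.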
**Rule of thumb:** 1 token is roughly 4 characters in English, roughly 1-2 characters in non-Latin scripts (Russian, Chinese, Arabic). Non-English prompts cost more - more tokens for the same amount of text.
GPT-4o costs USD 2.50 per million input tokens. Sounds tiny - until the math lands: 100,000 requests per month at 500 input tokens each = 50M tokens = USD 125/month on input alone. Bloated prompts with redundant instructions double that fast.
| Model | Input (per 1M tokens) | Output (per 1M) | Context |
|---|---|---|---|
| GPT-4o | USD 2.50 | USD 10.00 | 128K |
| GPT-4o-mini | USD 0.15 | USD 0.60 | 128K |
| Claude 3.5 Sonnet | USD 3.00 | USD 15.00 | 200K |
| Llama 3 70B (self-hosted) | ~USD 0.50 | ~USD 1.00 | 8K |
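The arithmetic above is easy to wrap in a small estimator. The prices are copied from the table as a snapshot and will drift over time; the function name and model keys are hypothetical, not any provider's API.

```python
# (input, output) prices in USD per 1M tokens, copied from the table above.
PRICES_USD_PER_1M = {
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def monthly_cost(model, requests, input_tokens, output_tokens=0):
    """Estimated monthly cost in USD for `requests` calls of the given size."""
    input_price, output_price = PRICES_USD_PER_1M[model]
    per_request = input_tokens * input_price + output_tokens * output_price
    return requests * per_request / 1_000_000

print(monthly_cost("gpt-4o", 100_000, 500))        # input only
print(monthly_cost("gpt-4o", 100_000, 500, 200))   # with 200 output tokens per request
print(monthly_cost("gpt-4o-mini", 100_000, 500, 200))
```

Output tokens are priced several times higher than input tokens, so a response of 200 tokens can cost more than a prompt of 500.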
Embeddings: Meaning as a Vector
A token ID like 15339 means nothing on its own. The network needs meaning. The embedding layer delivers it: a lookup table mapping each token in the vocabulary (50,257 entries in GPT-2 and GPT-3) to a dense vector of floating-point numbers.
GPT-2: 768 dimensions per token. GPT-3: 12,288. Modern models: thousands. This table trains alongside the rest of the network - gradients flow backward through it, and each token's vector nudges with every use. Billions of micro-adjustments later, language structure ends up encoded as geometry.
The king - man + woman = queen result is not a trick - it is geometry at scale. The same mathematics runs inside Spotify's recommendation engine and Netflix's content system: tracks and movies represented as vectors, similarity measured by cosine similarity. One abstraction, many domains.
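The vector arithmetic behind king - man + woman can be shown with hand-made vectors. Everything in `TOY_EMBEDDINGS` is an assumption chosen so the analogy works cleanly; learned embeddings have hundreds or thousands of dimensions and the result is only approximate.

```python
import math

# Hypothetical 3-d vectors; the axes loosely mean (royalty, male, female).
TOY_EMBEDDINGS = {
    "king":   [0.9, 0.9, 0.1],
    "queen":  [0.9, 0.1, 0.9],
    "man":    [0.1, 0.9, 0.1],
    "woman":  [0.1, 0.1, 0.9],
    "prince": [0.8, 0.9, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def analogy(a, b, c, embeddings=TOY_EMBEDDINGS):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```

Subtracting "man" removes the male component, adding "woman" adds the female one, and the royalty component is left untouched. The nearest remaining vector is "queen".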
**Embeddings are the foundation of RAG.** In Retrieval-Augmented Generation, documents are converted to vectors via `text-embedding-3-small` (1536 dimensions) and stored in a vector database (Qdrant, pgvector, Pinecone). Queries are embedded the same way and matched by cosine similarity. The model gets the retrieved context and generates from it. More on this in the RAG lesson.
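The retrieval step can be sketched without a vector database. The vectors below are hypothetical 3-d stand-ins; a real pipeline would obtain 1536-dimensional vectors from an embedding model and delegate the search to Qdrant, pgvector, or Pinecone. The sketch pre-normalizes vectors so that cosine similarity reduces to a plain dot product.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Hypothetical document vectors, normally produced by an embedding model.
DOCS = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "api rate limits": [0.0, 0.1, 0.9],
}
INDEX = {title: normalize(vec) for title, vec in DOCS.items()}

def search(query_vector, top_k=2):
    """Return the titles of the top_k most similar documents."""
    q = normalize(query_vector)
    scored = [(sum(a * b for a, b in zip(q, vec)), title) for title, vec in INDEX.items()]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]

# Hypothetical embedding of the query "how do I get my money back".
print(search([0.8, 0.2, 0.1]))
```

The titles returned by `search` are what would be pasted into the prompt as retrieved context before generation.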
Attention: How the Model Understands Context
Embeddings encode tokens in isolation. But "river bank" and "bank account" share the same word with two different meanings. Static embeddings cannot tell them apart. Nobody had a clean fix - until the attention mechanism arrived.
**Self-Attention** lets each token look at every other token and update its representation based on context. Mechanically: each token gets three vectors - Query, Key, Value - produced by learned projections of its embedding. Q dot-products with all Ks; the scores are divided by the square root of the key dimension and passed through softmax to produce attention weights. The weighted sum over Vs yields the updated representation. That is one attention "head"; each layer runs many heads in parallel.
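A single attention head fits in a short pure-Python sketch of scaled dot-product attention. Two simplifications are assumed: Q, K and V are passed in directly rather than computed by learned projection matrices, and the causal mask used by decoder models is omitted.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # One score per key: how relevant is that token to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors gives the updated representation.
        out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

keys = [[1.0, 0.0], [1.0, 0.0]]      # identical keys: weights come out uniform
values = [[0.0, 0.0], [2.0, 4.0]]
print(attention([[1.0, 1.0]], keys, values))  # [[1.0, 2.0]], the mean of the values
```

When one key matches the query much better than the others, its weight approaches 1 and the output is nearly a copy of that token's value vector.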
GPT-3 runs **multi-head attention**: 96 parallel heads per layer, across 96 layers (OpenAI has not published GPT-4's architecture). Heads tend to specialize - some track syntax, others coreference or factual associations. At inference time the **KV-cache** stores Key and Value vectors for already-processed tokens - no recomputation needed. The cache grows linearly with context length, which is why long contexts burn through GPU memory fast.
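A back-of-the-envelope estimate shows how fast the KV-cache grows. The sketch assumes standard multi-head attention (no grouped-query or multi-query sharing), fp16 values, and GPT-3-scale dimensions: 96 layers, model width 12,288, a 2,048-token context. The function name is hypothetical.

```python
def kv_cache_bytes(n_layers, d_model, seq_len, bytes_per_value=2, batch_size=1):
    """Keys and values (factor 2) stored for every layer and every cached token."""
    return 2 * n_layers * d_model * seq_len * bytes_per_value * batch_size

per_token = kv_cache_bytes(n_layers=96, d_model=12288, seq_len=1)
full_context = kv_cache_bytes(n_layers=96, d_model=12288, seq_len=2048)

print(per_token / 2**20)     # 4.5 MiB per token
print(full_context / 2**30)  # 9.0 GiB for one 2,048-token sequence
```

The cost is per sequence, so serving many concurrent requests multiplies it by the batch size. This is one reason newer architectures share keys and values across heads.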
**O(n²) in sequence length.** Attention pairs every token with every other. A 128K token context = roughly 16 billion token pairs. FlashAttention (used in vLLM and most inference servers) softens the blow via tiling - it computes attention in blocks without materializing the full n x n matrix, which cuts memory from quadratic to linear in sequence length. But the quadratic compute is fundamental to exact attention - tiling does not remove it.
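The tiling idea rests on an online softmax: keys and values are processed block by block while a running maximum and running sum are kept, so the full row of scores never has to exist at once. The sketch below shows that idea for a single query in plain Python. It is not the FlashAttention kernel, which applies the same recurrence inside fast on-chip GPU memory; function names and `block_size` are illustrative.

```python
import math

def attention_full(q, keys, values):
    """Reference: compute all scores first, then softmax and weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z for j in range(len(values[0]))]

def attention_tiled(q, keys, values, block_size=2):
    """Same result, but only one block of scores is held at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max, running_sum = -math.inf, 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_blk = keys[start:start + block_size]
        v_blk = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_blk]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        correction = math.exp(running_max - new_max)
        exps = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(exps)
        acc = [a * correction + sum(e * v[j] for e, v in zip(exps, v_blk))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(attention_full(q, keys, values))
print(attention_tiled(q, keys, values, block_size=2))
```

Both functions perform the same number of multiplications. Only the peak memory differs, which matches the point that tiling addresses memory rather than compute.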
What is the attention mechanism in Transformers for?
Temperature and Sampling: Controlling Output
After all the layers, the network spits out a vector of 50,257 raw scores - logits. Softmax converts them into probabilities. One question remains: how does a token get picked? Grab the highest-probability one (greedy decoding)? Or sample?
Greedy decoding is deterministic but brittle - models get stuck in repetition loops. **Temperature** addresses this by dividing the logits by T before softmax. Lower temperature: distribution sharpens (one option dominates). Higher: distribution flattens (all options compete). Temperature=1.0 leaves the model's original distribution unchanged - the one it was trained with.
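The effect of temperature is easiest to see on a small logit vector. The sketch below divides the logits by the temperature before softmax; the function name and the three example logits are illustrative, and temperature 0 is rejected here because it is conventionally handled as greedy argmax rather than a division.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At 0.5 the top token takes most of the probability mass; at 2.0 the three options move closer together. The ranking of tokens never changes, only how much the leader dominates.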
**top_p (nucleus sampling)** takes a different angle: keep the smallest set of tokens whose cumulative probability >= p, sample only from those. At `top_p=0.9`, rare outlier tokens get cut off. **top_k** is simpler: keep the k most probable tokens. Production rule: do not stack temperature and top_p together - pick one and tune it.
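Both filters can be sketched as functions that take a probability list and return the surviving token indices with renormalized probabilities. This is an illustration of the two rules as described above, not any inference library's implementation; the names and the four-token distribution are made up.

```python
def top_k_filter(probs, k):
    """Keep the k most probable token indices and renormalize."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    total = sum(probs[i] for i in keep)
    return {i: probs[i] / total for i in keep}

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # tokens 0 and 1 survive
print(top_p_filter(probs, 0.9))  # tokens 0, 1 and 2 survive; the 0.05 tail is cut
```

The practical difference: top_k always keeps the same number of tokens, while top_p keeps few tokens when the model is confident and many when the distribution is flat.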
| Task | Temperature | Why |
|---|---|---|
| JSON extraction | 0 | Need an exact, reproducible result |
| Classification | 0 | One correct answer |
| Chatbot / assistant | 0.5-0.7 | Varied but sensible |
| Creative writing | 0.8-1.2 | Need variety and unexpected twists |
| Brainstorming | 1.0-1.5 | Maximum diversity of ideas |
**For production:** start with temperature=0 for all tasks. Increase only if responses are too repetitive. Speculative decoding (supported in vLLM) can speed up inference, with reported gains of up to about 3x: a small draft model proposes several tokens, and the large model verifies them in one batch. It works with both greedy decoding and sampling.
You are building an API that extracts structured data (name, email, phone) from text. What temperature should you choose?
**Myth:** LLMs "know" answers - they have a knowledge base.
**Reality:** LLMs reproduce statistical patterns from training data. There is no database - only neural network weights.
This is the difference between search and generation. Google retrieves - LLMs generate plausible continuations. That is why a model confidently cites nonexistent papers: "author + title + year + URL" is a statistically plausible pattern. RAG mitigates this by placing real context in the prompt before generation.
**Myth:** Temperature = randomness. Higher temperature = worse answers.
**Reality:** Temperature controls the shape of the probability distribution. Different tasks call for different values.
At temperature=0, the model always picks the most probable token - deterministic but prone to repetition loops. At temperature=1.0, it samples from the model's original distribution - exactly the one it was trained to produce. Values above 1.0 flatten the distribution: useful for brainstorming, dangerous for factual tasks.
Key Concepts
- LLM = autoregressive token predictor: 50,257 options in the GPT-2/GPT-3 vocabulary, each step one forward pass through the network
- Tokenization (BPE, cl100k_base): token != word. Non-Latin text is 2-3x more expensive than English
- Embeddings: meaning as vectors. text-embedding-3-small = 1536 dimensions, cosine similarity = semantic proximity
- Attention (QKV, multi-head, KV-cache): each token sees all others. O(n²) - why long context costs more
- Temperature + nucleus sampling: 0 for extraction, 0.7 for conversations, 1.0+ for creative work
- From GPT-1 (117M parameters) to GPT-4 (reportedly ~1.8T, not confirmed by OpenAI) - the same principle. Scale made something resembling intelligence emerge
Questions for Reflection
- If an LLM generates one token at a time, why does the streaming API feel so smooth? What happens between each token?
- Embeddings enable semantic search. Why does Google search still often lose to RAG systems on quality?
- At temperature=0 the model is deterministic. Does that mean two identical requests always return identical answers?
What's Next
The mechanism is clear. Next step: control it through the API and build production systems around it.
- LLM API Integration — From theory to practice - connecting GPT to Node.js
- Embeddings and Vector DB — Embeddings from this lesson - the foundation of RAG and semantic search
- Tokens and Context Window — A closer look at KV-cache, context limits and strategies for working with them
Related Lessons
- aie-02-ai-landscape — AI landscape and provider pricing before diving into LLM internals
- aie-04-tokens-context-window — KV-cache and context limits follow directly from token mechanics
- aie-09-embeddings — Embeddings from this lesson are the foundation of RAG and vector search
- aie-12-rag-fundamentals — RAG applies embeddings and attention mechanics at production scale
- nlp-01 — NLP language models and LLMs share the same core task: text probability
- aie-06-prompt-patterns — Prompt engineering builds on understanding temperature and sampling
- dl-03 — Neural network architecture provides context for Transformer layers
- ml-01