AI Engineering

How LLMs Work: Tokens, Embeddings, Attention

Lesson Objectives

Understand how LLMs generate text (autoregressive generation) and why this explains hallucinations
Learn tokenization (BPE, tiktoken) and why it directly affects cost
Grasp embeddings: how meaning becomes a vector and why king - man + woman = queen
Understand attention: Query-Key-Value, multi-head, KV-cache, FlashAttention
Learn to choose temperature and sampling strategy for different tasks

Prerequisites

The AI industry map: models, providers, and their differences
Understanding the AI Backend Engineer role

2017. Google Brain. Eight authors publish a 15-page paper introducing the Transformer - the architecture where attention alone replaces recurrence. Three years later GPT-3 arrives with 175 billion parameters, and few-shot learning emerges on its own - nobody programmed it. One rule: predict the next token from 50,257 options. That mechanism now writes code in every editor on the planet. Understanding it from the inside is the difference between treating AI as a black box and actually engineering with it.

**GitHub Copilot** generates one token at a time - autoregressive in real time, directly in the editor
A 500-word Russian prompt is roughly 1,500 tokens; the same text in English is roughly 600. Price difference: 2.5x
**RAG at Perplexity**: documents as embeddings in Qdrant, cosine similarity at every query across 100M+ users
AI Engineer interview, question one across every company: explain the attention mechanism

The Transformer Paper

2017. Google Brain. Eight authors - Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin. One paper, "Attention Is All You Need", introducing the Transformer architecture. Before it: RNNs and LSTMs processed text sequentially, token by token, and long-range dependencies faded with distance. The Transformer scrapped that paradigm: all tokens are processed in parallel via self-attention, which made training far easier to parallelize on GPUs. One year later - BERT. Another year - GPT-2. 2020: Brown et al. published GPT-3 with 175B parameters - and few-shot learning emerged on its own, without any explicit training for it. All eight original authors eventually left Google, most of them to found startups: Aidan Gomez built Cohere, Noam Shazeer founded Character.AI - and Google later paid a reported 2.7 billion dollars in a licensing deal that brought Shazeer back.

LLM - A Next-Token Prediction Machine

An LLM does not understand language. It distributes probability mass across every token in its vocabulary - 50,257 of them in GPT-2 and GPT-3. Every time. Every step. And it works.

**Large Language Model** - a neural network trained on terabytes of internet text. One job: **predict the next token**. Not understand. Not recall. Predict - from statistical patterns seen billions of times during training.

Take the prompt "The capital of France is". The model assigns a probability to every token in its vocabulary and picks one - say, "Paris". "Paris" gets appended to the context - and the loop kicks off again. Next token. Another full forward pass through every layer of the network. This is **autoregressive generation**: each step builds on the previous one, token by token.

**Hallucinations are not a bug.** They fall straight out of how the model works. It generates a plausible continuation - not a true one. If training data contained "Einstein discovered relativity in 1905" thousands of times, the model reproduces that pattern. Whether it fits the current context never enters the equation.
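The generation loop described above can be sketched in a few lines. This is a toy under stated assumptions: `TOY_MODEL` is a hypothetical hand-written table that looks only at the last token, whereas a real LLM runs a full forward pass over the whole context. The loop structure (predict, pick, append, repeat) is the part that carries over.

```python
import random

# Hypothetical toy "model": maps the last token to next-token probabilities.
# A real LLM conditions on the entire context, not just the last token.
TOY_MODEL = {
    "The": {"capital": 1.0},
    "capital": {"of": 1.0},
    "of": {"France": 1.0},
    "France": {"is": 1.0},
    "is": {"Paris": 0.9, "Lyon": 0.1},
    "Paris": {"<eos>": 1.0},
    "Lyon": {"<eos>": 1.0},
}


def next_token_distribution(context):
    """Stand-in for a forward pass: returns P(next token | context)."""
    return TOY_MODEL[context[-1]]


def generate(prompt, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    context = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_distribution(context)
        if greedy:
            token = max(probs, key=probs.get)  # most probable token
        else:
            tokens, weights = zip(*probs.items())
            token = rng.choices(tokens, weights=weights, k=1)[0]
        if token == "<eos>":
            break
        context.append(token)  # the new token becomes part of the next input
    return context


print(generate(["The", "capital", "of", "France", "is"]))
```

Note that nothing in the loop checks whether "Paris" is true. With `greedy=False` the same loop occasionally emits "Lyon", just as fluently.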


Tokenization: How Text Becomes Numbers

Neural networks do not chew on letters. They need numbers. Between raw text and the network sits a **tokenizer** - and it sets the dollar amount on every request.

A token is not a word, not a character. It is a **chunk of text** optimized by frequency. GPT-4 uses the `cl100k_base` tokenizer with a **vocabulary of roughly 100,000 tokens** (GPT-2 and GPT-3 used 50,257) - built by BPE (Byte Pair Encoding): an algorithm that repeatedly merges the most frequent pair of adjacent symbols until the target vocabulary size is hit. Common words become single tokens. Rare words shatter into fragments.
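The merge loop at the heart of BPE fits in a short sketch. This is a character-level toy on a five-word corpus, not the tokenizer that ships with any model: production tokenizers such as tiktoken work on bytes, pre-split text with a regex, and learn on the order of 100,000 merges. The function names here are illustrative.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    # Start with every word split into single characters.
    words = Counter(tuple(word) for word in corpus.split())
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, segmented = train_bpe("low low low lower lowest", num_merges=2)
print(merges)     # learned merge rules, in order
print(segmented)  # how each word is segmented after the merges
```

After two merges the frequent word "low" is a single symbol, while the rarer "lower" and "lowest" are still split into pieces. That is the "common words become single tokens, rare words shatter" effect in miniature.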

**Rule of thumb:** 1 token is roughly 4 characters in English, roughly 1-2 characters in non-Latin scripts (Russian, Chinese, Arabic). Non-English prompts cost more - more tokens for the same amount of text.

GPT-4o costs USD 2.50 per million input tokens. Sounds tiny - until the math lands: 100,000 requests a month at 500 input tokens each = 50M tokens = USD 125/month on input alone. Bloated prompts with redundant instructions double that fast.

Model	Input (per 1M tokens)	Output (per 1M tokens)	Context window
GPT-4o	USD 2.50	USD 10.00	128K
GPT-4o-mini	USD 0.15	USD 0.60	128K
Claude 3.5 Sonnet	USD 3.00	USD 15.00	200K
Llama 3 70B (self-hosted)	~USD 0.50	~USD 1.00	8K
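The cost arithmetic is worth wiring into a helper. The sketch below hard-codes the per-million-token prices from the table above; the dictionary keys and the function name are illustrative, and the prices themselves change over time, so treat the numbers as a snapshot.

```python
# USD per 1M tokens, copied from the pricing table in this lesson.
# Keys are illustrative labels, not guaranteed API model identifiers.
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3.5-sonnet": {"input": 3.00, "output": 15.00},
}


def estimate_cost(model, requests, input_tokens, output_tokens=0):
    """Estimated USD cost for `requests` calls with the given tokens per call."""
    price = PRICES[model]
    total_input = requests * input_tokens
    total_output = requests * output_tokens
    return (total_input * price["input"] + total_output * price["output"]) / 1_000_000


# 100,000 requests at 500 input tokens each, input side only.
print(estimate_cost("gpt-4o", 100_000, 500))

# The same 500-word prompt at ~1,500 tokens (Russian) vs ~600 tokens (English).
ratio = estimate_cost("gpt-4o", 1, 1_500) / estimate_cost("gpt-4o", 1, 600)
print(round(ratio, 2))
```

Passing `output_tokens` matters in practice: output is priced four to five times higher than input for every hosted model in the table.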


Embeddings: Meaning as a Vector

A token ID like 15339 means nothing on its own. The network needs meaning. The embedding layer delivers it: a lookup table mapping every token in the vocabulary to a dense vector of floating-point numbers.

GPT-2: 768 dimensions per token. GPT-3: 12,288. Modern models: thousands. This table trains alongside the rest of the network - gradients flow backward through it, and each token's vector nudges with every use. Billions of micro-adjustments later, language structure ends up encoded as geometry.

The classic demonstration: take the vector for "king", subtract "man", add "woman" - and the nearest vector is "queen". Not a trick - geometry at scale. The same mathematics powers recommendation systems of the kind Spotify and Netflix run: tracks and movies represented as vectors, closeness measured by cosine similarity. One abstraction, many domains.
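The king - man + woman = queen arithmetic can be reproduced with hand-made vectors. The three dimensions below (roughly "royalty", "male", "female") and all the numbers are invented for illustration. Learned embeddings have hundreds or thousands of dimensions with no such clean labels, and the analogy holds only approximately.

```python
import math

# Hypothetical 3-d embeddings: [royalty, male, female]. Values are made up.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.05, 0.3, 0.3],
}


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest(vector, exclude=()):
    """Word whose embedding has the highest cosine similarity to `vector`."""
    candidates = {w: v for w, v in EMBEDDINGS.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine_similarity(vector, candidates[w]))


target = [
    k - m + w
    for k, m, w in zip(EMBEDDINGS["king"], EMBEDDINGS["man"], EMBEDDINGS["woman"])
]
print(nearest(target, exclude=("king", "man", "woman")))
```

Excluding the input words from the candidates is the usual convention for analogy tests, since the result vector often stays close to one of its own inputs.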

**Embeddings are the foundation of RAG.** In Retrieval-Augmented Generation, documents are converted to vectors via `text-embedding-3-small` (1536 dimensions) and stored in a vector database (Qdrant, pgvector, Pinecone). Queries are embedded the same way and matched by cosine similarity. The model gets the retrieved context and generates from it. More on this in the RAG lesson.
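The retrieval step reduces to "score every stored vector against the query vector, keep the best". The sketch below assumes the embeddings already exist: the 3-d vectors are hand-written stand-ins for real 1536-dimensional ones, and an in-memory list stands in for the vector database. No embedding API or database client is called.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


# (document text, hypothetical precomputed embedding)
INDEX = [
    ("Qdrant is a vector database", [0.9, 0.1, 0.0]),
    ("Paris is the capital of France", [0.0, 0.2, 0.9]),
    ("pgvector adds vector search to Postgres", [0.6, 0.7, 0.1]),
]


def retrieve(query_vector, index, top_k=2):
    """Return the top_k document texts ranked by cosine similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]


# Pretend this is the embedding of "which vector database should I use?"
query_vector = [0.85, 0.2, 0.05]
context = retrieve(query_vector, INDEX)

# The retrieved text is placed in the prompt ahead of the question.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```

A real vector database replaces the linear scan with an approximate nearest-neighbor index, but the contract is the same: vector in, ranked documents out.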


Attention: How the Model Understands Context

Embeddings encode tokens in isolation. But "river bank" and "bank account" share the same word with two different meanings. Static embeddings cannot tell them apart. Nobody had a clean fix - until the attention mechanism arrived.

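**Self-Attention** lets each token look at every other token and update its representation based on context. Mechanically: each token gets three vectors - Query, Key, Value - produced by learned linear projections of its embedding. The Query is dot-multiplied with every Key, the scores are scaled by the square root of the key dimension and passed through softmax to produce attention weights. The weighted sum over Values yields the updated representation. That is one attention "head"; each layer runs many heads in parallel.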
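A single attention head can be written out in plain Python. This sketch takes the Query, Key and Value vectors as given. In a real Transformer they come from learned projection matrices, and decoder models also apply a causal mask so a token cannot attend to later positions. Both are omitted here to keep the core computation visible.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, no masking."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # One score per key: how relevant is each token to this query?
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors = the token's updated representation.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
        )
    return outputs


# Three tokens with 2-d vectors; here Q, K and V are the same toy vectors.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for row in self_attention(x, x, x):
    print([round(value, 3) for value in row])
```

Each output row is a blend of all value vectors, weighted by how well that token's query matches every key.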

GPT-3 runs **multi-head attention**: 96 parallel heads per layer, across 96 layers (OpenAI has not published GPT-4's architecture). Each head can learn its own patterns - some track syntax, others coreference or factual associations. At inference time the **KV-cache** stores Key and Value vectors for already-processed tokens, so the past is never recomputed for each new token. The cache grows linearly with context length, which is why long contexts burn through GPU memory fast.
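A back-of-the-envelope estimate shows how quickly the KV-cache grows. The sketch plugs in GPT-3's published dimensions (96 layers, model width 12,288, 2,048-token context). It assumes standard multi-head attention with 16-bit values. Models using grouped-query or multi-query attention cache fewer Key/Value vectors, so the figure is an upper bound for them.

```python
def kv_cache_bytes(num_layers, d_model, seq_len, bytes_per_value=2, batch_size=1):
    """Memory for cached Keys and Values, assuming standard multi-head attention."""
    # Factor 2: one Key vector and one Value vector per token, per layer.
    return 2 * num_layers * d_model * seq_len * bytes_per_value * batch_size


per_token = kv_cache_bytes(num_layers=96, d_model=12_288, seq_len=1)
full_context = kv_cache_bytes(num_layers=96, d_model=12_288, seq_len=2_048)

print(per_token / 2**20, "MiB per token")
print(full_context / 2**30, "GiB for a 2,048-token context")
```

The cost is linear in `seq_len` and in `batch_size`, so serving many long conversations at once multiplies it directly.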

**O(n²) in sequence length.** Attention pairs every token with every other. A 128K token context = roughly 16 billion token pairs, per head, per layer. FlashAttention (used in vLLM and most inference servers) softens the blow via tiling - it computes attention in blocks without materializing the full n x n matrix, which cuts attention memory from quadratic to linear in sequence length. But the quadratic compute is fundamental to exact attention - FlashAttention makes it faster, not smaller.
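The tiling idea rests on a running ("online") softmax: process Keys and Values one block at a time, keep a running maximum and running sum, and rescale earlier partial results whenever the maximum changes. The sketch below shows that idea for a single query in plain Python and checks it against the straightforward version. It is not the FlashAttention kernel itself, which is a fused GPU implementation working on tiles in on-chip memory.

```python
import math


def attention_full(q, keys, values):
    """Reference: compute all scores first, then softmax, then the weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]


def attention_tiled(q, keys, values, block_size=2):
    """Same result, but only one block of scores exists at any time."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_block]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            for j in range(dim):
                acc[j] += w * v[j]
        running_max = new_max
    return [a / running_sum for a in acc]


q = [1.0, 0.5]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5], [2.0, 0.1]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]

full = attention_full(q, keys, values)
tiled = attention_tiled(q, keys, values, block_size=2)
print(all(abs(a - b) < 1e-9 for a, b in zip(full, tiled)))
```

Both functions still visit every query-key pair, so the amount of arithmetic is unchanged. What shrinks is the working memory.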

What is the attention mechanism in Transformers for?

Temperature and Sampling: Controlling Output

After all the layers, the network spits out a vector of raw scores, one per vocabulary token (50,257 in GPT-2 and GPT-3) - logits. Softmax converts them into probabilities. One question remains: how does a token get picked? Grab the highest-probability one (greedy decoding)? Or sample?

Greedy decoding is deterministic but brittle - models get stuck in repetition loops. Sampling avoids this, and **temperature** controls how adventurous the sampling is: logits are divided by the temperature T before softmax. Lower temperature: distribution sharpens (one option dominates). Higher: distribution flattens (all options compete). Temperature=1.0 leaves the model's original distribution untouched - the one it was trained with. Temperature=0 is treated as greedy decoding.
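Temperature scaling is one division before softmax. The sketch below uses three made-up logits. The function name is illustrative, and temperature 0 is rejected explicitly because dividing by zero is undefined, which is why it is handled as plain argmax instead.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The ranking of tokens never changes with temperature. Only the gap between the front-runner and the rest does.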

**top_p (nucleus sampling)** takes a different angle: keep the smallest set of tokens whose cumulative probability >= p, sample only from those. At `top_p=0.9`, rare outlier tokens get cut off. **top_k** is simpler: keep the k most probable tokens. Production rule: do not stack temperature and top_p together - pick one and tune it.
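Both filters act on the probability distribution before a token is drawn. A minimal sketch with an invented four-token distribution follows. Function names are illustrative, and real implementations operate on logit tensors over the whole vocabulary, not on dictionaries.

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalize."""
    keep = sorted(probs, key=probs.get, reverse=True)[:k]
    total = sum(probs[t] for t in keep)
    return {t: probs[t] / total for t in keep}


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    kept, cumulative = [], 0.0
    for token in sorted(probs, key=probs.get, reverse=True):
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= p:
            break
    total = sum(probs[t] for t in kept)
    return {t: probs[t] / total for t in kept}


# Made-up next-token distribution.
probs = {"Paris": 0.5, "Lyon": 0.3, "Nice": 0.15, "Oslo": 0.05}

print(top_k_filter(probs, k=2))
print(top_p_filter(probs, p=0.9))
```

With `p=0.9` the nucleus is Paris, Lyon and Nice (cumulative 0.95) and the 5% outlier is cut. With a flatter distribution the same `p` would keep more tokens, which is the adaptive behavior top_k lacks.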

Task	Temperature	Why
JSON extraction	0	Need an exact, reproducible result
Classification	0	One correct answer
Chatbot / assistant	0.5-0.7	Varied but sensible
Creative writing	0.8-1.2	Need variety and unexpected twists
Brainstorming	1.0-1.5	Maximum diversity of ideas

**For production:** start with temperature=0 for all tasks. Increase only if responses are too repetitive. Speculative decoding (supported in vLLM) can speed up inference by up to roughly 3x: a small draft model proposes several tokens and the large model verifies them in a single forward pass - it works with both greedy decoding and sampling.

You are building an API that extracts structured data (name, email, phone) from text. Which temperature should you choose?

Myth: LLMs "know" answers - they have a knowledge base

Reality: LLMs reproduce statistical patterns from training data. No database - only neural network weights

The difference between search and generation. Google retrieves - LLMs generate plausible continuations. That is why a model confidently cites nonexistent papers: "author + title + year + URL" is a statistically plausible pattern. RAG solves this by placing real context before generation.

Myth: Temperature = randomness. Higher temperature = worse answers

Reality: Temperature controls the shape of the probability distribution. Different tasks call for different values

At temperature=0, the model always picks the most probable token - deterministic but prone to repetition loops. At temperature=1.0, it samples from its original distribution - exactly the one it was trained to produce. Values above 1.0 flatten the distribution: useful for brainstorming, dangerous for factual tasks.

Key Concepts

LLM = autoregressive token predictor: one forward pass per step, scoring every token in the vocabulary (50,257 in GPT-2/GPT-3, roughly 100K in GPT-4)
Tokenization (BPE, cl100k_base): token != word. Non-Latin text is 2-3x more expensive than English
Embeddings: meaning as vectors. text-embedding-3-small = 1536 dimensions, cosine similarity = semantic proximity
Attention (QKV, multi-head, KV-cache): each token sees all others. O(n²) - why long context costs more
Temperature + nucleus sampling: 0 for extraction, 0.7 for conversations, 1.0+ for creative work
From GPT-1 (117M parameters) to GPT-4 (reportedly ~1.8T, never confirmed by OpenAI) - the same principle. Scale made something resembling intelligence emerge

Questions for Reflection

If an LLM generates one token at a time, why does the streaming API feel so smooth? What happens between each token?
Embeddings enable semantic search. Why does Google search still often lose to RAG systems on quality?
At temperature=0 the model is deterministic. Does that mean two identical requests always return identical answers?

What's Next

The mechanism is clear. Next step: control it through the API and build production systems around it.

LLM API Integration — From theory to practice - connecting GPT to Node.js
Embeddings and Vector DB — Embeddings from this lesson - the foundation of RAG and semantic search
Tokens and Context Window — A closer look at KV-cache, context limits and strategies for working with them

Related Lessons

aie-02-ai-landscape — AI landscape and provider pricing before diving into LLM internals
aie-04-tokens-context-window — KV-cache and context limits follow directly from token mechanics
aie-09-embeddings — Embeddings from this lesson are the foundation of RAG and vector search
aie-12-rag-fundamentals — RAG applies embeddings and attention mechanics at production scale
nlp-01 — NLP language models and LLMs share the same core task: text probability
aie-06-prompt-patterns — Prompt engineering builds on understanding temperature and sampling
dl-03 — Neural network architecture provides context for Transformer layers
ml-01