Generative AI
GPT Architecture
Prerequisites
- Self-attention and multi-head attention mechanics
- How tokens become embedding vectors
February 2023: ChatGPT is reported to have reached 100 million users about two months after launch, at the time the fastest adoption of any consumer product. Behind it is GPT-3.5, a decoder-only transformer whose exact size OpenAI has not published. The figures usually quoted are those of GPT-3: 175 billion parameters, 96 layers, 96 attention heads, model dimension 12288, trained on roughly 570GB of filtered Common Crawl text plus smaller curated corpora, on a cluster of V100 GPUs. But the key insight is not scale alone. It is scale combined with the right architecture: causal attention lets the model learn to predict the next token, and at scale that objective leads to emergent abilities - reasoning, code generation, instruction following.
- **GitHub Copilot**: GPT architecture for code completion - 1 million paid subscribers in the first year, $100M ARR
- **Cursor IDE**: GPT-4 based code assistant - uses KV-cache for fast streaming, context window management for codebases
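- **Claude (Anthropic)**: extended context windows up to 200K tokens; Anthropic has not published the architecture, though a decoder-only design with RoPE-style positioning is the common choice for models of this class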
Historical context
In June 2018 OpenAI published 'Improving Language Understanding by Generative Pre-Training' - the first GPT. Months later Google countered with BERT (encoder-only, masked language modeling); OpenAI stayed with the decoder-only, causal-attention, generative pre-training recipe. GPT-1: 117M parameters, 12 layers. In February 2019 - GPT-2 (1.5B), considered so 'dangerous' that OpenAI initially withheld the full weights. In 2020 - GPT-3 (175B): few-shot learning without fine-tuning. In early 2022 InstructGPT added RLHF. In November 2022 ChatGPT launched and took the architecture to a mass audience. In about 5 years one architectural choice (decoder-only + scale) became an industrial revolution.
Decoder-only: autoregressive architecture
GPT uses a **decoder-only** Transformer, unlike the original encoder-decoder architecture (Vaswani et al., 2017). There is no encoder and no cross-attention - just a stack of decoder blocks. The task is autoregressive language modeling: predict the next token from the previous ones. Pre-training minimizes the cross-entropy loss of next-token prediction over hundreds of billions to trillions of tokens.
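The next-token objective described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than a training implementation; the function names and the toy logits are assumptions made for the example, and a real model would produce the logits from its decoder stack.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t.

    logits_per_position[t] is the score vector over the vocabulary after the
    model has read token_ids[: t + 1]. The last position has no target.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        target = token_ids[t + 1]  # the label is the *next* token
        losses.append(-log_softmax(logits_per_position[t])[target])
    return sum(losses) / len(losses)

# Toy example (hypothetical numbers): vocabulary of 3 tokens, sequence of 3 tokens.
token_ids = [0, 2, 1]

# A model that knows nothing assigns equal scores: loss is ln(vocab_size).
uniform_logits = [[0.0, 0.0, 0.0]] * 3
loss_uniform = next_token_loss(uniform_logits, token_ids)

# A model that strongly favors the correct next token gets a much lower loss.
confident_logits = [[0.0, 0.0, 5.0], [0.0, 5.0, 0.0], [0.0, 0.0, 0.0]]
loss_confident = next_token_loss(confident_logits, token_ids)
```

Note that the targets are simply the input sequence shifted by one position, which is why no labeled data is needed for pre-training.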
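**Pre-norm vs Post-norm**: the original Transformer (2017) and GPT-1 used Post-norm (LayerNorm applied after the residual addition). GPT-2 and later models use **Pre-norm** (LayerNorm applied to the input of attention/MLP, inside the residual branch). Pre-norm is more stable for training deep models: the residual path stays an identity mapping, so gradients reach early layers without passing through a normalization at every block. **Weight tying**: the token embedding and LM head share one weight matrix, which saves vocab_size x d_model parameters and tends to improve generalization.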
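The difference between the two orderings is easiest to see with a sublayer that contributes nothing. The sketch below is an illustration only: `layer_norm` omits the learned scale and shift, and `zero_sublayer` is a hypothetical stand-in for attention or the MLP.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift: zero mean, unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    return [u + v for u, v in zip(a, b)]

def post_norm_block(x, sublayer):
    """Post-norm: normalize after the residual addition."""
    return layer_norm(add(x, sublayer(x)))

def pre_norm_block(x, sublayer):
    """Pre-norm: normalize the sublayer input, leave the residual path untouched."""
    return add(x, sublayer(layer_norm(x)))

def zero_sublayer(h):
    """Hypothetical sublayer that outputs zeros, to isolate the residual path."""
    return [0.0] * len(h)

x = [1.0, 2.0, 4.0]
pre_out = pre_norm_block(x, zero_sublayer)    # residual passes through unchanged
post_out = post_norm_block(x, zero_sublayer)  # residual is rescaled by LayerNorm
```

With pre-norm the input survives the block exactly, so a stack of such blocks always contains a direct path from the first layer to the last. With post-norm every block renormalizes that path.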
Why does GPT use a decoder-only architecture instead of a full encoder-decoder (like T5 or BART)?
Causal Self-Attention: masking the future
Causal (masked) self-attention is the key difference between decoder and encoder. Each token can only 'see' previous tokens (and itself), not future ones. This is implemented via a **causal mask**: the entries above the diagonal of the [T, T] attention score matrix (the pre-softmax QK^T values) are set to -inf. After softmax the corresponding weights become exactly 0, and each row still sums to 1 over the visible positions.
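A minimal sketch of the masking step, in plain Python so it runs without dependencies. The score values are made up for illustration; in a real model they would be the scaled QK^T products.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]  # math.exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention_weights(scores):
    """Mask the entries above the diagonal of a [T, T] score matrix, then softmax each row."""
    T = len(scores)
    masked = [
        [scores[i][j] if j <= i else -math.inf for j in range(T)]
        for i in range(T)
    ]
    return [softmax(row) for row in masked]

# Hypothetical raw attention scores for a 3-token sequence.
scores = [
    [0.5, 2.0, 1.0],
    [1.0, 1.0, 3.0],
    [0.2, 0.4, 0.6],
]
weights = causal_attention_weights(scores)
# Row 0 can only attend to token 0; row 1 splits its weight over tokens 0 and 1.
```

The large scores at positions (0, 1) and (1, 2) have no effect on the result, because they belong to future tokens and are removed before normalization.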
**Flash Attention** (Dao et al., 2022) - memory-efficient causal attention: instead of materializing the full [T, T] matrix in HBM, it computes attention in tiles in SRAM. Memory: O(T) instead of O(T^2). Speed: 2-4x faster than the standard implementation. This made 100K+ token contexts feasible.
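The tiling idea rests on an "online softmax": a running maximum and running denominator let attention be accumulated one tile of keys at a time, without ever holding all T scores. The sketch below shows only that recurrence for a single query in plain Python. It is not the GPU kernel, and the tile size and input values are arbitrary assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Standard attention for one query: materializes all T scores at once."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    d = len(values[0])
    return [sum(e * v[c] for e, v in zip(exps, values)) / z for c in range(d)]

def attention_tiled(q, keys, values, tile=2):
    """Same result, visiting keys/values one tile at a time (online softmax)."""
    d = len(values[0])
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * d  # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first tile).
        rescale = math.exp(running_max - new_max)
        denom *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * vc for a, vc in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

# Hypothetical inputs: one query, five key/value pairs of dimension 2.
q = [1.0, 0.5]
keys = [[0.1, 0.2], [2.0, -1.0], [0.5, 0.5], [-1.0, 3.0], [0.3, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.5, 0.5]]

out_reference = attention_reference(q, keys, values)
out_tiled = attention_tiled(q, keys, values, tile=2)
```

The tiled version keeps only one tile of scores plus a few running statistics, which is the property that lets the real kernel stay in on-chip SRAM.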
Why is a causal mask necessary during training but not conceptually during inference?
KV-Cache: accelerating autoregressive generation
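During autoregressive generation GPT produces one token at a time. Without optimization, step t re-runs the forward pass over all t tokens, recomputing K and V for every previous token: generating T tokens costs O(T^2) K/V projections in total. **KV-Cache**: store the K and V vectors of previous tokens in every layer; when a new token is added, compute only its K and V and append them to the cache. Each token's K and V are then computed once, O(T) in total, and a step only has to attend from the new query over the cached entries.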
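A toy single-head, single-layer version makes the equivalence concrete. Everything here is a sketch under assumptions: the 2x2 projection matrices, the input vectors, and the `KVCache` class are hypothetical, and a real model keeps one such cache per layer and per head.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(q, keys, values):
    """Scaled dot-product attention for one query over the given keys/values."""
    scores = [dot(q, k) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[c] for e, v in zip(exps, values)) / z
            for c in range(len(values[0]))]

# Hypothetical projection weights for a model dimension of 2.
W_q = [[0.5, -0.2], [0.1, 0.3]]
W_k = [[0.2, 0.4], [-0.3, 0.1]]
W_v = [[1.0, 0.0], [0.5, -1.0]]

def last_token_no_cache(xs):
    """Recompute K and V for the whole prefix on every call."""
    keys = [matvec(W_k, x) for x in xs]
    values = [matvec(W_v, x) for x in xs]
    return attend(matvec(W_q, xs[-1]), keys, values)

class KVCache:
    """Keeps K and V of earlier tokens; each step projects only the new token."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(matvec(W_k, x))
        self.values.append(matvec(W_v, x))
        return attend(matvec(W_q, x), self.keys, self.values)

xs = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 2.0]]  # toy token embeddings
cache = KVCache()
cached_outputs = [cache.step(x) for x in xs]
uncached_outputs = [last_token_no_cache(xs[: t + 1]) for t in range(len(xs))]

T = len(xs)
projections_without_cache = sum(range(1, T + 1))  # 1 + 2 + ... + T
projections_with_cache = T
```

Both paths give the same output for every step. The cached path trades memory for compute, which is why the size of the cache becomes the next concern.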
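KV-cache consumes 2 * batch_size * T * n_layers * n_kv_heads * head_dim values (the factor 2 covers K and V). For a model with the shape of LLaMA-2-70B (80 layers, 64 heads, head_dim 128) at seq_len=4096, batch_size=32 and fp16, full multi-head attention would need ~340GB - more than twice the ~140GB of fp16 model weights. Optimizations: **Grouped Query Attention** (GQA) lets groups of Q heads share one KV pair; LLaMA-2-70B itself uses 8 KV heads, which cuts the cache 8x to ~43GB. **Multi-Query Attention** (MQA) uses a single KV head for all Q heads.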
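The arithmetic behind these figures is short enough to check directly. The sketch below assumes the published LLaMA-2-70B shape (80 layers, 64 query heads, head dimension 128, 8 KV heads under GQA) and 2 bytes per value for fp16; the helper name `kv_cache_bytes` is made up for this example.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes for keys plus values (the leading 2) across all layers and sequences."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

# Assumed shape: LLaMA-2-70B, seq_len=4096, batch_size=32, fp16.
shape = dict(n_layers=80, head_dim=128, seq_len=4096, batch_size=32)

mha = kv_cache_bytes(n_kv_heads=64, **shape)  # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 8 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

The ~350GB order of magnitude corresponds to the full multi-head case (about 344 GB). With 8 KV heads the same workload needs about 43 GB, and with a single KV head about 5 GB.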
Why is KV-cache unnecessary during training but critical at inference?
RoPE: Rotary Position Embedding
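GPT-2 uses absolute learned positional embeddings, which are capped at the maximum training length. **RoPE** (Rotary Position Embedding, Su et al., 2021) encodes position by rotating the Q and K vectors by an angle proportional to the token position: the positional part of the dot product Q_i dot K_j then depends only on the position difference (i - j). Because attention sees only relative offsets, RoPE extends to longer contexts more gracefully than learned absolute embeddings, although in practice reliable extrapolation beyond the training length also needs frequency rescaling.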
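The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions, one common RoPE layout, with the usual base of 10000; the vectors and positions are arbitrary values chosen for illustration.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]), i even, by the angle position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical query and key vectors of dimension 4.
q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.9]

# Same offset (i - j = 2) at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 3))
score_far = dot(rope(q, 105), rope(k, 103))

# A different offset (i - j = 7) gives a different score.
score_other_offset = dot(rope(q, 10), rope(k, 3))
```

`score_near` and `score_far` agree up to floating-point error even though the absolute positions differ by 100, while changing the offset changes the score.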
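RoPE is used in LLaMA, Mistral, Qwen, Gemma, and most modern open-source LLMs. **YaRN** (Yet another RoPE extensioN) and **LongRoPE** extend RoPE to 128K+ tokens by rescaling the rotation frequencies, so that long contexts map into the angle range seen during training, usually followed by a short fine-tune on long sequences. LLaMA-3.1 follows a related recipe: pre-trained mostly at 8K context, it reaches 128K through a larger RoPE base frequency, frequency scaling, and continued training on long sequences.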
Misconception: GPT generates the entire response in parallel, like an encoder processing input text.
Reality: GPT generates one token at a time, sequentially - each new token depends on all previous ones. The prompt is processed in parallel, as are training sequences, but generated tokens are not.
Why it matters: generating T tokens takes T sequential forward passes; this is the main bottleneck for LLM latency and the reason KV-cache is critical.
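The sequential dependency shows up directly in the shape of a decoding loop. In this sketch `toy_next_token_logits` is a hypothetical stand-in for a full GPT forward pass (it simply favors the previous token plus one), and decoding is greedy for simplicity.

```python
def toy_next_token_logits(tokens, vocab_size=5):
    """Hypothetical stand-in for a forward pass: favors (last token + 1) mod vocab_size."""
    favored = (tokens[-1] + 1) % vocab_size
    return [1.0 if i == favored else 0.0 for i in range(vocab_size)]

def generate_greedy(prompt, n_new_tokens):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new_tokens):
        # Each pass needs every token produced so far, so passes cannot overlap.
        logits = toy_next_token_logits(tokens)
        forward_passes += 1
        tokens.append(max(range(len(logits)), key=logits.__getitem__))
    return tokens, forward_passes

tokens, forward_passes = generate_greedy([0], 4)
```

The loop cannot be turned into a single batched call, because the input of step t + 1 contains the output of step t. Four new tokens cost four forward passes.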
What is the main advantage of RoPE over the absolute positional embeddings used in GPT-2?
Related Topics
The GPT architecture sits in the middle of the generative AI pipeline:
- Tokenization: BPE, SentencePiece — The embedding layer maps token IDs to vectors; vocabulary size sets the embedding matrix dimensions
- LLM Training: Pre-training — This decoder-only architecture is what gets trained on trillions of tokens via next-token prediction
- Language Models: from n-gram to GPT — Causal attention generalizes the n-gram idea of conditioning on prior context, with unlimited range
Key ideas
- **Decoder-only**: stack of transformer blocks with pre-norm, weight tying, autoregressive generation one token at a time
- **Causal attention**: upper triangular mask blocks future tokens - necessary for parallel training on sequences
- **KV-Cache**: storing K/V vectors of previous tokens - each token's K/V is computed once, so O(T) K/V projections instead of O(T^2) over a T-token generation
- **RoPE**: rotating Q/K through sin/cos - relative positioning and a basis for extending context beyond the training length
Questions for reflection
- Why does weight tying (shared weights for token embeddings and LM head) improve generalization rather than cause underfitting?
- How does Grouped Query Attention (GQA) reduce KV-cache memory consumption at inference without significant quality loss?
- How does rescaling RoPE frequencies (interpolation/extrapolation, as in YaRN) allow the model to work with contexts longer than it was trained on?
Related lessons
- gai-03 — Encoder-decoder architecture and attention basics - the foundation GPT builds on
- gai-05 — Fine-tuning and RLHF operate on top of the pretrained GPT architecture
- aie-03-llm-fundamentals — Understanding decoder-only architecture is directly needed for LLM API work and prompt engineering
- dl-03 — GPT extends Transformer blocks from dl-03: same Q/K/V attention and MLP blocks, but with causal mask
- nlp-04 — Token embeddings in GPT are analogous to Word2Vec - dense vectors, but contextual rather than static
- dl-01