Generative AI

GPT Architecture

Prerequisites

Self-attention and multi-head attention mechanics
How tokens become embedding vectors

November 2022: ChatGPT launches and reportedly reaches 100 million users in about two months - at the time, faster than any consumer product before it. Behind it is GPT-3.5, a decoder-only transformer from the GPT-3 family. OpenAI has not published GPT-3.5's exact configuration; the published GPT-3 figures are 175 billion parameters, 96 layers, 96 attention heads, and model dimension 12288, with about 570GB of filtered Common Crawl text as the largest part of the training data, trained on a large GPU cluster. But the key insight is not scale alone. It is the right architecture: causal attention lets the model learn to predict the next token, and at scale this objective gives rise to emergent abilities - reasoning, code generation, instruction following.

**GitHub Copilot**: GPT architecture for code completion - more than 1 million paid subscribers and $100M ARR reported in 2023
**Cursor IDE**: GPT-4 based code assistant - uses KV-cache for fast streaming, context window management for codebases
**Claude (Anthropic)**: LLM with extended context windows up to 200K tokens (Anthropic has not published the architecture details)

Historical context

In June 2018 OpenAI published 'Improving Language Understanding by Generative Pre-Training' - the first GPT. Months later Google countered with BERT (encoder-only, masked language modeling); OpenAI stayed with the decoder-only, causal-attention, generative pre-training recipe. GPT-1: 117M parameters, 12 layers. In February 2019 came GPT-2 (1.5B), considered so 'dangerous' that OpenAI initially withheld the full model weights. In 2020 - GPT-3 (175B): few-shot learning without fine-tuning. In 2022 InstructGPT added RLHF, and in November 2022 ChatGPT brought the result to a mass audience. In under 5 years one architectural choice (decoder-only + scale) became an industrial revolution.

Decoder-only: autoregressive architecture

GPT uses a **decoder-only** Transformer, unlike the original encoder-decoder architecture (Vaswani et al., 2017). There is no encoder and no cross-attention - just a stack of decoder blocks. The task is autoregressive language modeling: predict the next token from the previous ones. Pre-training minimizes the cross-entropy loss of next-token prediction over hundreds of billions to trillions of tokens.
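The next-token objective can be sketched in a few lines. This is a minimal illustration, not a training loop: the logits are hand-written rather than produced by a model, and the names `next_token_loss` and `uniform_logits` are made up for the example. The point is the shift by one: the logits at position t are scored against the token at position t+1.

```python
import math

def softmax(logits):
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy of predicting token t+1 from the logits at position t."""
    targets = token_ids[1:]               # targets are the inputs shifted left by one
    losses = []
    for logits, target in zip(logits_per_position, targets):
        probs = softmax(logits)
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens and the sequence [0, 2, 1].
token_ids = [0, 2, 1]

# A model that knows nothing: uniform logits at both input positions.
uniform_logits = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
uniform_loss = next_token_loss(uniform_logits, token_ids)

# A model that is confident in the correct next token at each position.
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0]]
confident_loss = next_token_loss(confident_logits, token_ids)
```

With uniform logits the loss equals ln(3), the entropy of a uniform guess over three tokens; confident, correct logits push it close to zero.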

**Pre-norm vs Post-norm**: the original Transformer (2017) and GPT-1 used Post-norm (LayerNorm after the residual addition). GPT-2 and later models use **Pre-norm** (LayerNorm before attention/MLP). Pre-norm is more stable when training deep models: the residual path stays an identity connection, so gradients are less prone to exploding or vanishing. **Weight tying**: the token embedding and the LM head share one weight matrix - this saves memory and tends to improve generalization.
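The difference between the two orderings is easiest to see side by side. The sketch below is an illustration under simplifying assumptions: `toy_sublayer` is a hypothetical stand-in for attention or the MLP, and LayerNorm has no learned scale or bias. The weight-tying arithmetic uses GPT-2 small's published vocabulary size (50257) and model dimension (768).

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for attention or the MLP.
    return [0.5 * v for v in x]

def post_norm_block(x):
    # Original Transformer: normalize after adding the residual.
    return layer_norm([a + b for a, b in zip(x, toy_sublayer(x))])

def pre_norm_block(x):
    # GPT-2 style: normalize the sublayer input, leave the residual path untouched.
    return [a + b for a, b in zip(x, toy_sublayer(layer_norm(x)))]

x = [1.0, 2.0, 3.0, 4.0]
pre = pre_norm_block(x)     # x plus a bounded update
post = post_norm_block(x)   # re-normalized, so x itself no longer passes through unchanged

# Weight tying: the LM head reuses the [vocab_size, d_model] embedding matrix,
# so one full copy of that matrix is saved.
vocab_size, d_model = 50257, 768
params_saved_by_tying = vocab_size * d_model
```

In the pre-norm block the input reaches the output through a plain addition, which is the identity path that keeps gradients well behaved in deep stacks. For GPT-2 small, tying saves about 38.6M parameters out of roughly 124M.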

Why does GPT use a decoder-only architecture instead of a full encoder-decoder (like T5 or BART)?

Causal Self-Attention: masking the future

Causal (masked) self-attention is the key difference between decoder and encoder. Each token can only 'see' previous tokens (and itself), not future ones. This is implemented via a **causal mask**: the entries above the diagonal of the attention score matrix are set to -inf before softmax. After softmax, the corresponding weights become exactly 0.
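A minimal sketch of the masking step, assuming a square [T, T] matrix of raw attention scores for a single head (the function name `causal_attention_weights` is made up for this example). All scores are set to zero here so the effect of the mask is easy to read off: row i becomes uniform over positions 0..i.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax after setting entries above the diagonal to -inf."""
    T = len(scores)
    weights = []
    for i in range(T):
        # Position i may attend to positions 0..i only.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(T)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]   # exp(-inf) is exactly 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
# weights[0] == [1.0, 0.0, 0.0, 0.0]
# weights[3] == [0.25, 0.25, 0.25, 0.25]
```

Because the diagonal is never masked, every row has at least one finite score, so the softmax is always well defined.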

**Flash Attention** (Dao et al., 2022) - memory-efficient exact attention: instead of materializing the full [T, T] matrix in HBM, it computes attention tile by tile in on-chip SRAM, keeping a running softmax so the result is identical. Extra memory: O(T) instead of O(T^2). Speed: the paper reports 2-4x over a standard implementation. This helped make 100K+ token contexts practical.
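The arithmetic that makes tiling possible is the running ("online") softmax. The sketch below shows only that arithmetic for a single query in plain Python; it is not the FlashAttention kernel, which also tiles over queries and runs in GPU SRAM. The function names and the tile size are assumptions for the example. For causal attention, the query at position i would simply be given the keys and values of positions 0..i.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_direct(q, keys, values):
    # Reference: materialize every score, then softmax, then weighted sum.
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

def attention_tiled(q, keys, values, tile=2):
    # Process keys/values in tiles, keeping only a running max, sum, and accumulator.
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        k_tile = keys[start:start + tile]
        v_tile = values[start:start + tile]
        scores = [dot(q, k) for k in k_tile]
        new_max = max(running_max, max(scores))
        scale = math.exp(running_max - new_max)   # rescale earlier partial results
        running_sum *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, v_tile):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

q = [1.0, 0.5]
keys = [[0.2, 0.1], [1.0, -1.0], [0.3, 0.9], [2.0, 0.0], [-0.5, 0.4]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 3.0], [0.5, 0.5]]

direct = attention_direct(q, keys, values)
tiled = attention_tiled(q, keys, values, tile=2)
```

Both functions return the same output up to floating-point rounding, but the tiled version never holds more than one tile of scores at a time.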

Why is a causal mask necessary during training but not conceptually during inference?

KV-Cache: accelerating autoregressive generation

During autoregressive generation GPT produces one token at a time. Without optimization, each new token requires recomputing K and V for all previous tokens, so generating T tokens costs O(T^2) K/V projections, and attention over the full prefix is redone at every step as well. **KV-Cache**: store the K and V vectors of previous tokens; when a new token is added, compute only its K and V, append them to the cache, and attend from the new token's Q over the cached entries.
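A single-head sketch of the idea, under heavy simplifications: one layer, 2-dimensional toy embeddings, and hand-picked projection matrices `W_Q`, `W_K`, `W_V` (all hypothetical). The `KVCache` class is illustrative and not any library's API. It checks that caching changes the cost but not the result.

```python
import math

def project(x, w):
    """Multiply row vector x by matrix w (given as a list of rows)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]

def attend(q, keys, values):
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total for d in range(dim)]

# Hypothetical toy projection matrices.
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.5, 0.1], [0.2, 0.3]]
W_V = [[1.0, -1.0], [0.5, 0.5]]

def step_without_cache(prefix):
    # Re-project K and V for the whole prefix on every step.
    keys = [project(x, W_K) for x in prefix]
    values = [project(x, W_V) for x in prefix]
    return attend(project(prefix[-1], W_Q), keys, values)

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        # Project only the newest token and append it to the cache.
        self.keys.append(project(x, W_K))
        self.values.append(project(x, W_V))
        return attend(project(x, W_Q), self.keys, self.values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x) for x in tokens]
uncached_outputs = [step_without_cache(tokens[:t + 1]) for t in range(len(tokens))]
```

Only the newest token's Q is needed at each step, which is why the cache stores K and V but not Q.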

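KV-cache memory is 2 * batch_size * T * n_layers * n_kv_heads * head_dim values (the factor 2 covers K and V). For a LLaMA-2-70B-sized model (80 layers, 64 heads, head_dim 128) with full multi-head attention, seq_len=4096, batch_size=32 and fp16 values, that is ~344GB - larger than the ~140GB of fp16 model weights. Optimizations: **Grouped Query Attention** (GQA) lets groups of Q heads share one KV pair; **Multi-Query Attention** (MQA) uses a single KV head for all Q heads. LLaMA-2-70B itself uses GQA with 8 KV heads, which cuts the cache in this example 8x, to ~43GB.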
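The cache size is plain arithmetic, so it is worth computing once. The sketch below assumes fp16 storage (2 bytes per value) and LLaMA-2-70B's published shape of 80 layers and head_dim 128, with 64 query heads; the helper name `kv_cache_bytes` is made up for the example. A figure in the ~350GB range corresponds to caching K and V for all 64 heads, which is the multi-head attention case.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    # Factor 2: one K and one V vector per token, per layer, per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value

shape = dict(n_layers=80, head_dim=128, seq_len=4096, batch_size=32)

mha_bytes = kv_cache_bytes(n_kv_heads=64, **shape)   # every Q head has its own KV head
gqa_bytes = kv_cache_bytes(n_kv_heads=8, **shape)    # 8 Q heads share each KV head
mqa_bytes = kv_cache_bytes(n_kv_heads=1, **shape)    # all Q heads share one KV head

for name, size in [("MHA", mha_bytes), ("GQA", gqa_bytes), ("MQA", mqa_bytes)]:
    print(f"{name}: {size / 1e9:.1f} GB")
```

The cache scales linearly with the number of KV heads, so GQA with 8 KV heads is 8x smaller than MHA and MQA is 64x smaller, while the number of query heads, and with it most of the attention capacity, stays the same.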

Why is KV-cache unnecessary during training but critical at inference?

RoPE: Rotary Position Embedding

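GPT-2 uses absolute learned positional embeddings, so it cannot handle positions beyond its maximum training length (1024 tokens). **RoPE** (Rotary Position Embedding, Su et al., 2021) encodes position by rotating Q and K vectors by position-dependent angles: after rotation, the dot product Q_i dot K_j depends on position only through the difference (i - j). This relative formulation is what makes extending the context beyond the training length practical, usually with additional frequency scaling.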
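The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions with the base 10000 from the RoPE paper; real implementations differ in how they pair dimensions (interleaved or split halves), so treat the layout and the function name `rope` as assumptions of this example.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle position * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)   # lower dimensions rotate faster
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # both shifted by 100, same offset 3
score_other = dot(rope(q, 5), rope(k, 4))      # offset 1 gives a different score
```

`score_near` and `score_far` agree up to rounding because rotating both vectors by the same extra angle leaves their dot product unchanged; only the offset between the two positions matters.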

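RoPE is used in LLaMA, Mistral, Qwen, Gemma, and most modern open-weight LLMs. **YaRN** (Yet another RoPE extensioN) and **LongRoPE** extend RoPE to 128K+ tokens by rescaling its rotation frequencies, typically followed by a short fine-tuning stage on long sequences. LLaMA-3.1 follows a similar path: pre-trained mostly at 8K context, it reaches 128K through RoPE frequency scaling combined with continued pre-training on longer sequences.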

Common misconception: GPT generates the entire response in parallel, like an encoder processing input text

In fact: GPT generates one token at a time sequentially - each token depends on all previous ones; parallelism over positions exists only at training (and when processing the prompt), not when generating new tokens

Why it matters: autoregressive generation takes O(T) sequential forward passes; this is the main bottleneck for LLM latency and the reason KV-cache is critical
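The sequential dependency shows up directly in the shape of the generation loop. In this sketch `toy_next_token` is a hypothetical stand-in for a full model forward pass plus greedy decoding; the only thing it is meant to show is that step n cannot start until step n-1 has produced its token.

```python
def toy_next_token(prefix):
    # Hypothetical stand-in for a forward pass: next token = (sum of prefix) mod 5.
    return sum(prefix) % 5

def generate(prompt, n_new):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        # Each call depends on every token generated so far,
        # so the iterations cannot run in parallel.
        tokens.append(toy_next_token(tokens))
        forward_passes += 1
    return tokens, forward_passes

tokens, passes = generate([1, 2], n_new=4)
# tokens == [1, 2, 3, 1, 2, 4], passes == 4
```

Generating n_new tokens always takes n_new forward passes; a KV-cache makes each pass cheaper but does not reduce their number.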

What is the main advantage of RoPE over the absolute positional embeddings used in GPT-2?

Key ideas

**Decoder-only**: stack of transformer blocks with pre-norm, weight tying, autoregressive generation one token at a time
**Causal attention**: upper triangular mask blocks future tokens - necessary for parallel training on sequences
**KV-Cache**: storing K/V vectors of previous tokens - each generation step costs O(T) instead of O(T^2), at the price of cache memory that grows with context length
**RoPE**: rotating Q/K through sin/cos - relative positioning and a basis for extending context beyond training length

Questions for reflection

Why does weight tying (shared weights for token embeddings and LM head) improve generalization rather than cause underfitting?
How does Grouped Query Attention (GQA) reduce KV-cache memory consumption at inference without significant quality loss?
How does rescaling RoPE frequencies (interpolation/extrapolation, as in YaRN) allow the model to work with contexts longer than it was trained on?

Related lessons

gai-03 — Encoder-decoder architecture and attention basics - the foundation GPT builds on
gai-05 — Fine-tuning and RLHF operate on top of the pretrained GPT architecture
aie-03-llm-fundamentals — Understanding decoder-only architecture is directly needed for LLM API work and prompt engineering
dl-03 — GPT extends Transformer blocks from dl-03: same Q/K/V attention and MLP blocks, but with causal mask
nlp-04 — Token embeddings in GPT are analogous to Word2Vec - dense vectors, but contextual rather than static
dl-01
