Generative AI

GPT Architecture

Prerequisites

  • Self-attention and multi-head attention mechanics
  • How tokens become embedding vectors
  • Tokenization: BPE, SentencePiece
  • Language Models: from n-gram to GPT

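January 2023: ChatGPT reaches an estimated 100 million users two months after launch - at that point the fastest adoption of any consumer app. Behind it is GPT-3.5 - a decoder-only transformer descended from GPT-3. OpenAI has not published GPT-3.5's exact size; the figures usually quoted are those of GPT-3: 175 billion parameters, 96 layers, 96 attention heads, dimension 12288. The largest part of GPT-3's training data was about 570GB of filtered Common Crawl text, and training ran on a cluster of thousands of GPUs (V100s, per the GPT-3 paper). But the key insight is not scale alone. It is the pairing of architecture and objective: causal attention lets the model learn to predict the next token, which at scale leads to emergent abilities - reasoning, code generation, instruction following.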

  • **GitHub Copilot**: GPT architecture for code completion - 1 million paid subscribers in the first year, $100M ARR
  • **Cursor IDE**: GPT-4 based code assistant - uses KV-cache for fast streaming, context window management for codebases
  • **Claude (Anthropic)**: autoregressive LLM family with extended context windows up to 200K tokens; Anthropic has not published architecture details, so specifics such as RoPE positioning are unconfirmed
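The 175-billion figure can be sanity-checked from the layer count and dimension quoted above. The sketch below uses the common approximation of 12 * d_model^2 weights per block and ignores biases and LayerNorm parameters. The vocabulary size (50257) and context length (2048) are GPT-3's published values, not stated in this lesson, and the function name is illustrative.

```python
def gpt_param_estimate(n_layers, d_model, vocab_size, n_ctx):
    """Rough parameter count for a GPT-style decoder (biases and LayerNorm ignored)."""
    attn = 4 * d_model * d_model            # W_q, W_k, W_v, W_o
    mlp = 2 * d_model * (4 * d_model)       # up- and down-projection, hidden size 4 * d_model
    blocks = n_layers * (attn + mlp)        # = 12 * n_layers * d_model^2
    embeddings = vocab_size * d_model + n_ctx * d_model  # token table + learned position table
    return blocks + embeddings


total = gpt_param_estimate(n_layers=96, d_model=12288, vocab_size=50257, n_ctx=2048)
print(f"{total / 1e9:.1f}B parameters")     # prints 174.6B parameters
```

Almost all of the weight sits in the transformer blocks; the embedding tables contribute well under 1% at this size.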

Historical context

In June 2018 OpenAI published 'Improving Language Understanding by Generative Pre-Training' - the first GPT. Months later Google countered with BERT (encoder-only, masked language modeling); OpenAI stayed with the decoder-only, causal-attention, generative pre-training recipe. GPT-1: 117M parameters, 12 layers. A year later - GPT-2 (1.5B), considered so 'dangerous' that OpenAI initially withheld the full weights. In 2020 - GPT-3 (175B): few-shot learning without fine-tuning. In 2022 InstructGPT added RLHF, and at the end of that year ChatGPT brought the technology to a mass audience. In 5 years one architectural choice (decoder-only + scale) became an industrial revolution.

Decoder-only: autoregressive architecture

GPT uses a **decoder-only** Transformer, unlike the original encoder-decoder architecture (Vaswani et al., 2017). No encoder and no cross-attention - just a stack of decoder blocks. The task: autoregressive language modeling - predict the next token from the previous ones. Pre-training: minimize cross-entropy loss over next-token prediction on hundreds of billions to trillions of tokens.
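The next-token objective comes down to a one-position shift between predictions and targets. A minimal NumPy sketch, assuming the model has already produced a `[T, V]` array of logits (the function and variable names are illustrative):

```python
import numpy as np


def next_token_loss(logits, tokens):
    """Mean cross-entropy of predicting tokens[t + 1] from the logits at position t.

    logits: [T, V] array, one row of vocabulary scores per input position.
    tokens: [T] integer array, the input sequence itself.
    """
    preds = logits[:-1]                  # positions 0..T-2 make predictions ...
    targets = tokens[1:]                 # ... for tokens 1..T-1 (shift by one)
    preds = preds - preds.max(axis=-1, keepdims=True)             # numerical stability
    log_probs = preds - np.log(np.exp(preds).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


tokens = np.array([3, 1, 4, 1, 5])
vocab_size = 8

# A model that knows nothing (uniform logits) scores log(V).
uniform_loss = next_token_loss(np.zeros((len(tokens), vocab_size)), tokens)

# A model that puts all its mass on the correct next token scores ~0.
confident = np.full((len(tokens), vocab_size), -50.0)
confident[np.arange(len(tokens) - 1), tokens[1:]] = 50.0
confident_loss = next_token_loss(confident, tokens)
```

The last position has no target inside the sequence, which is why its logits are dropped.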

**Pre-norm vs Post-norm**: the original Transformer (2017) and GPT-1 used Post-norm (LayerNorm after the residual addition). From GPT-2 onward, GPT uses **Pre-norm** (LayerNorm before attention/MLP, inside the residual branch). Pre-norm is more stable for training deep models: the residual path stays an identity, so gradients pass through many layers without vanishing or exploding. **Weight tying**: the token embedding and LM head share the weight matrix - fewer parameters and, in practice, often better generalization.
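The difference between the two normalization placements, and what weight tying means in code, fits in a few lines. This is a toy NumPy sketch: `layer_norm` omits the learned scale and shift, and `sublayer` is a hypothetical stand-in for attention or the MLP.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without the learned scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def post_norm_step(x, sublayer):
    # Post-norm: normalize after adding the residual.
    return layer_norm(x + sublayer(x))


def pre_norm_step(x, sublayer):
    # Pre-norm: normalize the sublayer input, leave the residual path untouched.
    return x + sublayer(layer_norm(x))


rng = np.random.default_rng(0)
vocab_size, d_model = 50, 16
W_E = rng.normal(size=(vocab_size, d_model))        # token embedding table
W_mlp = rng.normal(size=(d_model, d_model)) * 0.1


def sublayer(h):
    return np.tanh(h @ W_mlp)                       # toy stand-in for attention or the MLP


tokens = np.array([7, 3, 3, 9])
x = W_E[tokens]                                     # embed: look up rows of W_E
h = pre_norm_step(x, sublayer)
logits = layer_norm(h) @ W_E.T                      # weight tying: the LM head reuses W_E
```

With pre-norm, a sublayer that outputs zeros leaves `x` unchanged, which is the identity path that keeps deep stacks trainable; with post-norm even that case rescales `x`.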

Why does GPT use a decoder-only architecture instead of a full encoder-decoder (like T5 or BART)?

Causal Self-Attention: masking the future

Causal (masked) self-attention is the key difference between decoder and encoder. Each token can only 'see' previous tokens (and itself), not future ones. This is implemented via a **causal mask**: the entries strictly above the diagonal of the [T, T] attention score matrix are set to -inf before softmax. After softmax, the corresponding weights become exactly 0.
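The masking step can be sketched as single-head attention in NumPy. This is an illustration under simplifying assumptions (one head, no learned projections, no batching); the names are not taken from any particular library.

```python
import numpy as np


def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q, k, v):
    """Single-head causal self-attention. q, k, v: [T, d]."""
    T, d = q.shape
    scores = q @ k.T / np.sqrt(d)                        # [T, T] raw scores
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True strictly above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to future positions
    weights = softmax(scores)                            # masked entries become exactly 0
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, weights = causal_attention(x, x, x)
```

Because row t only mixes values from positions 0..t, changing a later token cannot change any earlier output. That property is what allows all T next-token predictions to be trained in one pass.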

**Flash Attention** (Dao et al., 2022) - memory-efficient exact attention (it is not an approximation): instead of materializing the full [T, T] matrix in HBM, it computes attention in tiles that fit in on-chip SRAM, combining them with a running softmax. Extra memory: O(T) instead of O(T^2). Speed: 2-4x faster than the standard implementation, mostly from fewer HBM reads and writes. This helped make 100K+ token contexts practical.
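The tiling idea relies on an online softmax: keep a running maximum and running denominator per query, and rescale the partial output whenever a new tile raises the maximum. The NumPy sketch below shows only that arithmetic. It is not the CUDA kernel: the real implementation also tiles the queries, skips fully masked tiles, and manages SRAM explicitly. Function names and the `block` parameter are illustrative.

```python
import numpy as np


def reference_causal_attention(q, k, v):
    """Standard version that materializes the full [T, T] score matrix."""
    T, d = q.shape
    s = q @ k.T / np.sqrt(d)
    s = np.where(np.triu(np.ones((T, T), dtype=bool), k=1), -np.inf, s)
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v


def tiled_causal_attention(q, k, v, block=4):
    """Same result, but only [T, block] score tiles are ever formed."""
    T, d = q.shape
    out = np.zeros((T, v.shape[1]))
    row_max = np.full(T, -np.inf)            # running max of scores per query
    row_sum = np.zeros(T)                    # running softmax denominator per query
    q_pos = np.arange(T)[:, None]
    for start in range(0, T, block):         # tiles are visited left to right
        k_pos = np.arange(start, min(start + block, T))[None, :]
        s = q @ k[start:start + block].T / np.sqrt(d)
        s = np.where(k_pos > q_pos, -np.inf, s)             # causal mask inside the tile
        new_max = np.maximum(row_max, s.max(axis=1))
        scale = np.exp(row_max - new_max)                   # rescale earlier partial sums
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ v[start:start + block]
        row_max = new_max
    return out / row_sum[:, None]


rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(10, 8)) for _ in range(3))
tiled = tiled_causal_attention(q, k, v, block=4)
full = reference_causal_attention(q, k, v)
```

Both functions return the same values up to floating-point error; only the peak size of the intermediate score array differs.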

Why is a causal mask necessary during training but not conceptually during inference?

KV-Cache: accelerating autoregressive generation

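During autoregressive generation GPT produces one token at a time. Without optimization, every step re-runs the model over the whole prefix, recomputing K and V for all previous tokens: attention costs O(t^2) at step t, so generating T tokens costs O(T^3) in attention alone. **KV-Cache**: store the K and V vectors of previous tokens; when a new token is added, compute only its Q, K and V, append the new K and V to the cache, and attend over the cached entries. Each step then costs O(t), or O(T^2) for the whole sequence.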
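A single attention layer is enough to show that the cached and uncached paths give identical results. In this NumPy sketch the hidden states `xs` arrive one at a time; the projection matrices and helper names are illustrative.

```python
import numpy as np


def attend(q, K, V):
    """Attention of one query vector over a set of keys and values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V


rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
xs = rng.normal(size=(6, d))             # hidden states of 6 tokens, arriving one by one

# With a cache: project only the newest token, append, attend over the cache.
K_cache, V_cache, cached_out = [], [], []
for x in xs:
    K_cache.append(x @ W_k)
    V_cache.append(x @ W_v)
    cached_out.append(attend(x @ W_q, np.array(K_cache), np.array(V_cache)))

# Without a cache: re-project the whole prefix at every step.
naive_out = []
for t in range(1, len(xs) + 1):
    prefix = xs[:t]
    naive_out.append(attend(prefix[-1] @ W_q, prefix @ W_k, prefix @ W_v))
```

The cached loop performs 6 key projections in total, the naive loop 1 + 2 + ... + 6 = 21. In a multi-layer model the same trick works at every layer, because causal masking guarantees that the hidden states of earlier tokens never change when a new token is appended.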

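KV-cache memory grows as 2 * batch_size * T * n_layers * n_kv_heads * head_dim values (the factor 2 covers K and V). For a model shaped like LLaMA-2-70B (80 layers, 64 heads, head_dim 128) with seq_len=4096 and batch_size=32, full multi-head attention in fp16 would need ~344GB - more than twice the ~140GB of fp16 model weights. Optimizations: **Grouped Query Attention** (GQA) lets groups of Q heads share one KV pair; **Multi-Query Attention** (MQA) uses a single KV head for all Q heads. LLaMA-2-70B itself uses GQA with 8 KV heads, which cuts the same cache to ~43GB.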
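The memory estimate above can be reproduced with a few lines of arithmetic. The sketch assumes LLaMA-2-70B's published shape (80 layers, 64 query heads, head_dim 128, 8 KV heads under GQA) and 2 bytes per value (fp16); the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes held by the KV-cache: K and V for every layer, KV head, position and sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


shape = dict(n_layers=80, head_dim=128, seq_len=4096, batch_size=32)
mha = kv_cache_bytes(n_kv_heads=64, **shape)    # one KV head per query head
gqa = kv_cache_bytes(n_kv_heads=8, **shape)     # 8 query heads share each KV head
mqa = kv_cache_bytes(n_kv_heads=1, **shape)     # all query heads share one KV head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 1e9:.1f} GB")
# prints:
# MHA: 343.6 GB
# GQA: 42.9 GB
# MQA: 5.4 GB
```

The cache shrinks in direct proportion to the number of KV heads, while the number of query heads, and with it most of the attention compute, stays the same.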

Why is KV-cache unnecessary during training but critical at inference?

RoPE: Rotary Position Embedding

GPT-2 uses absolute learned positional embeddings, capped at the maximum training length. **RoPE** (Rotary Position Embedding, Su et al., 2021) encodes position by rotating Q and K vectors: the dot product of the rotated Q_i and K_j depends only on the position difference (i - j). Because nothing is tied to a fixed table of positions, RoPE can be extended beyond the training length, although in practice quality degrades at unseen distances unless a scaling technique is applied.
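The relative-position property is easy to check numerically. The sketch below rotates consecutive pairs of dimensions by position-dependent angles, using the base of 10000 from the RoPE paper; it handles a single vector, and the pairing layout is one of several used in practice.

```python
import numpy as np


def rope(x, pos, base=10000.0):
    """Rotate consecutive pairs of x by angles pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)     # one frequency per pair, shape [d/2]
    angle = pos * inv_freq
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin                  # 2D rotation of each (x1, x2) pair
    out[1::2] = x1 * sin + x2 * cos
    return out


rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

near = rope(q, 5) @ rope(k, 2)           # positions 5 and 2, offset 3
far = rope(q, 103) @ rope(k, 100)        # positions 103 and 100, same offset
other = rope(q, 5) @ rope(k, 3)          # offset 2 gives a different score
```

`near` and `far` agree up to floating-point error because only the offset enters the score, and the rotation leaves vector norms unchanged.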

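RoPE is used in LLaMA, Mistral, Qwen, Gemma, and most modern open-source LLMs. **YaRN** (Yet another RoPE extensioN) and **LongRoPE** scale RoPE to 128K+ tokens by rescaling the rotation frequencies, typically followed by a short fine-tune on long sequences. LLaMA-3.1 follows the same pattern: it was pre-trained mostly at 8K context and then extended to 128K with RoPE frequency scaling plus continued training on long sequences.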
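The simplest form of frequency rescaling is linear position interpolation: divide every position by the extension factor, so that a 128K-token context maps onto the angle range seen at 8K. The sketch below shows only that variant, as an assumption for illustration. YaRN and LongRoPE scale different frequencies by different amounts, which is not reproduced here.

```python
import numpy as np


def rope_angles(pos, d=64, base=10000.0, scale=1.0):
    """Rotation angles at position pos; scale > 1 compresses positions (linear interpolation)."""
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    return (pos / scale) * inv_freq


train_len, target_len = 8192, 131072
scale = target_len / train_len                         # 16x context extension

trained_limit = rope_angles(train_len)                 # upper bound of angles seen in training
raw = rope_angles(target_len - 1)                      # unscaled: far outside the trained range
interp = rope_angles(target_len - 1, scale=scale)      # interpolated: back inside it
```

The price is resolution: neighbouring tokens now differ by 1/16 of the original angle step, which is one reason a fine-tune on long sequences usually follows.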

**Common misconception**: GPT generates the entire response in parallel, like an encoder processing input text

**In reality**: GPT generates one token at a time, sequentially - each token depends on all previous ones; positions are processed in parallel only during training and when reading the prompt, not while generating

**Why it matters**: generating T tokens takes T sequential forward passes; this is the main bottleneck for LLM latency and the reason KV-cache is critical
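The sequential dependency shows up directly in the decoding loop. In this sketch `toy_logits` is a hypothetical stand-in for a full GPT forward pass, and greedy argmax replaces whatever sampling strategy a real system would use.

```python
import numpy as np


def toy_logits(tokens, vocab_size=10):
    """Hypothetical stand-in for a GPT forward pass: returns next-token logits."""
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0     # toy rule: favor (last token + 1)
    return logits


def generate(prompt, n_new, forward=toy_logits):
    tokens, n_passes = list(prompt), 0
    for _ in range(n_new):                   # one forward pass per generated token
        logits = forward(tokens)             # depends on every token produced so far
        n_passes += 1
        tokens.append(int(np.argmax(logits)))   # greedy choice; sampling has the same structure
    return tokens, n_passes


tokens, n_passes = generate([3, 4], n_new=5)
```

Step t cannot start until step t - 1 has chosen its token, so the loop cannot be parallelized across output positions; a KV-cache makes each pass cheaper but does not remove the sequential chain.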

What is the main advantage of RoPE over the absolute positional embeddings used in GPT-2?

Related Topics

The GPT architecture sits in the middle of the generative AI pipeline:

  • Tokenization: BPE, SentencePiece — The embedding layer maps token IDs to vectors; vocabulary size sets the embedding matrix dimensions
  • LLM Training: Pre-training — This decoder-only architecture is what gets trained on trillions of tokens via next-token prediction
  • Language Models: from n-gram to GPT — Causal attention generalizes the n-gram idea of conditioning on prior context, with a range limited only by the context window

Key ideas

  • **Decoder-only**: stack of transformer blocks with pre-norm, weight tying, autoregressive generation one token at a time
  • **Causal attention**: upper triangular mask blocks future tokens - necessary for parallel training on sequences
  • **KV-Cache**: storing K/V vectors of previous tokens - attention cost per generated token is O(T) instead of O(T^2)
  • **RoPE**: rotating Q/K through sin/cos - relative positioning and a basis for extending context beyond training length

Questions for reflection

  • Why does weight tying (shared weights for token embeddings and LM head) improve generalization rather than cause underfitting?
  • How does Grouped Query Attention (GQA) reduce KV-cache memory consumption at inference without significant quality loss?
  • How does RoPE via frequency interpolation/extrapolation (YaRN) allow the model to work with contexts longer than it was trained on?

Related lessons

  • gai-03 — Encoder-decoder architecture and attention basics - the foundation GPT builds on
  • gai-05 — Fine-tuning and RLHF operate on top of the pretrained GPT architecture
  • aie-03-llm-fundamentals — Understanding decoder-only architecture is directly needed for LLM API work and prompt engineering
  • dl-03 — GPT extends Transformer blocks from dl-03: same Q/K/V attention and MLP blocks, but with causal mask
  • nlp-04 — Token embeddings in GPT are analogous to Word2Vec - dense vectors, but contextual rather than static
  • dl-01