Computer Architecture

GPU Architecture: SIMD and Massive Parallelism

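ChatGPT can answer a query in about a second; without a GPU, the same computation could take on the order of ten minutes. Computationally, a neural network is mostly a long sequence of multiplications on enormous matrices, and a GPU is in effect a matrix multiplication machine with thousands of cores. That is why GPU architecture is the foundation of the AI revolution.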

GPT-4: reportedly trained on thousands of NVIDIA A100 GPUs
Stable Diffusion: image generation in about a second on a GPU (versus minutes on a CPU)
3D rendering: real-time ray tracing on GPU
Cryptocurrency mining: massively parallel hashing

CPU vs GPU: Different Philosophies

The **CPU** is optimized for sequential, low-latency tasks: 8-64 complex cores, deep out-of-order execution, large caches. The **GPU** is optimized for throughput: thousands of simple cores, simple pipeline, high memory bandwidth.

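**Amdahl's Law:** If a fraction s of a program is strictly sequential, the speedup on N parallel workers is at most 1 / (s + (1 - s)/N). With s = 5%, the maximum speedup is 1/0.05 = 20×, regardless of how many GPU cores you add. The bottleneck is the sequential CPU code.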
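The 20× ceiling can be checked numerically. The sketch below is a minimal illustration, not a performance model of any real GPU; the function name `amdahl_speedup` and the worker counts are assumptions chosen for the example.

```python
def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Upper bound on speedup when only the parallel part scales with `workers`."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # The serial part runs at the original speed; the parallel part is divided
    # evenly among the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical worker counts, from a multicore CPU up to a large GPU.
    for n in (8, 64, 1024, 16384):
        print(f"{n:>6} workers -> {amdahl_speedup(0.05, n):5.2f}x")
```

Going from 1,024 to 16,384 workers barely moves the result, because the 5% sequential part already dominates the runtime.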

Why does a GPU have far greater memory bandwidth than a CPU?

SIMT and Warps

**SIMT (Single Instruction, Multiple Threads)** - all threads in a group (warp) execute the same instruction but operate on different data. This is the core idea behind GPU programming.

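**Warp Divergence:** If some threads in a warp take the `if` branch and others take `else`, the hardware executes both branches one after the other, masking off the threads that are inactive on each path. Where possible, make the branch condition uniform across a whole warp, for example by branching on the warp or block index rather than on the individual thread index.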
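The cost of divergence can be sketched with a toy lockstep model. This is a simplification under stated assumptions: each path has a fixed cost in cycles, a path is issued once for the whole warp if at least one lane needs it, and `warp_issue_cycles` is a hypothetical helper, not a CUDA API.

```python
WARP_SIZE = 32  # warp width on NVIDIA GPUs


def warp_issue_cycles(takes_if, if_cost, else_cost):
    """Cycles one warp spends on an if/else under a simple lockstep model.

    takes_if: one boolean per lane; True means that lane takes the `if` path.
    """
    if len(takes_if) != WARP_SIZE:
        raise ValueError(f"expected {WARP_SIZE} lanes")
    cycles = 0
    if any(takes_if):        # at least one lane needs the `if` path
        cycles += if_cost
    if not all(takes_if):    # at least one lane needs the `else` path
        cycles += else_cost
    return cycles


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE                       # every lane agrees
    split = [lane < 16 for lane in range(WARP_SIZE)]   # half and half
    single = [lane == 0 for lane in range(WARP_SIZE)]  # one lane disagrees
    for name, mask in (("uniform", uniform), ("split", split), ("single", single)):
        print(name, warp_issue_cycles(mask, if_cost=10, else_cost=10))
```

In this model a single disagreeing lane costs as much as an even split: what matters is whether the warp is uniform, not how many lanes take each side.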

A warp of 32 threads splits 50/50 across an if/else. How long does this take relative to a warp without branching?

GPU Memory Hierarchy

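**GPUs have their own memory hierarchy**, optimized for high throughput: per-thread registers, on-chip shared memory that is shared by a thread block and managed explicitly, L1/L2 caches, and large off-chip global memory (HBM or GDDR). Using each level correctly is the key to GPU performance.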
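One part of using memory correctly is coalescing: when the threads of a warp read neighboring addresses, the hardware can serve them with few memory transactions. The sketch below counts how many aligned segments a warp touches for different strides. The 128-byte segment size is an assumption for illustration (real granularity varies by architecture), and both helper functions are hypothetical.

```python
SEGMENT_BYTES = 128  # assumed transaction size; differs between GPU generations


def strided_addresses(stride_elems, elem_bytes=4, lanes=32, base=0):
    """Byte address read by each lane when lane i reads element i * stride_elems."""
    return [base + lane * stride_elems * elem_bytes for lane in range(lanes)]


def transactions_per_warp(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments touched by one warp-wide load."""
    return len({addr // segment_bytes for addr in byte_addresses})


if __name__ == "__main__":
    for stride in (1, 2, 32):
        n = transactions_per_warp(strided_addresses(stride))
        print(f"stride {stride:>2}: {n} transaction(s) for 32 four-byte loads")
```

With stride 1 the 32 loads fit in one segment; with stride 32 every lane lands in its own segment, so the same amount of useful data costs 32 times the memory traffic in this model.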

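**GPU for ML:** The computation in Transformers (GPT, BERT) is dominated by matrix multiplications. Tensor Cores in NVIDIA GPUs are dedicated matrix units: a first-generation (Volta) Tensor Core performs a 4×4×4 matrix multiply-accumulate, D = A×B + C, per clock, and later generations work on larger tiles. The H100 (SXM) is rated at up to 3,958 TFLOPS for FP8, a peak figure that assumes structured sparsity; the dense figure is about half that. This throughput is exactly why AI demands GPUs.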
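To make the Tensor Core operation concrete, here is a plain-Python sketch of a 4×4 matrix multiply-accumulate, D = A×B + C, together with its operation count. It only illustrates the arithmetic; `mma_4x4` is a hypothetical name, and real Tensor Cores use reduced-precision inputs and are reached through CUDA libraries or intrinsics rather than code like this.

```python
N = 4  # tile edge of the matrix multiply-accumulate being illustrated

# One N x N x N multiply-accumulate needs N**3 fused multiply-adds,
# and each fused multiply-add counts as two floating-point operations.
FMAS_PER_MMA = N ** 3
FLOPS_PER_MMA = 2 * FMAS_PER_MMA


def mma_4x4(a, b, c):
    """Return D = A x B + C for 4x4 matrices given as lists of rows."""
    return [
        [c[i][j] + sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)]
        for i in range(N)
    ]


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    b = [[float(i * N + j) for j in range(N)] for i in range(N)]
    ones = [[1.0] * N for _ in range(N)]
    for row in mma_4x4(identity, b, ones):
        print(row)
    print("FLOPs per multiply-accumulate:", FLOPS_PER_MMA)
```

A scalar core would issue the 64 multiply-adds as separate instructions; a matrix unit handles the whole tile as one operation, which is where the throughput gap comes from.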

**Common misconception:** A GPU is faster than a CPU because it has more cores. The more FP32 units, the faster any program runs.

**In reality:** GPU performance is usually bottlenecked by the memory hierarchy, not by core count. HBM bandwidth, shared memory usage, coalesced access, and register pressure determine achieved FLOPS more than peak compute does. Tensor Cores accelerate one specific matrix multiply-accumulate pattern; code that does not fit it falls back to the regular CUDA cores at a fraction of the peak throughput.

The "more parallelism equals more speed" intuition ignores the memory wall. A naive CUDA kernel often achieves only 5-10% of advertised TFLOPS because warps stall on global-memory loads, which take hundreds of cycles (roughly 400-800). Roofline analysis shows that many ML operations (elementwise ops, normalization, softmax, small-batch inference) are memory-bound rather than compute-bound, even though large dense matrix multiplications can reach the compute limit. That is why FlashAttention restructures attention to minimize HBM traffic and runs about 2-4x faster without any additional hardware FLOPS.
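The roofline argument reduces to one formula: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below applies it to a hypothetical device with 100 TFLOPS peak and 2 TB/s of bandwidth. These numbers and both function names are assumptions for illustration, not the specification of any real GPU.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity)."""
    # TB/s * FLOP/byte = TFLOP/s, so the two terms share a unit.
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


def ridge_point(peak_tflops, bandwidth_tb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tb_s


if __name__ == "__main__":
    PEAK, BW = 100.0, 2.0  # hypothetical device: 100 TFLOPS, 2 TB/s

    # FP32 elementwise add c[i] = a[i] + b[i]: 1 FLOP per 12 bytes
    # (two 4-byte reads plus one 4-byte write).
    add_intensity = 1 / 12
    print("ridge point (FLOP/byte):", ridge_point(PEAK, BW))
    print("elementwise add bound (TFLOPS):", attainable_tflops(PEAK, BW, add_intensity))
    print("high-intensity kernel bound (TFLOPS):", attainable_tflops(PEAK, BW, 200.0))
```

On this hypothetical device the elementwise add is capped far below 1% of peak no matter how many cores are idle, which is the sense in which such kernels are memory-bound.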

Why use shared memory in CUDA kernels instead of accessing global memory directly?
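One way to reason about this question is to count global-memory reads for an n×n matrix multiplication with and without tiling through shared memory. The sketch below is a counting model only: it ignores caches, assumes n is a multiple of the tile size, and uses hypothetical function names.

```python
def global_loads_naive(n):
    """Each of the n*n outputs reads a full row of A and a full column of B."""
    return n * n * 2 * n


def global_loads_tiled(n, tile):
    """Each thread block stages tile x tile pieces of A and B in shared memory."""
    if n % tile != 0:
        raise ValueError("n must be a multiple of tile in this simple model")
    blocks = (n // tile) ** 2                 # one block per output tile
    steps = n // tile                         # tile pairs walked along the k dimension
    loads_per_step = 2 * tile * tile          # one tile of A plus one tile of B
    return blocks * steps * loads_per_step


if __name__ == "__main__":
    n, tile = 1024, 32
    naive = global_loads_naive(n)
    tiled = global_loads_tiled(n, tile)
    print(f"naive: {naive} element loads")
    print(f"tiled: {tiled} element loads")
    print(f"reduction factor: {naive // tiled}")
```

In this model the traffic drops by a factor equal to the tile size, because every element fetched from global memory is reused by `tile` threads out of fast on-chip storage.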

Key Takeaways

GPU: thousands of simple cores vs dozens of complex ones in a CPU
SIMT: 32 threads (warp) execute one instruction with different data
Warp Divergence: divergent branching within a warp serializes both paths and wastes lanes
Shared memory: fast on-chip memory for block data (explicitly managed)
Coalesced access: neighboring threads should access neighboring addresses so the loads merge into few memory transactions

Questions for Reflection

Why is a GPU poorly suited for tasks with irregular memory access (e.g., graph traversal)?
How do Tensor Cores differ from regular CUDA cores? Why are they important for AI?
What is occupancy in CUDA, and why doesn't higher occupancy always mean better performance?

Related Lessons

ca-14 — Multicore CPU fundamentals before SIMD at GPU scale
arch-14-multicore — GPU pushes parallelism to thousands of lightweight threads
arch-16-multicore-programming — CUDA/OpenCL is the application layer on top of SIMT
ml-01-intro — All deep learning depends on GPU matrix computation
par-01