Computer Architecture
GPU Architecture: SIMD and Massive Parallelism
ChatGPT answers your query in about a second. Without GPUs, the same computation would take on the order of minutes. Neural networks are, at their core, enormous matrices, and a GPU is a matrix multiplication machine with thousands of cores. GPU architecture is the foundation of the AI revolution.
- GPT-4 was reportedly trained on thousands of NVIDIA A100 GPUs
- Stable Diffusion: image generation in about a second on a high-end GPU (vs minutes on a CPU)
- 3D rendering: real-time raytracing on GPU
- Cryptocurrency mining - massively parallel hashing
CPU vs GPU: Different Philosophies
The **CPU** is optimized for sequential, low-latency tasks: 8-64 complex cores, deep out-of-order execution, large caches. The **GPU** is optimized for throughput: thousands of simple cores, simple pipeline, high memory bandwidth.
**Amdahl's Law:** If 5% of a program is strictly sequential, the maximum speedup is 20×, regardless of how many GPU cores you add. The bottleneck is sequential CPU code.
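The 20× ceiling follows directly from the formula speedup = 1 / (s + (1 - s) / N), where s is the sequential fraction and N is the number of parallel workers. The sketch below is a minimal illustration of that arithmetic; the function name `amdahl_speedup` and the worker counts in the loop are chosen here for the example only.

```python
import math


def amdahl_speedup(serial_fraction: float, workers: float) -> float:
    """Speedup predicted by Amdahl's Law for a fixed-size workload."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # The serial part is never sped up; the parallel part is split across workers.
    denominator = serial_fraction + (1.0 - serial_fraction) / workers
    if denominator == 0.0:
        # Fully parallel program on infinitely many workers.
        return math.inf
    return 1.0 / denominator


if __name__ == "__main__":
    # Illustrative worker counts; math.inf models "as many cores as you like".
    for n in (8, 1024, 16384, math.inf):
        print(f"{n:>8} workers -> {amdahl_speedup(0.05, n):.2f}x")
```

With s = 0.05 the result approaches 20× as the worker count grows but never exceeds it, which is why shrinking the sequential part often pays off more than adding cores.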
Why does a GPU have far greater memory bandwidth than a CPU?
SIMT and Warps
**SIMT (Single Instruction, Multiple Threads)** - all threads in a group (warp) execute the same instruction but operate on different data. This is the core idea behind GPU programming.
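**Warp Divergence:** If some threads in a warp take the `if` branch and others take `else`, the warp executes both branches one after the other, masking off the threads that are inactive on each path. Where possible, make the branch condition uniform across each warp, for example by branching on the warp or block index rather than on per-thread data.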
A warp of 32 threads splits 50/50 across an if/else. How long does this take relative to a warp without branching?
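The cost of divergence described above can be modeled with a few lines of Python. This is a simplified sketch, not a hardware simulator: it assumes each branch body has a fixed instruction count (`if_len` and `else_len` are hypothetical costs) and that the warp pays for a branch whenever at least one of its threads takes it.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def warp_issue_slots(takes_if, if_len, else_len):
    """Count instruction issue slots one warp spends on an if/else.

    takes_if: one boolean per thread, True if that thread takes the `if` side.
    if_len, else_len: instructions in each branch body (hypothetical costs).
    """
    if len(takes_if) != WARP_SIZE:
        raise ValueError("expected one flag per thread in the warp")
    slots = 0
    if any(takes_if):
        # At least one thread is active on the `if` path; the rest are masked.
        slots += if_len
    if not all(takes_if):
        # At least one thread is active on the `else` path.
        slots += else_len
    return slots


if __name__ == "__main__":
    uniform = [True] * WARP_SIZE
    split = [lane < WARP_SIZE // 2 for lane in range(WARP_SIZE)]
    print(warp_issue_slots(uniform, 10, 10))  # 10
    print(warp_issue_slots(split, 10, 10))    # 20
```

Under this model a single stray thread costs as much as a 50/50 split, because the warp still has to walk through both paths.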
GPU Memory Hierarchy
**GPUs have their own memory hierarchy**, optimized for high throughput. Using memory correctly is the key to GPU performance.
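One concrete aspect of using memory correctly is coalescing: when the threads of a warp read neighboring addresses, the hardware can serve them with few memory transactions. The sketch below counts how many memory segments a warp touches for a given access stride. It is a simplified model under stated assumptions: the 128-byte segment size is assumed for illustration (real transaction granularity depends on the GPU generation), and `segments_touched` is a hypothetical helper, not a CUDA API.

```python
WARP_SIZE = 32
SEGMENT_BYTES = 128  # assumed transaction granularity; real hardware varies
ELEM_BYTES = 4       # one FP32 value


def segments_touched(stride_elems, base_byte=0):
    """Distinct memory segments touched when lane i loads element i * stride_elems."""
    segments = set()
    for lane in range(WARP_SIZE):
        addr = base_byte + lane * stride_elems * ELEM_BYTES
        segments.add(addr // SEGMENT_BYTES)
    return len(segments)


if __name__ == "__main__":
    print(segments_touched(1))               # 1: fully coalesced
    print(segments_touched(2))               # 2: half of each segment is wasted
    print(segments_touched(32))              # 32: one segment per thread
    print(segments_touched(1, base_byte=4))  # 2: misaligned start spills over
```

In this model a stride of 32 elements moves 32 times more data than the warp actually uses, which is the kind of pattern that makes column-wise access to a row-major matrix slow.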
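**GPU for ML:** Transformers (GPT, BERT) are essentially matrix multiplications. Tensor Cores in NVIDIA GPUs are dedicated matrix units: in the first generation (Volta), each one performs a 4×4 matrix multiply-accumulate per cycle. H100: up to 3,958 TFLOPS for FP8 with structured sparsity (about half that for dense matrices) - which is exactly why AI demands GPUs.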
**Common misconception:** A GPU is faster than a CPU because it has more cores. The more FP32 units, the faster any program runs.
**In reality:** GPU performance is usually bottlenecked by the memory hierarchy, not core count. HBM bandwidth, shared memory use, coalesced access, and register pressure determine achieved FLOPS more than peak compute does. Tensor Cores accelerate one specific matmul pattern; work outside that pattern falls back to the regular FP32 units at much lower throughput.
**Why:** The "more parallelism equals more speed" intuition ignores the memory wall. A naive CUDA kernel often achieves only 5-10% of advertised TFLOPS because warps stall on global-memory loads, which take hundreds of cycles (roughly 400-800). Roofline analysis shows that many ML workloads, such as attention, normalization, and elementwise layers, are memory-bound rather than compute-bound. That is why FlashAttention restructures attention to minimize HBM traffic and runs 2-4x faster on the same hardware.
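The roofline model mentioned above fits in one line: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The sketch below applies it to a hypothetical device with 100 TFLOPS peak and 2 TB/s of bandwidth. Those two numbers are assumptions picked for round arithmetic, not the specification of any real GPU, and the matmul intensity assumes ideal data reuse, which a naive kernel does not reach.

```python
def attainable_tflops(peak_tflops, bandwidth_tb_s, flops_per_byte):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)


def ridge_point(peak_tflops, bandwidth_tb_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_tflops / bandwidth_tb_s


def vector_add_intensity(elem_bytes=4):
    # c[i] = a[i] + b[i]: 1 FLOP for two loads and one store.
    return 1.0 / (3 * elem_bytes)


def matmul_intensity(n, elem_bytes=4):
    # n x n matmul with ideal reuse: 2*n^3 FLOPs over three n x n matrices.
    return (2.0 * n ** 3) / (3 * n * n * elem_bytes)


if __name__ == "__main__":
    PEAK, BW = 100.0, 2.0  # hypothetical device: TFLOPS and TB/s
    print("ridge point:", ridge_point(PEAK, BW), "FLOP/byte")
    print("vector add :", attainable_tflops(PEAK, BW, vector_add_intensity()), "TFLOPS")
    print("matmul 4096:", attainable_tflops(PEAK, BW, matmul_intensity(4096)), "TFLOPS")
```

On this hypothetical device the elementwise kernel is capped at a fraction of one percent of peak no matter how many cores are available, while the large matmul only reaches the compute roof if its data is actually reused on chip.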
Why use shared memory in CUDA kernels instead of accessing global memory directly?
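One way to reason about this question is to count global-memory loads for an n × n matrix multiply. The sketch below compares a naive kernel, where every output element reads its full row and column from global memory, with a tiled kernel that stages `tile` × `tile` blocks in shared memory so each loaded value is reused by a whole thread block. It is a counting model only, assuming n is a multiple of the tile size and ignoring hardware caches; the function names are hypothetical.

```python
def global_loads_naive(n):
    """Each of the n*n outputs reads n elements of A and n elements of B."""
    return 2 * n ** 3


def global_loads_tiled(n, tile):
    """Each thread block stages one tile of A and one tile of B per step."""
    if n % tile != 0:
        raise ValueError("this sketch assumes n is a multiple of tile")
    blocks = (n // tile) ** 2         # one block per output tile
    steps_per_block = n // tile       # tiles visited along the shared dimension
    loads_per_step = 2 * tile * tile  # one tile of A plus one tile of B
    return blocks * steps_per_block * loads_per_step


if __name__ == "__main__":
    n, tile = 1024, 16
    naive = global_loads_naive(n)
    tiled = global_loads_tiled(n, tile)
    print(naive // tiled)  # 16
```

In this model the reduction in global-memory traffic equals the tile width, so larger tiles help until shared-memory capacity and register pressure limit how many blocks can run at once.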
Key Takeaways
- GPU: thousands of simple cores vs dozens of complex ones in a CPU
- SIMT: 32 threads (warp) execute one instruction with different data
- Warp Divergence: divergent branches within a warp execute one after another, a major performance enemy
- Shared memory: fast on-chip memory for block data (explicitly managed)
- Coalesced access: neighboring threads should access neighboring addresses so their loads combine into few memory transactions
Related Topics
GPU architecture complements the CPU - heterogeneous computing.
- I/O and DMA — GPUs use PCIe DMA to transfer data between CPU and GPU
- Memory Hierarchy — GPUs have their own multi-level memory hierarchy
Questions for Reflection
- Why is a GPU poorly suited for tasks with irregular memory access (e.g., graph traversal)?
- How do Tensor Cores differ from regular CUDA cores? Why are they important for AI?
- What is occupancy in CUDA, and why doesn't higher occupancy always mean better performance?
Related Lessons
- ca-14 — Multicore CPU fundamentals before SIMD at GPU scale
- arch-14-multicore — GPU pushes parallelism to thousands of lightweight threads
- arch-16-multicore-programming — CUDA/OpenCL is the application layer on top of SIMT
- ml-01-intro — All deep learning depends on GPU matrix computation
- par-01