Computer Architecture
Memory Hierarchy: NUMA, HBM, and Non-Volatile Memory
The AMD MI300X costs USD 15 000. Its main advantage over the NVIDIA H100 is not compute. It is 192 GB of HBM3 memory against 80 GB. A 70-billion-parameter LLM fits entirely on one device: at 16-bit precision its weights take about 140 GB, so it needs no quantization and no offloading. Memory has become the currency of the AI era.
- NVIDIA H100: 3.35 TB/s HBM3 bandwidth - attention is limited by memory traffic, so FlashAttention gets its 4x speedup over standard attention by cutting HBM reads and writes, not from extra FLOPS
- AMD EPYC 9654 dual-socket: 1 TB RAM, cross-NUMA penalty 3.2x - NUMA-aware huge pages cut PostgreSQL latency by 30%
- Intel Optane PMem: SAP HANA in-memory database at 6 TB instead of 1.5 TB DDR - restart without loading from disk
- Meta CXL memory: 1 TB DDR plus 4 TB CXL-DRAM per host - memory pooling across servers
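The claim that a 70-billion-parameter model fits in 192 GB but not in 80 GB can be checked with simple arithmetic. The sketch below assumes 16-bit weights (the text does not state a precision) and counts weights only; KV cache and activations need additional memory. The function name and the precision table are illustrative.

```python
GB = 1e9  # decimal gigabytes, as used in accelerator memory specs


def weight_footprint_gb(params: float, bytes_per_param: float) -> float:
    """Memory needed for the model weights alone, in GB."""
    return params * bytes_per_param / GB


PARAMS = 70e9  # 70-billion-parameter model from the text

# Assumed storage formats; only the 16-bit row matches "no quantization".
for name, nbytes in [("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    need = weight_footprint_gb(PARAMS, nbytes)
    fits_192 = need <= 192  # MI300X capacity from the text
    fits_80 = need <= 80    # H100 capacity from the text
    print(f"{name:10s} {need:6.1f} GB  fits 192 GB: {fits_192}  fits 80 GB: {fits_80}")
```

At 16 bits the weights come to 140 GB, which is why the 80 GB device needs quantization or offloading while the 192 GB device does not.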
NUMA: When the Distance to Memory Matters
A dual-socket AMD EPYC 9654 server (96 cores per socket) is a common datacenter configuration. Half the memory is physically closer to the first processor, half to the second. A thread on a CPU0 core accessing memory attached to CPU1 pays 120 ns instead of 80 ns. That is a 50% overhead - and it is paid constantly when applications are NUMA-unaware.
**NUMA (Non-Uniform Memory Access)** is an architecture where multiple processors each have their own local memory and connect through an interconnect. AMD EPYC connects nodes through Infinity Fabric; Intel Xeon uses UPI (Ultra Path Interconnect). Local memory bandwidth: 460 GB/s. Through the interconnect: 120-200 GB/s. That is roughly a 2-4x gap.
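To see how the local and remote figures combine, here is a minimal sketch that blends the two latencies by the fraction of accesses that go remote and computes the bandwidth gap. The numbers are the ones quoted above; the linear blend is a simplifying assumption that ignores caches, prefetching, and interconnect contention.

```python
LOCAL_NS = 80.0    # local memory latency from the text
REMOTE_NS = 120.0  # cross-socket memory latency from the text


def mean_latency_ns(remote_fraction: float) -> float:
    """Average memory latency when a given share of accesses is remote."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * LOCAL_NS + remote_fraction * REMOTE_NS


def bandwidth_gap(local_gbs: float, remote_gbs: float) -> float:
    """How many times faster local memory bandwidth is than remote."""
    return local_gbs / remote_gbs


for frac in (0.0, 0.25, 0.5, 1.0):
    print(f"{frac:4.0%} remote -> {mean_latency_ns(frac):5.1f} ns average")

# Bandwidth gap at both ends of the quoted interconnect range.
print(f"gap at 200 GB/s: {bandwidth_gap(460, 200):.1f}x")
print(f"gap at 120 GB/s: {bandwidth_gap(460, 120):.1f}x")
```

A NUMA-unaware process that spreads its memory evenly across two sockets sits near the 50% remote case, about 100 ns on average under this model.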
**NUMA in production.** Redis pinned to a single NUMA node (for example, started under numactl with --cpunodebind and --membind) runs 1.3-1.5x faster on NUMA systems. PostgreSQL with huge pages bound to a NUMA node cuts query latency by 30%. Linux automatic NUMA balancing (AutoNUMA), a kernel feature, migrates memory pages toward the accessing CPU, but with multi-second lag - explicit binding through numactl is usually more reliable.
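Explicit binding can also be done from inside a process. The sketch below is Linux-only and rests on a few assumptions: it reads the node's CPU list from sysfs and restricts the calling process to those CPUs with `os.sched_setaffinity`. The helper names (`parse_cpulist`, `node_cpus`, `pin_to_node`) are hypothetical. It sets CPU affinity only; memory then tends to land on the same node because the default Linux policy allocates pages on the node of the CPU that first touches them. Strict memory binding still needs numactl --membind or libnuma.

```python
import os
from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Expand a sysfs CPU list such as '0-3,8-11' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node: int) -> set[int]:
    """CPUs that belong to one NUMA node, as reported by the kernel."""
    path = Path(f"/sys/devices/system/node/node{node}/cpulist")
    return parse_cpulist(path.read_text())


def pin_to_node(node: int) -> set[int]:
    """Restrict the calling process to the CPUs of one NUMA node."""
    cpus = node_cpus(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus


print(sorted(parse_cpulist("0-3,8-11")))  # [0, 1, 2, 3, 8, 9, 10, 11]
```

Calling `pin_to_node(0)` early, before large allocations, gives the process a chance to keep both its threads and its pages on node 0.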