Computer Architecture

Memory Hierarchy: NUMA, HBM, and Non-Volatile Memory

The AMD MI300X costs USD 15 000. Its main advantage over the NVIDIA H100 is not compute. It is 192 GB of unified HBM3 memory against 80 GB. A 70-billion-parameter LLM fits entirely - no quantization, no offloading. Memory has become the currency of the AI era.

  • NVIDIA H100: 3.35 TB/s HBM3 bandwidth - this is what makes FlashAttention 4x faster than standard attention, not extra FLOPS
  • AMD EPYC 9654 dual-socket: 1 TB RAM, cross-NUMA penalty 3.2x - NUMA-aware huge pages cut PostgreSQL latency by 30%
  • Intel Optane PMem: SAP HANA in-memory database at 6 TB instead of 1.5 TB DDR - restart without loading from disk
  • Meta CXL memory: 1 TB DDR plus 4 TB CXL-DRAM per host - memory pooling across servers

NUMA: When the Distance to Memory Matters

A dual-socket AMD EPYC 9654 server (96 cores) sits in every major datacenter. Half the memory is physically closer to the first processor, half to the second. A thread on CPU0 core accessing CPU1 memory pays 120 ns instead of 80 ns. That is a 50% overhead - and this happens constantly when applications are NUMA-unaware.

**NUMA (Non-Uniform Memory Access)** is an architecture where multiple processors have their own local memory and connect through an interconnect. AMD EPYC connects nodes through Infinity Fabric. Intel Xeon through UPI (Ultra Path Interconnect). Local memory bandwidth: 460 GB/s. Through the interconnect: 120-200 GB/s. A 2-4x gap.

**NUMA in production.** Redis with --numa-node-bitmask runs 1.3-1.5x faster on NUMA systems. PostgreSQL with huge pages bound to a NUMA node cuts query latency by 30%. The Linux AutoNUMA daemon migrates memory pages toward the accessing CPU, but with multi-second lag - explicit binding through numactl is always more reliable.

0

1

Sign In

On a dual-socket NUMA server, a thread on CPU0 accesses memory allocated on CPU1. What happens?

HBM: 3D Memory for Bandwidth-Hungry Compute

The NVIDIA H100 computes matrix products faster than memory can feed them. That is exactly why 80 GB HBM3 at 3.35 TB/s matters more than extra TFLOPS. Comparison: DDR5 on CPU peaks at 88 GB/s. HBM3 on H100: 3350 GB/s. A 38x gap. The entire premise of GPU-accelerated ML rests on this bandwidth.

Intel Sapphire Rapids Xeon offers HBM2e directly on the processor package: 64 GB at 1 TB/s. It can operate as an L4 cache or as a separate NUMA node. AMD CDNA3 (Instinct MI300X) goes further: CPU and GPU dies on one package, 192 GB HBM3 unified - no PCIe transfers between host and device. An entire 70B-parameter model fits without quantization.

**Roofline model.** When arithmetic intensity (FLOPS per byte) falls below the bandwidth-limited ridge, the workload is memory-bound. FlashAttention rewrites standard attention to minimize HBM round-trips: tiles stay in SRAM on the SM. Result: 2-4x speedup with zero new FLOPS - purely from reducing HBM traffic.

HBM is 40x faster than DDR5 in bandwidth. Why are ML workloads still memory-bound on H100?

Non-Volatile Memory: Memory That Does Not Forget

2019. Intel ships Optane Persistent Memory based on 3D XPoint: byte-addressable, survives power loss, latency near 300 ns. Not DRAM at 80 ns, but 1000x faster than NVMe SSD at 80 microseconds. A server gets 6 TB of persistent memory instead of 1.5 TB DDR. SAP HANA in-memory database fits entirely in PMem - cold start becomes instantaneous.

**CXL (Compute Express Link)** extends NUMA to the memory pooling era. CXL 3.0 enables sharing memory across multiple servers over PCIe 6.0 physical layer. A memory shelf in a rack becomes a shared pool for dozens of hosts. Meta tests CXL memory for its infrastructure: 1 TB DDR plus 4 TB CXL-DRAM per host. CXL latency: 300-500 ns - higher than local DRAM but incomparably better than NFS.

**NVM in ML pipelines.** PyTorch 2.0 supports memory-mapped tensors through mmap: an NVM volume mounted at /mnt/pmem holds a 1 TB ImageNet dataset. The loader accesses it directly via load instructions - no copy to DRAM, no warm-up phase. Cold start drops from 10 minutes to zero. NVIDIA DGX H100 includes 3.84 TB NVMe as a 'near-storage' tier in the memory hierarchy.

NVM/Optane is just a faster SSD, closer to storage than to memory.

NVM in App Direct mode appears as a NUMA memory node: directly addressed by CPU load/store instructions over the memory bus, not via PCIe/SATA. Latency is 300 ns vs 80 microseconds for NVMe SSD - a 250x difference.

The confusion comes from marketing around 'persistent storage'. NVM in App Direct mode is byte-addressable memory with durability, not a block device. This is why CLWB (cache line write back) instructions are needed for durability guarantees - the same reason one uses memory barriers with DRAM and NVDIMM.

How does DAX mode for NVM differ from a regular mmap file?

Related Topics

Memory hierarchy determines the performance ceiling of the entire stack - from a single core to a distributed cluster.

  • Cache Memory — L1-L3 caches are the upper levels of the same hierarchy that includes NUMA and HBM
  • Multicore Programming — NUMA-aware allocation and thread binding are the application layer built on NUMA architecture
  • AI Accelerators — TPU and NPU design is at the core constrained by memory bandwidth and capacity

Key Ideas

  • NUMA: local memory 80 ns, remote via interconnect 150-300 ns - numactl --membind is critical for production servers
  • HBM: 1024-bit bus, 3D stacking, 3350 GB/s - 40x more bandwidth than DDR5, 60x more expensive, justified for ML
  • Memory-bound: most ML workloads are limited by bandwidth, not FLOPS - FlashAttention solves this algorithmically
  • NVM App Direct: byte-addressable persistent memory over the memory bus, not a block device, 300 ns latency
  • CXL: memory pooling across servers over PCIe physical layer - the next step beyond NUMA

Вопросы для размышления

  • AMD MI300X unifies CPU and GPU dies with shared HBM3. What workloads benefit most, and what new bottlenecks emerge?
  • Linux AutoNUMA automatically migrates pages toward accessing CPUs. Why does this not fully solve the NUMA problem?
  • CXL memory has 300-500 ns latency versus 80 ns for local DRAM. For which workloads is this acceptable, and for which is it critical?

Связанные уроки

  • arch-08-memory-hierarchy — Cache hierarchy is the foundation for understanding NUMA and HBM
  • arch-09-cache — Cache coherence determines NUMA performance characteristics
  • arch-15-gpu-architecture — HBM is used in GPUs too - shared technology across architectures
  • arch-16-multicore-programming — NUMA-aware programming is critical for parallel code performance
  • arch-18-ai-accelerators — HBM and memory bandwidth are the key constraint of AI accelerators
Memory Hierarchy: NUMA, HBM, and Non-Volatile Memory