Computer Architecture
Memory Hierarchy: The Speed Pyramid
Цели урока
- Understand the Memory Wall problem
- Know the levels of the memory hierarchy and their characteristics
- Understand temporal and spatial locality
- Know the basics of how cache works
Предварительные знания
- CPU structure
- Concept of memory
A 4 GHz processor sits idle for 100+ cycles waiting for data from RAM. The memory hierarchy is the strategy to hide that latency.
- Optimizing data structures for cache
- Cache-oblivious algorithms
- Profiling cache misses
- Game development optimization
The Problem: Memory Wall
**Memory Wall:** CPU speed grows faster than memory speed. The gap widens every year.
| Component | Latency | Relative |
|---|---|---|
| Register | ~0.3 ns | 1× |
| L1 cache | ~1 ns | 3× |
| L2 cache | ~4 ns | 12× |
| L3 cache | ~12 ns | 40× |
| RAM (DDR5) | ~50-100 ns | 150-300× |
| SSD NVMe | ~20,000 ns | 60,000× |
| HDD | ~5,000,000 ns | 15,000,000× |
**Cost of stalling:** If data is not in cache, the CPU stalls for 100+ cycles. At 4 GHz that's ~25 ns of wasted time!
The Memory Wall is:
Hierarchy Levels
**Solution:** A hierarchy - small fast memory + large slow memory. Use both wisely!
| Level | Size | Latency | Cost/GB |
|---|---|---|---|
| SRAM (cache) | MB | 1-10 ns | ~$50 |
| DRAM (RAM) | GB | 50-100 ns | ~$5 |
| Flash (SSD) | TB | 10-100 µs | ~$0.10 |
| HDD | TB | 5-10 ms | ~$0.02 |
**Why not all SRAM?** 16 GB of SRAM would cost ~$800,000 (vs $50 for DRAM). Economics dictates the hierarchy!
Why is L1 cache smaller than L3?
Principles of Locality
The hierarchy works thanks to **locality** - patterns in how programs access data.
**Temporal Locality:** If data was accessed recently, it will likely be accessed again soon.
**Spatial Locality:** If data was accessed, nearby data will likely be accessed soon.
| Pattern | Temporal | Spatial | Cache-friendly? |
|---|---|---|---|
| Sequential traversal | Low | High | Yes! |
| Loop on one variable | High | Low | Yes! |
| Random access | Low | Low | No! |
| Matrix traversal by rows | Low | High | Yes! |
| Matrix traversal by columns | Low | Low | No! |
**Cache thrashing:** Random access to a large array kills performance. Every access is a cache miss!
An example of spatial locality:
How Cache Works
**Cache** - a fast copy of frequently-used data from RAM.
**Cache Line:** The cache loads not a single byte but a block (typically 64 bytes). This exploits spatial locality!
**Hit Rate:** A typical L1 hit rate is 95%+. This means only 5% of accesses go to L2.
Cache simply speeds up all memory accesses
Cache only speeds up repeated accesses (temporal) and neighboring data (spatial).
With random access, every access is a cache miss. The cache doesn't help - it only adds the overhead of checking.
If cache line = 64 bytes and int = 4 bytes, how many ints are loaded on a single cache miss?
Key Ideas
- Memory Wall: CPU is 100-300× faster than RAM
- Hierarchy: registers → L1 → L2 → L3 → RAM → SSD
- Smaller = faster, but more expensive
- Temporal locality: reuse of the same data
- Spatial locality: access to neighboring data
- Cache line = 64 bytes, loaded as a whole
Related Topics
Memory hierarchy is the foundation for understanding performance.
- Cache memory — Detailed internals of cache
- Virtual memory — Abstraction over physical memory
Вопросы для размышления
- How does data locality influence the choice of data structures in high-performance systems?
- Why can a single cache miss cost more than hundreds of arithmetic operations?
- What programming techniques maximize cache hits when traversing two-dimensional arrays?