Computer Architecture

Memory Hierarchy: The Speed Pyramid

Цели урока

Understand the Memory Wall problem
Know the levels of the memory hierarchy and their characteristics
Understand temporal and spatial locality
Know the basics of how cache works

Предварительные знания

CPU structure
Concept of memory

CPU structure

A 4 GHz processor sits idle for 100+ cycles waiting for data from RAM. The memory hierarchy is the strategy to hide that latency.

Optimizing data structures for cache
Cache-oblivious algorithms
Profiling cache misses
Game development optimization

The Problem: Memory Wall

**Memory Wall:** CPU speed grows faster than memory speed. The gap widens every year.

Component	Latency	Relative
Register	~0.3 ns	1×
L1 cache	~1 ns	3×
L2 cache	~4 ns	12×
L3 cache	~12 ns	40×
RAM (DDR5)	~50-100 ns	150-300×
SSD NVMe	~20,000 ns	60,000×
HDD	~5,000,000 ns	15,000,000×

**Cost of stalling:** If data is not in cache, the CPU stalls for 100+ cycles. At 4 GHz that's ~25 ns of wasted time!

The Memory Wall is:

Hierarchy Levels

**Solution:** A hierarchy - small fast memory + large slow memory. Use both wisely!

Level	Size	Latency	Cost/GB
SRAM (cache)	MB	1-10 ns	~$50
DRAM (RAM)	GB	50-100 ns	~$5
Flash (SSD)	TB	10-100 µs	~$0.10
HDD	TB	5-10 ms	~$0.02

**Why not all SRAM?** 16 GB of SRAM would cost ~$800,000 (vs $50 for DRAM). Economics dictates the hierarchy!

Why is L1 cache smaller than L3?

Principles of Locality

The hierarchy works thanks to **locality** - patterns in how programs access data.

**Temporal Locality:** If data was accessed recently, it will likely be accessed again soon.

**Spatial Locality:** If data was accessed, nearby data will likely be accessed soon.

Pattern	Temporal	Spatial	Cache-friendly?
Sequential traversal	Low	High	Yes!
Loop on one variable	High	Low	Yes!
Random access	Low	Low	No!
Matrix traversal by rows	Low	High	Yes!
Matrix traversal by columns	Low	Low	No!

**Cache thrashing:** Random access to a large array kills performance. Every access is a cache miss!

An example of spatial locality:

How Cache Works

**Cache** - a fast copy of frequently-used data from RAM.

**Cache Line:** The cache loads not a single byte but a block (typically 64 bytes). This exploits spatial locality!

**Hit Rate:** A typical L1 hit rate is 95%+. This means only 5% of accesses go to L2.

Cache simply speeds up all memory accesses

Cache only speeds up repeated accesses (temporal) and neighboring data (spatial).

With random access, every access is a cache miss. The cache doesn't help - it only adds the overhead of checking.

If cache line = 64 bytes and int = 4 bytes, how many ints are loaded on a single cache miss?

Key Ideas

Memory Wall: CPU is 100-300× faster than RAM
Hierarchy: registers → L1 → L2 → L3 → RAM → SSD
Smaller = faster, but more expensive
Temporal locality: reuse of the same data
Spatial locality: access to neighboring data
Cache line = 64 bytes, loaded as a whole

Вопросы для размышления

How does data locality influence the choice of data structures in high-performance systems?
Why can a single cache miss cost more than hundreds of arithmetic operations?
What programming techniques maximize cache hits when traversing two-dimensional arrays?

Связанные уроки

os-01-intro

Computer Architecture

Memory Hierarchy: The Speed Pyramid

Цели урока

Understand the Memory Wall problem
Know the levels of the memory hierarchy and their characteristics
Understand temporal and spatial locality
Know the basics of how cache works

Предварительные знания

CPU structure
Concept of memory

CPU structure

A 4 GHz processor sits idle for 100+ cycles waiting for data from RAM. The memory hierarchy is the strategy to hide that latency.

Optimizing data structures for cache
Cache-oblivious algorithms
Profiling cache misses
Game development optimization

The Problem: Memory Wall

**Memory Wall:** CPU speed grows faster than memory speed. The gap widens every year.

Component	Latency	Relative
Register	~0.3 ns	1×
L1 cache	~1 ns	3×
L2 cache	~4 ns	12×
L3 cache	~12 ns	40×
RAM (DDR5)	~50-100 ns	150-300×
SSD NVMe	~20,000 ns	60,000×
HDD	~5,000,000 ns	15,000,000×

**Cost of stalling:** If data is not in cache, the CPU stalls for 100+ cycles. At 4 GHz that's ~25 ns of wasted time!

The Memory Wall is:

Hierarchy Levels

**Solution:** A hierarchy - small fast memory + large slow memory. Use both wisely!

Level	Size	Latency	Cost/GB
SRAM (cache)	MB	1-10 ns	~$50
DRAM (RAM)	GB	50-100 ns	~$5
Flash (SSD)	TB	10-100 µs	~$0.10
HDD	TB	5-10 ms	~$0.02

**Why not all SRAM?** 16 GB of SRAM would cost ~$800,000 (vs $50 for DRAM). Economics dictates the hierarchy!

Why is L1 cache smaller than L3?

Principles of Locality

The hierarchy works thanks to **locality** - patterns in how programs access data.

**Temporal Locality:** If data was accessed recently, it will likely be accessed again soon.

**Spatial Locality:** If data was accessed, nearby data will likely be accessed soon.

Pattern	Temporal	Spatial	Cache-friendly?
Sequential traversal	Low	High	Yes!
Loop on one variable	High	Low	Yes!
Random access	Low	Low	No!
Matrix traversal by rows	Low	High	Yes!
Matrix traversal by columns	Low	Low	No!

**Cache thrashing:** Random access to a large array kills performance. Every access is a cache miss!

An example of spatial locality:

How Cache Works

**Cache** - a fast copy of frequently-used data from RAM.

**Cache Line:** The cache loads not a single byte but a block (typically 64 bytes). This exploits spatial locality!

**Hit Rate:** A typical L1 hit rate is 95%+. This means only 5% of accesses go to L2.

Cache simply speeds up all memory accesses

Cache only speeds up repeated accesses (temporal) and neighboring data (spatial).

With random access, every access is a cache miss. The cache doesn't help - it only adds the overhead of checking.

If cache line = 64 bytes and int = 4 bytes, how many ints are loaded on a single cache miss?

Key Ideas

Memory Wall: CPU is 100-300× faster than RAM
Hierarchy: registers → L1 → L2 → L3 → RAM → SSD
Smaller = faster, but more expensive
Temporal locality: reuse of the same data
Spatial locality: access to neighboring data
Cache line = 64 bytes, loaded as a whole

Вопросы для размышления

How does data locality influence the choice of data structures in high-performance systems?
Why can a single cache miss cost more than hundreds of arithmetic operations?
What programming techniques maximize cache hits when traversing two-dimensional arrays?

Связанные уроки

os-01-intro

Memory Hierarchy: The Speed Pyramid

Цели урока

Предварительные знания

The Problem: Memory Wall

Hierarchy Levels

Principles of Locality

How Cache Works

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Memory Hierarchy: The Speed Pyramid

Цели урока

Предварительные знания

The Problem: Memory Wall

Hierarchy Levels

Principles of Locality

How Cache Works

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки