Operating Systems

Caching in OS

Every second, a running computer makes billions of memory accesses. Without caches, the processor would idle 99% of the time, waiting for data from RAM. Without page cache, each file read would take milliseconds instead of microseconds. **Caching is the reason modern systems work.** RAM is 100,000 times faster than disk, but L1 cache is 100 times faster than RAM. These layers are invisible but absolutely critical for performance.

**Why does PostgreSQL require huge RAM?** A 100 GB database with 16 GB RAM runs 10x faster than with 4 GB. Page cache keeps hot data in memory - repeated queries are instant. A server with 512 GB RAM can cache the entire DB.
**How does Redis achieve a million operations/sec?** The entire DB is in RAM, zero disk accesses. Plus CPU cache locality: keys in the hash table are compactly located. L1/L2 cache hit rate > 95% → latency < 1 microsecond.
**Why are huge pages critical for virtualization?** A VM with 64 GB of memory generates millions of TLB misses. Huge pages (2 MB) reduce TLB entries by 512 times → TLB miss rate drops from 20% to 0.1%. VM performance: +30-50%.

Цели урока

Know the latency hierarchy: L1 ~1ns, L2 ~3ns, L3 ~12ns, RAM ~100ns
TLB: hit rate >99%, miss costs tens of ns for a page table walk
Page cache: Linux caches all reads/writes through RAM by default
Buffer cache: filesystem metadata (inodes, directory entries)
Apply fadvise/madvise to hint access patterns to the kernel

CPU Cache: Memory Hierarchy

**CPU cache** is ultra-fast memory inside the processor that hides the latency of accessing RAM. Without cache, each memory access would take ~100 nanoseconds (hundreds of CPU cycles). With cache - 1-4 nanoseconds. The difference is **100 times**.

The processor operates at a frequency of ~4 GHz (one cycle = 0.25 nanoseconds). Accessing RAM takes ~300 cycles. If each instruction waited for data from memory, the CPU would idle 99% of the time!

**Cache line** - the minimum unit of caching. Usually 64 bytes. Reading one byte causes the processor to load an entire line (64 bytes) into the cache. This is spatial locality: neighboring data is often used together.

Cache miss penalty in real life

Run a benchmark: traverse a 1 GB array. **Sequential access:** 4 GB/s. **Random access:** 200 MB/s. The difference is **20 times** - this is the cost of cache misses. The processor idles, waiting for data from RAM.

**Cache associativity.** Where to place a memory block in the cache? - **Direct-mapped:** each address → strictly defined cache line. Fast, but conflicts. - **Fully associative:** can be in any line. No conflicts, but slow search. - **N-way set associative:** compromise. For example, 8-way: a block can be in any of 8 lines of a set.

False sharing: a hidden performance enemy

Two threads work with different variables, but they are in the same cache line (64 bytes). Each change invalidates the cache of the other core → constant cache misses. **Solution:** padding to place variables in different cache lines. This gives a 5-10x boost in multithreaded code.

A program traverses an int[1024][1024] array. Row-major order gives 100 MB/s, column-major - 10 MB/s. What's the reason?

TLB: Page Table Cache

**TLB (Translation Lookaside Buffer)** is a cache for address translation virtual address → physical address. Without TLB, each memory access would require 3-4 additional RAM reads (page table walk). TLB reduces this to zero.

When a program accesses address `0x7fff1234`, the processor must: 1. Take the higher bits (page number): `0x7fff` 2. Find the physical page address in the page table 3. Add the offset (`0x1234`) The page table is a multi-level structure in RAM. Walking it takes 3-4 memory reads (~400 nanoseconds)!

**TLB is a cache for Page Table Entries (PTE).** Size: 64-512 entries. Each entry: virtual page → physical page. **TLB hit**: translation in 1 cycle. **TLB miss:** processor performs page table walk in RAM.

**TLB miss penalty** is huge: if the TLB miss rate increases from 1% to 5%, the program can slow down by **50%**. This happens when working with large memory volumes (databases, scientific computations).

TLB thrashing in the real world

A database scans a table of size 10 GB. Memory page = 4 KB. 2.5 million TLB entries are needed, but the processor has only 512. **Result:** TLB miss on each access → performance drops 3-5 times. **Solution:** huge pages (2 MB instead of 4 KB) - 512 times fewer TLB entries needed!

**TLB shootdown** - an expensive operation in multi-core systems. When the OS changes the page table (e.g., unmaps a page), it must invalidate the TLB on **all CPU cores**. This requires an inter-processor interrupt (IPI).

Optimization: PCID avoids TLB flush

Previously, a context switch (process switch) flushed the entire TLB - all 512 entries were discarded. Modern CPUs (Intel since 2010) support **PCID (Process Context ID)**: each TLB entry is tagged with a process ID. Context switching no longer requires TLB flush - entries of different processes coexist in the TLB. Context switch acceleration: 20-30%.

A database works with 20 GB of memory, TLB holds 512 entries, page size is 4 KB. TLB miss rate = 98%. What will improve performance?

Page Cache: Disk Cache in RAM

**Page cache** (or disk cache) is a cache of file contents in RAM. Reading from disk takes ~10 milliseconds (HDD) or ~100 microseconds (SSD). Reading from RAM - 100 nanoseconds. The difference is **100,000 times** (HDD) or **1000 times** (SSD).

Reading a file via `read()` doesn't cause the OS to go to disk directly. It first checks the page cache: maybe this data is already in memory? If yes - instant response. If not - reads from disk and caches for future use.

**Page cache uses all free RAM.** In Linux, the `free` command shows: ``` total used free buff/cache available Mem: 16GB 2GB 1GB 13GB 14GB ``` 13 GB in **buff/cache** - this is the page cache. It is not "occupied" irreversibly: if an application needs memory, the OS will instantly free the cache.

Page cache for databases

PostgreSQL reads a table of size 10 GB. First full scan: 100 seconds (reading from SSD). Second scan: 5 seconds (from page cache) - **20x faster**. That's why production databases always run on servers with huge RAM: page cache is free acceleration. A server with 256 GB RAM can cache the entire working dataset of a DB.

**Read-ahead** - the OS anticipates sequential access patterns. During sequential file reading, the kernel automatically loads the next pages into the cache before the application requests them. This hides disk latency.

**Write-back** - a `write()` call places data in the page cache, and the OS flushes it to disk **later** (asynchronously). This gives a huge speedup but is risky: if the system crashes before flush - data is lost.

Write optimization: batching

A program writes a log: 1000 lines/sec, each with `fsync()`. HDD does 100 IOPS → **10x overload**. Solution: batching - collect 10 lines, one `fsync()`. Throughput: 100 lines/sec → 1000 lines/sec. This is the basis of write-ahead log (WAL) in databases: group transactions, one fsync per batch.

An application writes 10,000 log lines to a file, each entry with fsync(). HDD supports 100 IOPS. How long will the write take?

Buffer Cache: File System Metadata

**Buffer cache** is a cache of file system metadata: inodes, directory entries, block bitmaps. Previously (before Linux 2.4), buffer cache was separate from page cache. Now they are unified, but conceptually differ in purpose.

**Page cache** stores file contents. **Buffer cache** stores the structures of the file system itself: where the file is located on disk, which blocks are free, access rights. Without buffer cache, each `ls`, `stat()`, `open()` would require reading metadata from disk.

**Dentry cache (dcache)** - a cache of file paths. Opening `/home/user/file.txt` causes the OS to remember: `"/home/user/file.txt" → inode 12345`. The next `open()` of the same file is instant - no need to traverse directories.

Buffer cache for git operations

Run `git status` in a repository with 10,000 files. First run: 2 seconds (reading all inodes/dentries from disk). Second run immediately after: 0.1 second - **20x faster**. Dentry cache remembered the directory structure. This is why IDEs and git work so fast with repeated operations.

**Block bitmaps and allocation metadata** - another type of buffered data. When the file system allocates a new block for a file, it reads the block bitmap (bitmap of free blocks). These structures are frequently modified, so it's critical to keep them in RAM.

**Journal/Log in file systems** (ext4, XFS) is also buffered. The journal is a circular log of metadata changes. Transactions are first written to the journal (sequentially, quickly), then asynchronously applied to the main structures. Journal in RAM → transactions are instant.

Optimization: tmpfs for temporary files

Compiling a large project creates 100,000 `.o` files in `/tmp`. On a regular FS: each file → inode allocation, block allocation, directory update. Huge load on buffer cache and disk. **Solution:** `tmpfs` - a file system in RAM. No disk operations, everything in memory. Compilation speeds up 2-3 times. After reboot, tmpfs is cleared - perfect for /tmp.

Page cache and buffer cache are the same thing, different names for one entity

Page cache and buffer cache are conceptually different: page cache stores file contents, buffer cache - file system metadata (inodes, dentries, bitmaps)

Although in modern Linux (since kernel 2.4) they are unified into one subsystem (unified page cache), semantically they solve different tasks. **Page cache:** during `read(fd, buf, 1024)` - data is cached here. This is the file content. **Buffer cache:** during `open("/path/file")` - the OS must traverse the path, read inodes, directory entries. These metadata are cached in dentry/inode cache (part of buffer cache). Without buffer cache, even a simple `ls` would require dozens of disk accesses to read directory structures. Understanding the separation of concepts is important for knowing which parts of the cache to drop (`echo 1` vs `echo 2` in `/proc/sys/vm/drop_caches`), and how to tune FS for different loads.

A program creates 10,000 small files (each 1 KB). Why is buffer cache critical for performance?

Key Ideas

**CPU cache hides RAM latency.** L1/L2/L3 caches provide data access in 1-12 ns instead of 100 ns (RAM). Cache miss penalty is huge: a 100x difference. Optimizing code for cache locality gives a 10-20x boost.
**TLB caches virtual → physical translation.** Without TLB, each memory access would require 3-4 additional RAM reads (page table walk). TLB hit rate 95-99% - critical for performance. Huge pages (2 MB) reduce TLB pressure by 512 times.
**Page cache turns disk into RAM.** Repeated file reads from page cache are 1000x faster (SSD) or 100,000x faster (HDD). OS uses all free RAM for page cache - it's not "wasted" memory, but an accelerator. Read-ahead and write-back hide disk latency.
**Buffer cache speeds up FS metadata.** Inodes, dentries, block bitmaps in RAM → file operations (open, stat, ls) are instant. Creating 10,000 files: 40,000 disk IO without buffer cache, ~10 IO with cache. Difference: 100 seconds vs 1 second.

Вопросы для размышления

If an application processes 1 TB of data, and RAM is only 16 GB, how can page cache still provide acceleration? Hint: hot data vs cold data.
TLB on the processor holds 512 entries. The program works with 10 GB of memory. How do huge pages (2 MB) help if pages are chosen randomly (random access)?
A database performs fsync() after each transaction for reliability, but this kills performance (10 ms on HDD). How to combine speed and reliability? Hint: group commit.

Связанные уроки

Operating Systems

Caching in OS

**Why does PostgreSQL require huge RAM?** A 100 GB database with 16 GB RAM runs 10x faster than with 4 GB. Page cache keeps hot data in memory - repeated queries are instant. A server with 512 GB RAM can cache the entire DB.
**How does Redis achieve a million operations/sec?** The entire DB is in RAM, zero disk accesses. Plus CPU cache locality: keys in the hash table are compactly located. L1/L2 cache hit rate > 95% → latency < 1 microsecond.
**Why are huge pages critical for virtualization?** A VM with 64 GB of memory generates millions of TLB misses. Huge pages (2 MB) reduce TLB entries by 512 times → TLB miss rate drops from 20% to 0.1%. VM performance: +30-50%.

Цели урока

Know the latency hierarchy: L1 ~1ns, L2 ~3ns, L3 ~12ns, RAM ~100ns
TLB: hit rate >99%, miss costs tens of ns for a page table walk
Page cache: Linux caches all reads/writes through RAM by default
Buffer cache: filesystem metadata (inodes, directory entries)
Apply fadvise/madvise to hint access patterns to the kernel

CPU Cache: Memory Hierarchy

The processor operates at a frequency of ~4 GHz (one cycle = 0.25 nanoseconds). Accessing RAM takes ~300 cycles. If each instruction waited for data from memory, the CPU would idle 99% of the time!

Cache miss penalty in real life

False sharing: a hidden performance enemy

A program traverses an int[1024][1024] array. Row-major order gives 100 MB/s, column-major - 10 MB/s. What's the reason?

TLB: Page Table Cache

TLB thrashing in the real world

Optimization: PCID avoids TLB flush

A database works with 20 GB of memory, TLB holds 512 entries, page size is 4 KB. TLB miss rate = 98%. What will improve performance?

Page Cache: Disk Cache in RAM

Page cache for databases

Write optimization: batching

An application writes 10,000 log lines to a file, each entry with fsync(). HDD supports 100 IOPS. How long will the write take?

Buffer Cache: File System Metadata

Buffer cache for git operations

Optimization: tmpfs for temporary files

Page cache and buffer cache are the same thing, different names for one entity

Page cache and buffer cache are conceptually different: page cache stores file contents, buffer cache - file system metadata (inodes, dentries, bitmaps)

A program creates 10,000 small files (each 1 KB). Why is buffer cache critical for performance?

Key Ideas

**CPU cache hides RAM latency.** L1/L2/L3 caches provide data access in 1-12 ns instead of 100 ns (RAM). Cache miss penalty is huge: a 100x difference. Optimizing code for cache locality gives a 10-20x boost.
**TLB caches virtual → physical translation.** Without TLB, each memory access would require 3-4 additional RAM reads (page table walk). TLB hit rate 95-99% - critical for performance. Huge pages (2 MB) reduce TLB pressure by 512 times.
**Page cache turns disk into RAM.** Repeated file reads from page cache are 1000x faster (SSD) or 100,000x faster (HDD). OS uses all free RAM for page cache - it's not "wasted" memory, but an accelerator. Read-ahead and write-back hide disk latency.
**Buffer cache speeds up FS metadata.** Inodes, dentries, block bitmaps in RAM → file operations (open, stat, ls) are instant. Creating 10,000 files: 40,000 disk IO without buffer cache, ~10 IO with cache. Difference: 100 seconds vs 1 second.

Вопросы для размышления

If an application processes 1 TB of data, and RAM is only 16 GB, how can page cache still provide acceleration? Hint: hot data vs cold data.
TLB on the processor holds 512 entries. The program works with 10 GB of memory. How do huge pages (2 MB) help if pages are chosen randomly (random access)?
A database performs fsync() after each transaction for reliability, but this kills performance (10 ms on HDD). How to combine speed and reliability? Hint: group commit.

Caching in OS

Цели урока

CPU Cache: Memory Hierarchy

Cache miss penalty in real life

False sharing: a hidden performance enemy

TLB: Page Table Cache

TLB thrashing in the real world

Optimization: PCID avoids TLB flush

Page Cache: Disk Cache in RAM

Page cache for databases

Write optimization: batching

Buffer Cache: File System Metadata

Buffer cache for git operations

Optimization: tmpfs for temporary files

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Caching in OS

Цели урока

CPU Cache: Memory Hierarchy

Cache miss penalty in real life

False sharing: a hidden performance enemy

TLB: Page Table Cache

TLB thrashing in the real world

Optimization: PCID avoids TLB flush

Page Cache: Disk Cache in RAM

Page cache for databases

Write optimization: batching

Buffer Cache: File System Metadata

Buffer cache for git operations

Optimization: tmpfs for temporary files

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки