Operating Systems
Caching in OS
Every second, a running computer makes billions of memory accesses. Without caches, the processor would idle 99% of the time, waiting for data from RAM. Without page cache, each file read would take milliseconds instead of microseconds. **Caching is the reason modern systems work.** RAM is 100,000 times faster than disk, but L1 cache is 100 times faster than RAM. These layers are invisible but absolutely critical for performance.
- **Why does PostgreSQL require huge RAM?** A 100 GB database with 16 GB RAM runs 10x faster than with 4 GB. Page cache keeps hot data in memory - repeated queries are instant. A server with 512 GB RAM can cache the entire DB.
- **How does Redis achieve a million operations/sec?** The entire DB is in RAM, zero disk accesses. Plus CPU cache locality: keys in the hash table are compactly located. L1/L2 cache hit rate > 95% → latency < 1 microsecond.
- **Why are huge pages critical for virtualization?** A VM with 64 GB of memory generates millions of TLB misses. Huge pages (2 MB) reduce TLB entries by 512 times → TLB miss rate drops from 20% to 0.1%. VM performance: +30-50%.
Цели урока
- Know the latency hierarchy: L1 ~1ns, L2 ~3ns, L3 ~12ns, RAM ~100ns
- TLB: hit rate >99%, miss costs tens of ns for a page table walk
- Page cache: Linux caches all reads/writes through RAM by default
- Buffer cache: filesystem metadata (inodes, directory entries)
- Apply fadvise/madvise to hint access patterns to the kernel
CPU Cache: Memory Hierarchy
**CPU cache** is ultra-fast memory inside the processor that hides the latency of accessing RAM. Without cache, each memory access would take ~100 nanoseconds (hundreds of CPU cycles). With cache - 1-4 nanoseconds. The difference is **100 times**.
The processor operates at a frequency of ~4 GHz (one cycle = 0.25 nanoseconds). Accessing RAM takes ~300 cycles. If each instruction waited for data from memory, the CPU would idle 99% of the time!
**Cache line** - the minimum unit of caching. Usually 64 bytes. Reading one byte causes the processor to load an entire line (64 bytes) into the cache. This is spatial locality: neighboring data is often used together.
Cache miss penalty in real life
Run a benchmark: traverse a 1 GB array. **Sequential access:** 4 GB/s. **Random access:** 200 MB/s. The difference is **20 times** - this is the cost of cache misses. The processor idles, waiting for data from RAM.
**Cache associativity.** Where to place a memory block in the cache? - **Direct-mapped:** each address → strictly defined cache line. Fast, but conflicts. - **Fully associative:** can be in any line. No conflicts, but slow search. - **N-way set associative:** compromise. For example, 8-way: a block can be in any of 8 lines of a set.
False sharing: a hidden performance enemy
Two threads work with different variables, but they are in the same cache line (64 bytes). Each change invalidates the cache of the other core → constant cache misses. **Solution:** padding to place variables in different cache lines. This gives a 5-10x boost in multithreaded code.
A program traverses an int[1024][1024] array. Row-major order gives 100 MB/s, column-major - 10 MB/s. What's the reason?
TLB: Page Table Cache
**TLB (Translation Lookaside Buffer)** is a cache for address translation virtual address → physical address. Without TLB, each memory access would require 3-4 additional RAM reads (page table walk). TLB reduces this to zero.
When a program accesses address `0x7fff1234`, the processor must: 1. Take the higher bits (page number): `0x7fff` 2. Find the physical page address in the page table 3. Add the offset (`0x1234`) The page table is a multi-level structure in RAM. Walking it takes 3-4 memory reads (~400 nanoseconds)!
**TLB is a cache for Page Table Entries (PTE).** Size: 64-512 entries. Each entry: virtual page → physical page. **TLB hit**: translation in 1 cycle. **TLB miss:** processor performs page table walk in RAM.
**TLB miss penalty** is huge: if the TLB miss rate increases from 1% to 5%, the program can slow down by **50%**. This happens when working with large memory volumes (databases, scientific computations).
TLB thrashing in the real world
A database scans a table of size 10 GB. Memory page = 4 KB. 2.5 million TLB entries are needed, but the processor has only 512. **Result:** TLB miss on each access → performance drops 3-5 times. **Solution:** huge pages (2 MB instead of 4 KB) - 512 times fewer TLB entries needed!
**TLB shootdown** - an expensive operation in multi-core systems. When the OS changes the page table (e.g., unmaps a page), it must invalidate the TLB on **all CPU cores**. This requires an inter-processor interrupt (IPI).
Optimization: PCID avoids TLB flush
Previously, a context switch (process switch) flushed the entire TLB - all 512 entries were discarded. Modern CPUs (Intel since 2010) support **PCID (Process Context ID)**: each TLB entry is tagged with a process ID. Context switching no longer requires TLB flush - entries of different processes coexist in the TLB. Context switch acceleration: 20-30%.
A database works with 20 GB of memory, TLB holds 512 entries, page size is 4 KB. TLB miss rate = 98%. What will improve performance?
Page Cache: Disk Cache in RAM
**Page cache** (or disk cache) is a cache of file contents in RAM. Reading from disk takes ~10 milliseconds (HDD) or ~100 microseconds (SSD). Reading from RAM - 100 nanoseconds. The difference is **100,000 times** (HDD) or **1000 times** (SSD).
Reading a file via `read()` doesn't cause the OS to go to disk directly. It first checks the page cache: maybe this data is already in memory? If yes - instant response. If not - reads from disk and caches for future use.
**Page cache uses all free RAM.** In Linux, the `free` command shows: ``` total used free buff/cache available Mem: 16GB 2GB 1GB 13GB 14GB ``` 13 GB in **buff/cache** - this is the page cache. It is not "occupied" irreversibly: if an application needs memory, the OS will instantly free the cache.
Page cache for databases
PostgreSQL reads a table of size 10 GB. First full scan: 100 seconds (reading from SSD). Second scan: 5 seconds (from page cache) - **20x faster**. That's why production databases always run on servers with huge RAM: page cache is free acceleration. A server with 256 GB RAM can cache the entire working dataset of a DB.
**Read-ahead** - the OS anticipates sequential access patterns. During sequential file reading, the kernel automatically loads the next pages into the cache before the application requests them. This hides disk latency.
**Write-back** - a `write()` call places data in the page cache, and the OS flushes it to disk **later** (asynchronously). This gives a huge speedup but is risky: if the system crashes before flush - data is lost.
Write optimization: batching
A program writes a log: 1000 lines/sec, each with `fsync()`. HDD does 100 IOPS → **10x overload**. Solution: batching - collect 10 lines, one `fsync()`. Throughput: 100 lines/sec → 1000 lines/sec. This is the basis of write-ahead log (WAL) in databases: group transactions, one fsync per batch.
An application writes 10,000 log lines to a file, each entry with fsync(). HDD supports 100 IOPS. How long will the write take?
Buffer Cache: File System Metadata
**Buffer cache** is a cache of file system metadata: inodes, directory entries, block bitmaps. Previously (before Linux 2.4), buffer cache was separate from page cache. Now they are unified, but conceptually differ in purpose.
**Page cache** stores file contents. **Buffer cache** stores the structures of the file system itself: where the file is located on disk, which blocks are free, access rights. Without buffer cache, each `ls`, `stat()`, `open()` would require reading metadata from disk.
**Dentry cache (dcache)** - a cache of file paths. Opening `/home/user/file.txt` causes the OS to remember: `"/home/user/file.txt" → inode 12345`. The next `open()` of the same file is instant - no need to traverse directories.
Buffer cache for git operations
Run `git status` in a repository with 10,000 files. First run: 2 seconds (reading all inodes/dentries from disk). Second run immediately after: 0.1 second - **20x faster**. Dentry cache remembered the directory structure. This is why IDEs and git work so fast with repeated operations.
**Block bitmaps and allocation metadata** - another type of buffered data. When the file system allocates a new block for a file, it reads the block bitmap (bitmap of free blocks). These structures are frequently modified, so it's critical to keep them in RAM.
**Journal/Log in file systems** (ext4, XFS) is also buffered. The journal is a circular log of metadata changes. Transactions are first written to the journal (sequentially, quickly), then asynchronously applied to the main structures. Journal in RAM → transactions are instant.
Optimization: tmpfs for temporary files
Compiling a large project creates 100,000 `.o` files in `/tmp`. On a regular FS: each file → inode allocation, block allocation, directory update. Huge load on buffer cache and disk. **Solution:** `tmpfs` - a file system in RAM. No disk operations, everything in memory. Compilation speeds up 2-3 times. After reboot, tmpfs is cleared - perfect for /tmp.
Page cache and buffer cache are the same thing, different names for one entity
Page cache and buffer cache are conceptually different: page cache stores file contents, buffer cache - file system metadata (inodes, dentries, bitmaps)
Although in modern Linux (since kernel 2.4) they are unified into one subsystem (unified page cache), semantically they solve different tasks. **Page cache:** during `read(fd, buf, 1024)` - data is cached here. This is the file content. **Buffer cache:** during `open("/path/file")` - the OS must traverse the path, read inodes, directory entries. These metadata are cached in dentry/inode cache (part of buffer cache). Without buffer cache, even a simple `ls` would require dozens of disk accesses to read directory structures. Understanding the separation of concepts is important for knowing which parts of the cache to drop (`echo 1` vs `echo 2` in `/proc/sys/vm/drop_caches`), and how to tune FS for different loads.
A program creates 10,000 small files (each 1 KB). Why is buffer cache critical for performance?
Key Ideas
- **CPU cache hides RAM latency.** L1/L2/L3 caches provide data access in 1-12 ns instead of 100 ns (RAM). Cache miss penalty is huge: a 100x difference. Optimizing code for cache locality gives a 10-20x boost.
- **TLB caches virtual → physical translation.** Without TLB, each memory access would require 3-4 additional RAM reads (page table walk). TLB hit rate 95-99% - critical for performance. Huge pages (2 MB) reduce TLB pressure by 512 times.
- **Page cache turns disk into RAM.** Repeated file reads from page cache are 1000x faster (SSD) or 100,000x faster (HDD). OS uses all free RAM for page cache - it's not "wasted" memory, but an accelerator. Read-ahead and write-back hide disk latency.
- **Buffer cache speeds up FS metadata.** Inodes, dentries, block bitmaps in RAM → file operations (open, stat, ls) are instant. Creating 10,000 files: 40,000 disk IO without buffer cache, ~10 IO with cache. Difference: 100 seconds vs 1 second.
Related Topics
Caching is a fundamental idea applicable at all levels of the stack:
- Virtual Memory — Page table walk is cached via TLB. Without TLB, virtual memory would be 3-5x slower. Huge pages reduce TLB pressure for large applications
- File Systems — Page cache and buffer cache are the foundation of FS performance. Journaling, block allocation, directory lookups - all depend on metadata caching
- System Programming — Understanding caches is critical for optimization: when to use mmap vs read, how to avoid false sharing, why aligned allocations by cache line (64 bytes)
- Processor Architecture — CPU cache hierarchy (L1/L2/L3), cache coherence protocols (MESI), prefetching - all these are hardware mechanisms managed by the OS
Вопросы для размышления
- If an application processes 1 TB of data, and RAM is only 16 GB, how can page cache still provide acceleration? Hint: hot data vs cold data.
- TLB on the processor holds 512 entries. The program works with 10 GB of memory. How do huge pages (2 MB) help if pages are chosen randomly (random access)?
- A database performs fsync() after each transaction for reliability, but this kills performance (10 ms on HDD). How to combine speed and reliability? Hint: group commit.