Operating Systems

Virtual Memory

How does a system keep 50 Chrome tabs, Photoshop, and an IDE open simultaneously with only 16 GB of RAM? Virtual memory is the abstraction that turns limited physical memory into the illusion of limitless space. Each program receives an isolated address space covering the entire machine - all transparently managed by the OS.

  • **Why does Chrome use so much memory?** Each tab is a separate process with a virtual address space of tens of gigabytes. But physically, only ~2-3 GB is used thanks to paging and shared memory. Paging allows Chrome to sacrifice memory for security (tab isolation).
  • **How does swap work on SSD?** macOS aggressively uses swap even with 32 GB of RAM. Why? By evicting rarely used pages (e.g., application initialization code), the OS frees up RAM for active data and file cache. Result: applications launch faster.
  • **Meltdown and Spectre: virtual memory vulnerabilities.** The 2018 attacks exploited speculative execution to read kernel memory via a TLB side-channel. The KPTI patch caused a performance drop of 5-30% due to TLB flush on every syscall. Security vs speed.

Цели урока

  • Explain paging: fixed 4K page size, virtual-to-physical mapping via page tables
  • Understand multi-level page tables (x86-64: 4 levels, 48-bit virtual address)
  • Know the TLB: hardware translation cache, ~99% hit rate, miss costs tens of ns
  • Compare page replacement: FIFO, LRU, Clock, Linux active/inactive lists
  • Distinguish demand paging, copy-on-write, mmap and when each applies

Paging: page-based memory organization

**Virtual Memory** lets every program treat memory as if it owned the whole machine - addresses 0 to 2⁴⁸ - while RAM may be only a few GB. The mechanism that pulls this off is **paging**: memory in fixed-size chunks, translated on the fly by the MMU.

Library with a limited table

A library with millions of books (disk) and a small table for 20 books (physical RAM). A researcher needs 200 books for a dissertation. **Paging** is the librarian (OS) bringing the necessary books as needed. The table limitation is invisible - it seems all 200 books are available simultaneously.

**How does paging work?** Memory is divided into fixed-size **pages** (usually 4 KB). A program's virtual address is split into two parts: 1. **Page number** 2. **Offset within the page** The OS translates the virtual address to a physical one via a **page table**.

**Why exactly 4 KB pages?** - Too small (512 bytes) → huge page table, lots of overhead - Too large (1 MB) → internal fragmentation (program needs 5 KB, 1 MB allocated) - 4 KB is a balance between overhead and fragmentation New CPUs support **huge pages** (2 MB, 1 GB) for databases and HPC.

**Key advantage of paging:** programs can use more memory than physically installed RAM. Inactive pages are offloaded to disk (**swapping**). The program doesn't notice - the OS transparently loads them upon access.

Why does Chrome consume so much RAM?

Chrome launches each tab as a separate process. Each process gets its own **virtual address space** of 4 GB (on 32-bit) or 128 TB (on 64-bit). Open 50 tabs - formally 50 × 4 GB = 200 GB of virtual memory! But physically, Chrome uses ~2-3 GB of RAM. Paging creates this illusion of limitless memory.

**Protection through paging.** Each entry in the page table has **protection bits**: Read/Write/Execute. Attempting to write to a read-only page → **segmentation fault**. The OS isolates processes: process A cannot access the memory of process B - their page tables do not overlap.

A process has a virtual address space of 4 GB, page size of 4 KB, but the computer has only 2 GB of physical RAM. Why doesn't the program crash when accessing the 3rd gigabyte of memory?

Page Tables: structure and multilevel tables

**Page Table** is a data structure that stores the mapping of virtual pages to physical frames. Sounds simple, but the implementation is full of tricks. A naive single-level table would be gigantic!

**Solution: multilevel (hierarchical) page tables.** Instead of one huge table - a tree of tables. x86-64 uses a **4-level structure**: PML4 → PDPT → PD → PT. If virtual pages are not used, the corresponding tables are not even allocated!

**Each entry in the page table (PTE - Page Table Entry)** contains not only the address of the physical frame but also metadata:

**Bits in Page Table Entry (x86-64):** - **Present (P):** Page in RAM or on disk (swap) - **Read/Write (R/W):** Writable or not - **User/Supervisor (U/S):** Accessible to userspace programs - **Accessed (A):** Has the page been read (for page replacement) - **Dirty (D):** Has the page been modified (needs saving on swap) - **Execute Disable (XD/NX):** Protection against code execution in data (against exploits)

Page fault: when the page is not in memory

The program accesses address `0x7fff12345000`. MMU goes through the page tables, finds the PTE, sees `present = 0` → generates a **page fault exception**. CPU transfers control to the OS. OS checks: 1. Is the address valid? (not out of bounds) 2. Are the rights sufficient? (not writing to read-only) 3. Is the page on disk (swap)? → Load into RAM 4. Lazy allocation? → Allocate a physical frame 5. Update PTE, return control The program continues from the same instruction - transparently!

**Copy-on-Write (COW)** - a clever optimization. During `fork()`, Linux does not copy all the parent's memory - both page tables simply point to the same physical frames, but with the `R/W = 0` bit. On a write attempt → page fault → OS makes a real copy of the page.

**Inverted page table** (inverted PT) - an alternative approach. Instead of a table for each virtual address - a table for each physical frame. Size is proportional to physical RAM, not virtual space. Used in some architectures (PowerPC, IA-64), but lookup is slower - requires a hash table.

Why is a multilevel page table more efficient than a single-level one for 64-bit systems?

Translation Lookaside Buffer (TLB)

**Problem:** with a 4-level page table, each memory access requires 4 additional RAM accesses (one per level for each table). This is catastrophically slow! The solution is the **TLB (Translation Lookaside Buffer)** - a hardware cache for page table entries.

TLB is like margin notes in a book.

A reader works through a thick textbook, constantly referring to the same terms. Looking them up in the glossary every time is slow. Instead, margin notes appear: "p. 42: definition of entropy." **TLB** is such notes for the MMU: "virtual page 0x7fff → physical frame 0x1234."

**TLB is a fully associative cache** inside the CPU. Usually 64-512 entries. When accessing a virtual address, the MMU **simultaneously** checks all TLB entries. A match (TLB hit) → instant translation. A miss (TLB miss) → slow page walk through tables.

**TLB characteristics (modern x86-64):** - **L1 DTLB:** 64 entries for data (data TLB) - **L1 ITLB:** 128 entries for instructions (instruction TLB) - **L2 TLB:** 1536 entries (shared for code and data) - **Hit rate:** usually 95-99% (locality of reference) - **Miss penalty:** ~100-200 cycles (page walk) Without TLB, modern processors would be 5-10 times slower!

**TLB flush** - a critical moment. When changing context (switching to another process), the TLB becomes useless - it contains entries from the old process! The OS must **flush the TLB**. On x86, this happens when writing to the CR3 register (pointer to the root page table).

Meltdown and Spectre: TLB vulnerabilities

**Meltdown attacks (2018)** exploited speculative execution of the CPU. The processor speculatively loaded kernel data into the TLB (even if access rights prohibited it), and through a side-channel (timing attack on the cache), secrets could be extracted. Patch: **KPTI (Kernel Page Table Isolation)** - separate kernel and userspace page tables. But this causes a TLB flush on every syscall → performance drop of 5-30%.

**PCID (Process Context ID)** - optimization to avoid TLB flush. Modern CPUs tag each TLB entry with a process identifier (PCID). When switching context, the TLB is not flushed - just the PCID is compared. Saves thousands of cycles on each context switch.

**Huge pages and TLB.** A regular page is 4 KB, a huge page is 2 MB or 1 GB. One TLB entry for a huge page covers 512 times more memory! Databases (PostgreSQL, MySQL) use huge pages to reduce TLB misses.

The TLB speeds up the translation of virtual addresses to physical ones. Why does the OS need to flush the TLB when switching to another process?

Page Replacement: eviction algorithms

**Scenario:** a program uses 4 GB of memory, but only 2 GB of RAM is physically available. Half of the pages are on disk (swap). When accessing a page on disk, a **page fault** occurs, and the OS must load it into RAM. But RAM is full! Some page needs to be **evicted**. Which one? This decision affects performance.

**Optimal algorithm (OPT / Belady's):** evict the page that will not be needed for the longest time. Ideal! But... requires knowledge of the future. Impossible to implement, used only for comparison.

**Main page replacement algorithms:** 1. **FIFO** - First-In-First-Out (oldest page) 2. **LRU** - Least Recently Used (not used for a long time) 3. **Clock (Second Chance)** - approximation of LRU 4. **LFU** - Least Frequently Used (rarely used) 5. **Working Set** - based on locality Linux uses a variation of Clock (approximation of LRU).

**FIFO (First-In-First-Out)** - the simplest algorithm. We maintain a queue of pages, evict the oldest one. Problem: the oldest page might be actively used! A classic example is **Belady's Anomaly**: increasing RAM can lead to more page faults.

**LRU (Least Recently Used)** - evict the page that has not been accessed for the longest time. Intuitively correct (locality of reference), but **expensive to implement accurately**. It requires tracking the time of each access - overhead.

**Clock Algorithm (Second Chance)** - approximation of LRU. We use the **Accessed bit** in the PTE (hardware automatically sets it on access). The OS cycles through the pages (like a clock hand). If `Accessed=1` → reset to 0 (give a second chance). If `Accessed=0` → evict.

Dirty pages and performance

When a page is evicted, the OS checks the **Dirty bit**. If the page hasn't been modified (clean) → it can simply be discarded (the disk copy is up-to-date). If it's dirty → it must be written to disk before eviction. Writing to disk is ~1000x slower! Therefore, the OS prefers to evict clean pages.

**Working Set Model.** Programs exhibit **locality of reference**: at any given time, only a subset of pages is actively used (**working set**). If RAM is less than the working set → constant page faults (**thrashing**). If more → smooth operation.

**Linux Page Replacement:** two-list structure. **Active list** - pages recently used. **Inactive list** - candidates for eviction. The periodic process `kswapd` moves pages between lists based on the Accessed bit. When memory is low, pages are evicted from the inactive list (priority: clean → dirty).

Swap is harmful and should be disabled for maximum performance

Swap is critical for system stability. Without swap, when memory is low, the OS will kill processes (OOM killer)

Common misconception: "swap is slow, disabling it will be faster." In reality, swap serves two roles: 1. **Emergency buffer:** Without swap, when RAM is exhausted, the kernel triggers the OOM Killer, which kills random processes. With swap, the system survives temporary memory spikes. 2. **Eviction of cold pages:** There is data loaded into memory but NOT used (e.g., initialization code that runs once at startup). By swapping them out, the OS frees RAM for active data and filesystem cache. The problem isn't swap, it's **thrashing** (constant swapping of active data). Solution: add RAM or reduce load. But completely disabling swap is like removing a seatbelt to drive faster.

Why does the OS prefer to choose "clean" pages over "dirty" pages for eviction?

Key ideas

  • **Paging - the foundation of virtual memory.** Memory is divided into pages (4 KB), virtual addresses are translated to physical ones via page tables. Programs can use more memory than physically installed RAM, thanks to swapping.
  • **Multilevel page tables save memory.** A single-level table for a 64-bit system would take up hundreds of gigabytes. A hierarchical structure (4 levels on x86-64) allocates tables only for used regions - overhead ~0.2%.
  • **TLB - a performance-critical cache.** Without TLB, each memory access would require 5 RAM accesses (page walk). TLB caches the virtual-to-physical mapping, speeding up translation by ~5 times. TLB miss and thrashing are the main performance enemies.
  • **Page replacement - a trade-off between algorithmic complexity and solution quality.** The optimal algorithm (OPT) requires future knowledge. LRU is good but expensive. Clock (Second Chance) is an approximation of LRU, used in Linux. Swap is not evil, but a survival mechanism when memory is scarce.

Related topics

Virtual memory is the foundation of modern OS, connected with many other concepts:

  • Memory management — Virtual memory is the high-level abstraction. Under the hood: physical frame allocation, buddy allocator, slab cache
  • Processes and threads — Every process has its own virtual address space. fork() uses Copy-on-Write for efficiency
  • File systems — mmap() maps files into virtual memory. The page cache uses the same machinery as virtual memory
  • OS security — Protection bits in PTEs (NX, U/S) prevent exploits. ASLR randomises virtual addresses to defend against attacks

Вопросы для размышления

  • When designing an OS for a device with 512 MB RAM (router, IoT): is swap justified, and which page replacement algorithm minimizes overhead under constrained resources?
  • Huge pages (2 MB) speed up databases but are not used by default for all applications. What trade-offs make them undesirable in the general case?
  • How would virtual memory architecture change if RAM and disk had identical latency? Would TLB and page replacement still be necessary?

Связанные уроки

  • os-07-memory — Virtual memory is a layer above physical memory
  • os-09-filesystems — Memory-mapped files unify VM and the filesystem
  • os-02-processes — Process isolation is implemented through virtual address spaces
  • os-11-security — ASLR and memory protection are features of virtual memory
  • db-14-mvcc
Virtual Memory

0

1

Sign In