Operating Systems

Advanced Synchronization

The Big Kernel Lock lived in Linux from the arrival of SMP in 1996, holding the entire kernel under one lock on every syscall. In May 2011 Arnd Bergmann tore out the last reference in 2.6.39, the finale of a decade-long marathon to move the kernel onto fine-grained spinlocks, RCU, and lock-free structures. The cost of synchronization shows in latency: spinlock ~10ns uncontended, ~100ns under cache-line contention; futex wakeup ~25ns; a full mutex with a kernel trip ~1µs. The gap between 'scales to 128 cores' and 'one lock per syscall' is a question of which primitive gets picked.

**Linux kernel routing:** RCU allows processing 200M packet lookups/sec without locks. Critical for data centers, where every nanosecond = money.
**LMAX Disruptor:** Wait-free ring buffer with latency of 50-100ns. High-frequency trading, where microsecond delay = loss of millions of dollars.
**PostgreSQL MVCC:** Readers do not block writers through Multi-Version Concurrency Control (similar to RCU). Allows handling thousands of transactions simultaneously.

Цели урока

Distinguish spinlock from mutex: spinlock wins on short critical sections in multi-core systems
RW-lock: optimization for read-heavy workloads and the writer starvation problem
RCU (Read-Copy-Update): zero-cost readers in the Linux kernel, grace period for writers
Lock-free vs wait-free, CAS (compare-and-swap), the ABA problem
Know when lock-free loses to locked: high contention and complex memory ordering

Spinlocks - Busy Waiting

**Spinlock** is a synchronization primitive where a thread **actively spins in a loop** (spinning), checking the lock instead of sleeping. Unlike a mutex, which hands control over to the scheduler, a spinlock burns CPU cycles while waiting.

**When to use a spinlock:** - The critical section is **very short** (a few instructions) - Context switching is more expensive than busy waiting - In kernel space, where sleeping is impossible (interrupt handlers) - On multi-core systems (on a single core, spinlock = deadlock)

**Why the `pause` instruction?** Without it, the CPU aggressively speculates, loading cache lines. `pause` (x86) or `yield` (ARM) tells the processor: "this is a spinloop, don't speculate." It reduces power consumption and increases performance by 2-10x.

Spinlock in Linux kernel

Linux uses spinlock to protect data structures in interrupt context, where sleeping is not allowed. Example: `spin_lock_irqsave()` locks a spinlock **and disables interrupts on the current CPU**, preventing deadlock. ```c // kernel/sched/core.c - Linux scheduler spin_lock_irqsave(&rq->lock, flags); update_rq_clock(rq); // Critical section: ~10 instructions spin_unlock_irqrestore(&rq->lock, flags); ``` The critical section takes nanoseconds - a context switch would be more expensive.

**Danger of spinlock in userspace:** If a thread holds a spinlock and is preempted by the scheduler, other threads burn CPU uselessly until the next quantum (milliseconds!). Therefore, in userspace, **futex** or **adaptive mutex** is preferred.

Adaptive mutex - hybrid

Some pthread mutex implementations (e.g., glibc) use **adaptive spinning**: first spin for a few iterations, then sleep via futex. This gives the best of both worlds: - If the lock is released quickly - avoid context switch - If waiting long - don't burn CPU uselessly

**Spinlock on a single-core CPU = death!** On a single-core system, a spinlock held by one thread will never be released (the owner won't get CPU). Always check: `if (num_cpus > 1)`.

Why is a spinlock suitable for a critical section in a Linux kernel interrupt handler but not for a long operation in userspace?

RW-locks - Optimization for Read-Heavy Workloads

**Read-Write Lock (rwlock)** allows **multiple readers** to hold the lock simultaneously, but **a writer is exclusive**. Ideal for data structures where reads are 100x more frequent than writes (caches, routing tables, configurations).

**Rwlock semantics:** - **Read lock (shared):** Multiple threads can read simultaneously - **Write lock (exclusive):** Only one writer, blocks all readers - **Fairness issue:** A constant stream of readers can starve the writer (writer starvation)

**Writer starvation issue:** If there are many readers constantly acquiring the lock, a writer may wait indefinitely. Linux solves this with a **queued rwlock** with writer priority: if a writer is waiting, new readers are blocked.

Routing table in networking stack

The routing table is rarely updated (when the network changes) but read on every packet (millions of times per second). rwlock is ideal: ```c // net/ipv4/fib_trie.c - Forwarding Information Base read_lock(&fib_lock); struct fib_result res; fib_lookup(net, &fl4, &res, 0); // Route lookup read_unlock(&fib_lock); // Update (rare): write_lock(&fib_lock); fib_table_insert(tb, &cfg); // Modify table write_unlock(&fib_lock); ``` Throughput: ~10M lookups/sec without reader locks.

**Seq-lock - optimization for extremely rare writes.** Readers do not block at all! Instead, they check a **sequence counter** before and after reading. If the counter changed - retry. The writer increments the counter. Used for system time in Linux (`xtime_lock`).

When is rwlock worse than a regular mutex?

If writes are as frequent as reads (50/50), rwlock is **slower** than a mutex! Reason: bookkeeping (counting readers) is more expensive than a simple atomic flag. Profile before optimizing: - **90%+ reads** → rwlock wins - **50-60% reads** → mutex is faster - **99%+ reads, rare writes** → RCU (next concept)

In which scenario will rwlock provide the greatest performance boost compared to a regular mutex?

RCU - Readers Without Locks

**Read-Copy-Update (RCU)** is a revolutionary synchronization technique from the Linux kernel. Readers **do not take locks at all**! The idea: a writer creates a new version of the data, waits for all readers to finish with the old version, then frees the old one. RCU is the holy grail for read-heavy structures.

**RCU key concepts:** - **Grace Period** - the period between publishing a new version and freeing the old one. Guarantee: all readers that started before publishing have finished - **Quiescent State** - a moment when a thread is guaranteed not to hold RCU-protected pointers - **Reclamation** - safe deletion of old versions after the grace period

**Why doesn't `rcu_read_lock()` block?** It's just an increment of a thread-local counter! No atomic, no cache coherence traffic. Cost: ~1-2 CPU cycles. Compare with mutex (~50-100 cycles) or even atomic increment (~20 cycles).

RCU in Linux routing table

Linux uses RCU for FIB (Forwarding Information Base) - the routing table. Every network packet does a lookup (millions per second), updates are rare. ```c // net/ipv4/fib_semantics.c rcu_read_lock(); nh = fib_info_nh(fi, 0); // Find next hop if (nh->nh_dev == dev) { // Use the old version, even if someone is updating! } rcu_read_unlock(); ``` Performance: 40M lookups/sec on a single core without cache misses!

**Grace Period implementation:** How to know all readers are done? Linux tracks **context switches**: if all CPUs have gone through the scheduler, all RCU critical sections are complete. This can take milliseconds, so `synchronize_rcu()` is a **blocking call**.

When NOT to use RCU?

RCU is not a silver bullet: 1. **Frequent writes** - each update creates a copy. Expensive for write-heavy workloads 2. **Large structures** - copying a megabyte-sized structure will kill performance 3. **Strict order needed** - readers may see old versions until the grace period Example: request counter. Each request is a write, copying the counter is pointless. Atomic increment is much simpler.

**Memory ordering in RCU is critical!** `rcu_assign_pointer()` includes a **memory barrier**: ensures the new version is fully initialized before publication. Without the barrier, readers could see partially updated data (reordering!).

Why does RCU use Copy-on-Write instead of directly updating the data structure?

Lock-Free Data Structures

**Lock-free data structures** are the holy grail of multithreading. Guarantee: at least **one thread always progresses**, even if others are stalled. No locks, only **atomic operations** and **Compare-And-Swap (CAS)**. Difficult to implement, but provide incredible performance.

**Progress guarantees levels:** - **Wait-free:** Each operation completes in **a bounded number of steps**. Ideal, but extremely complex - **Lock-free:** **At least one thread** progresses. Retry may be infinite for unlucky threads - **Obstruction-free:** Progress if a thread works **alone** (without contention) - **Blocking (lock-based):** Locking can lead to a complete halt

**ABA problem - a tricky bug in lock-free structures.** Scenario: thread A reads head=X, falls asleep. Thread B removes X, then allocates X again (same address!). Thread A wakes up, CAS succeeds (same address), but data has changed! Solution: **tagged pointers** or **hazard pointers**.

ABA problem visualization

``` Initially: Stack = [A → B → C] Thread 1: pop() 1. Reads head=A, next=B 2. Sleeps before CAS Thread 2: 1. pop() → removes A 2. pop() → removes B 3. push(A) → allocates A again (same address!) Stack = [A → C] Thread 1 wakes up: 1. CAS(&head, A, B) - SUCCESS! (address A is the same) 2. head = B - but B is already removed! 3. Stack → [B (freed!) → ???] → CRASH ``` Solution: store `{pointer, version}` instead of a pointer.

**Memory ordering in lock-free code is critical!** CPU and compiler reorder instructions. Without memory barriers, partially updated data becomes visible. Standard models: **Acquire/Release** (most platforms), **Sequential Consistency** (strongest, slowest).

Lock-free queue in Facebook Folly

Facebook developed `folly::MPMCQueue` - Multiple-Producer-Multiple-Consumer lock-free queue. Used in thousands of services: ```cpp folly::MPMCQueue<int> queue(1024); // Fixed size // Producer (any number of threads): queue.blockingWrite(42); // Wait-free if space available // Consumer (any number of threads): int value; queue.blockingRead(value); // Wait-free if elements available ``` Performance: 40M ops/sec on 8 cores. Regular queue with mutex: ~5M ops/sec.

**When NOT to use lock-free?** Complexity grows exponentially. If the critical section is short and contention is low - a regular mutex is **simpler and faster**. Lock-free is justified for: 1) High contention (>8 threads), 2) Latency is critical, 3) Real-time systems.

Real-world: Disruptor pattern

LMAX Exchange (high-frequency trading) developed **Disruptor** - a wait-free ring buffer. Latency: **50-100 nanoseconds** for message passing between threads! Idea: producers and consumers work with different parts of the ring buffer, minimizing cache contention. Each element has a sequence number. Coordination through atomic increments. Regular queue with locks: ~1-10 microseconds. Disruptor: 100x faster.

Key Ideas

**Spinlocks - for short critical sections in kernel space.** Busy waiting is cheaper than context switch if the lock is held for nanoseconds. Dangerous in userspace (preemption by scheduler).
**RW-locks optimize read-heavy workloads (90%+ reads).** Multiple readers in parallel, writer is exclusive. Writer starvation issue is solved through queued locks.
**RCU - readers without locks through Copy-on-Write.** New version is published atomically, old one is deleted after grace period. Ideal for routing tables, where 99%+ reads.
**Lock-free structures guarantee progress without locks.** Through CAS and atomic operations. Complex (ABA problem, memory ordering), but provide incredible performance under high contention (16+ threads).

Вопросы для размышления

Designing an in-memory cache for a web server: 95% reads, 5% writes, 64 cores. Which synchronization primitive fits best: mutex, rwlock, RCU, or lock-free map? Why?
Why does the Linux kernel use different primitives for different contexts: spinlock in interrupt handlers, mutex in process context, RCU for routing? What are the trade-offs?
A lock-free algorithm benchmarks 10x faster than a mutex. When is it the right choice, and when is a mutex still preferable? What are the risks?