Parallel Computing
Threads and Processes
Dijkstra, Hoare, and the Birth of Synchronization Theory
1965: Edsger Dijkstra formalizes the semaphore and describes the critical section problem. His paper 'Solution of a Problem in Concurrent Programming Control' is the first rigorous definition of a race condition. 1968: Dijkstra introduces Dining Philosophers - the canonical deadlock example. 1972: Flynn's taxonomy (SISD/SIMD/MIMD) provides the theoretical framework for parallel architectures. 1978: Hoare develops CSP (Communicating Sequential Processes) - an alternative approach using message passing instead of shared memory. Go channels and Erlang actors are direct descendants of this idea.
Node.js handles 100,000 requests per second in a single thread. Java Tomcat by default creates a thread per request and hits a wall at 1000 concurrent connections. Not because one language is faster. Because thread and process architecture determines where the bottleneck lives.
- **Chrome** - one process per tab and extension: a crash in one tab does not kill the browser
- **Python gunicorn** - launches worker processes (not threads) to bypass the GIL: 4 workers = 4 cores fully utilized
- **Java Tomcat** - thread pool of 200 threads: one thread per HTTP request, ~50 KB overhead each
- **Nginx** - worker processes equal to core count, each in an event loop: no context switches under load
- **Go runtime** - 1 goroutine = ~2 KB vs ~50 KB OS thread: one million goroutines genuinely work
- **LMAX Disruptor** - ring buffer + busy-wait threads: 6 million transactions/s with zero context switches
Предварительные знания
Threads
PyTorch training on 8 GPUs. Each GPU worker is a thread. All threads see the same model weights in memory. Gradients update from multiple threads simultaneously - without synchronization, weights diverge. That is the core insight: **threads** are multiple lines of execution inside one process, sharing the same address space. Speed and danger come from the same source.
| What threads share | What each thread owns |
|---|---|
| Address space (heap) | Stack ~8 MB by default |
| Open files and sockets | CPU registers (RAX, RBX, RSP, RIP) |
| Global variables | Program counter (PC) |
| Program code | Local variables |
| Allocated memory | Thread ID + signal mask |
A **thread** is the smallest unit of execution scheduled by the OS. Creating a thread costs ~10-50 KB of stack and ~1 microsecond of CPU time. That is why Java Tomcat spawns hundreds of threads under load. The problem is not creation - it is **race conditions**: when two threads read and write the same memory location without synchronization, the result is undefined.
Thousands of threads become a bottleneck - not because of memory, but because of context switching. Linux switches threads every 4-15 ms on a timer. With 1000 threads on 8 cores - 125 threads per core, a switch every 50-100 microseconds. L1/L2 cache is perpetually cold for each incoming thread. This is exactly why Go goroutines and Rust tokio use a user-space M:N scheduler - hundreds of thousands of green threads on a small number of OS threads.
Two threads simultaneously execute shared_counter += 1. What result is possible?
Processes
Chrome keeps a separate process for each tab. A JavaScript page spinning in an infinite loop, consuming a full core - other tabs do not slow down. That is **process isolation**: each process lives in its own address space. The kernel enforces boundaries through the MMU. One process physically cannot read another's memory without explicit IPC.
| Property | Threads | Processes |
|---|---|---|
| Memory | Shared | Isolated (private) |
| Creation | ~1 µs, ~50 KB | ~1 ms, ~MB (fork copies page tables) |
| Communication | Via shared memory - nanoseconds | IPC: pipe/socket/shm - microseconds to milliseconds |
| Crash isolation | One thread crashes - whole process dies | One process crashes - others survive |
| Safety | Race conditions, shared state bugs | MMU isolation by default |
| GIL (Python) | GIL blocks CPU parallelism | Each process has its own GIL |
**IPC** (Inter-Process Communication): **pipes** - one-directional byte stream, ~50 MB/s; **shared memory** - common RAM region via mmap, memory bus speed; **sockets** - universal, works over network, ~10 µs overhead; **message queues** - POSIX mq, asynchronous. Gunicorn and uWSGI choose processes specifically to bypass Python's GIL - each worker independently saturates one core.
Python's GIL prevents threads from running Python code in parallel. How can this be worked around for CPU-bound tasks?
Context Switch and Its Hidden Cost
Linux switches a thread on a timer every 4-15 ms. Direct cost: save 168 bytes of registers, load the next 168 bytes, update the stack pointer. That is ~1-5 µs. The real cost is 10-100x higher - **cache pollution**. After the switch, L1/L2 cache is filled with the previous thread's data. The incoming thread spends the first 50-200 µs working at cache-miss latency instead of cache-hit latency.
| Operation | Direct cost | Effective cost |
|---|---|---|
| Thread -> Thread (same process) | ~1-5 µs | ~50 µs with cache warmup |
| Process -> Process | ~5-30 µs | ~100-300 µs with TLB flush + cache |
| TLB miss after switch | ~100 ns x N | N = thousands on a cold start |
| L1 cache miss | ~4 ns | vs L1 hit ~1 ns - 4x difference |
| L2 cache miss | ~12 ns | Adds latency to every memory operation |
This is why **thread pools** outperform creating a thread per request. Java ExecutorService, Python ThreadPoolExecutor, Go goroutines - all maintain pre-allocated thread pools. LMAX Disruptor (trading system, 6 million transactions/s) goes further: worker threads in a busy-wait loop, zero context switches. Cost: 100% CPU per core even at idle.
A server with 4 cores runs 400 CPU-bound threads. What happens to performance?
Scheduling: CFS and EEVDF
Who decides which thread runs next? The kernel **scheduler**. Linux since 2007 uses **CFS** (Completely Fair Scheduler). The idea is compact: each process has a virtual runtime (vruntime) - a counter of how much CPU it has received. The next process to run is the one with the smallest vruntime. Data structure: red-black tree, O(log n) to find the minimum.
| Algorithm | Principle | Used in |
|---|---|---|
| Round Robin | Each thread gets a fixed 10-100 ms time slice | Embedded systems, educational OSes |
| Priority-based | Highest priority runs first | SCHED_FIFO/SCHED_RR in Linux for realtime |
| CFS (Linux 2.6.23+) | vruntime - whoever worked less runs next | All normal Linux/Android processes |
| EEVDF (Linux 6.6+) | CFS + earliest eligible virtual deadline | Interactive tasks, low-latency workloads |
**CFS vruntime**: after each time slice vruntime(process) += delta_exec * (NICE_0_WEIGHT / weight). A process with nice=-20 gets ~88x more CPU than one with nice=+19. For realtime tasks Linux provides SCHED_FIFO (runs until explicit yield) and SCHED_RR (round-robin among RT processes) with priority over CFS. Audio servers (PulseAudio, PipeWire) use SCHED_RR to guarantee bounded latency.
Threads are always better than processes because they are lighter
Threads are faster to create and share data, but processes are safer (crash isolation) and bypass the GIL in Python. Go goroutines are a compromise: user-space threads with M:N multiplexing onto OS threads.
Threads share memory - fast communication, but race conditions and one crashing thread kills the entire process. Chrome chose processes for security (sandbox). Nginx chose worker processes for stability. Python gunicorn for GIL bypass. The choice depends on requirements: threads for shared-state work, processes for isolation and CPU-bound tasks.
In Linux CFS: process A (nice=0) and process B (nice=10). Who gets more CPU time?
Key Ideas
- **Threads** share process memory - nanosecond communication, but race conditions and one crash kills all
- **Processes** are MMU-isolated - crash safety and Python GIL bypass, but IPC costs microseconds to milliseconds
- **Context switch** costs ~1-30 µs directly, but cache pollution adds an effective 50-300 µs
- **CFS** in Linux: red-black tree by vruntime, O(log n), nice=-20 gets ~88x more CPU than nice=+19
- **Go goroutines** and **Rust tokio** bypass OS context switches via M:N multiplexing - hundreds of thousands of coroutines on dozens of OS threads
- For CPU-bound work: N threads/processes matching core count; for I/O-bound: async/event loop is more efficient
Related Topics
Threads and processes are the foundation for understanding parallel systems:
- Synchronization — Mutex, semaphore, barrier - tools for coordinating threads
- Why We Need Parallelism — Threads and processes implement the parallelism from the previous lesson
- OS: Threads and Scheduler — How the Linux kernel creates and schedules threads
Вопросы для размышления
- Why did Go choose goroutines over OS threads? What does a user-space M:N scheduler gain?
- If Python had no GIL, would the choice between threading and multiprocessing change?
- Why is Nginx with worker_processes=4 and an event loop per process more efficient than 400 threads on 4 cores?
Связанные уроки
- par-01 — Motivation for parallelism from the previous lesson
- par-03 — Thread coordination via mutex, semaphore, barrier
- os-02-processes — Kernel internals of process management
- os-03-threads — Thread scheduling and context switch mechanics
- os-04-scheduling — Scheduler algorithms: CFS, EEVDF, SCHED_FIFO
- arch-14-multicore — Multicore hardware - the substrate threads run on