Parallel Computing

Threads and Processes

Dijkstra, Hoare, and the Birth of Synchronization Theory

1965: Edsger Dijkstra formalizes the semaphore and describes the critical section problem. His paper 'Solution of a Problem in Concurrent Programming Control' is the first rigorous definition of a race condition. 1968: Dijkstra introduces Dining Philosophers - the canonical deadlock example. 1972: Flynn's taxonomy (SISD/SIMD/MIMD) provides the theoretical framework for parallel architectures. 1978: Hoare develops CSP (Communicating Sequential Processes) - an alternative approach using message passing instead of shared memory. Go channels and Erlang actors are direct descendants of this idea.

Node.js handles 100,000 requests per second in a single thread. Java Tomcat by default creates a thread per request and hits a wall at 1000 concurrent connections. Not because one language is faster. Because thread and process architecture determines where the bottleneck lives.

**Chrome** - one process per tab and extension: a crash in one tab does not kill the browser
**Python gunicorn** - launches worker processes (not threads) to bypass the GIL: 4 workers = 4 cores fully utilized
**Java Tomcat** - thread pool of 200 threads: one thread per HTTP request, ~50 KB overhead each
**Nginx** - worker processes equal to core count, each in an event loop: no context switches under load
**Go runtime** - 1 goroutine = ~2 KB vs ~50 KB OS thread: one million goroutines genuinely work
**LMAX Disruptor** - ring buffer + busy-wait threads: 6 million transactions/s with zero context switches

Предварительные знания

Why We Need Parallelism

Threads

PyTorch training on 8 GPUs. Each GPU worker is a thread. All threads see the same model weights in memory. Gradients update from multiple threads simultaneously - without synchronization, weights diverge. That is the core insight: **threads** are multiple lines of execution inside one process, sharing the same address space. Speed and danger come from the same source.

What threads share	What each thread owns
Address space (heap)	Stack ~8 MB by default
Open files and sockets	CPU registers (RAX, RBX, RSP, RIP)
Global variables	Program counter (PC)
Program code	Local variables
Allocated memory	Thread ID + signal mask

A **thread** is the smallest unit of execution scheduled by the OS. Creating a thread costs ~10-50 KB of stack and ~1 microsecond of CPU time. That is why Java Tomcat spawns hundreds of threads under load. The problem is not creation - it is **race conditions**: when two threads read and write the same memory location without synchronization, the result is undefined.

Thousands of threads become a bottleneck - not because of memory, but because of context switching. Linux switches threads every 4-15 ms on a timer. With 1000 threads on 8 cores - 125 threads per core, a switch every 50-100 microseconds. L1/L2 cache is perpetually cold for each incoming thread. This is exactly why Go goroutines and Rust tokio use a user-space M:N scheduler - hundreds of thousands of green threads on a small number of OS threads.

Two threads simultaneously execute shared_counter += 1. What result is possible?

Processes

Chrome keeps a separate process for each tab. A JavaScript page spinning in an infinite loop, consuming a full core - other tabs do not slow down. That is **process isolation**: each process lives in its own address space. The kernel enforces boundaries through the MMU. One process physically cannot read another's memory without explicit IPC.

Property	Threads	Processes
Memory	Shared	Isolated (private)
Creation	~1 µs, ~50 KB	~1 ms, ~MB (fork copies page tables)
Communication	Via shared memory - nanoseconds	IPC: pipe/socket/shm - microseconds to milliseconds
Crash isolation	One thread crashes - whole process dies	One process crashes - others survive
Safety	Race conditions, shared state bugs	MMU isolation by default
GIL (Python)	GIL blocks CPU parallelism	Each process has its own GIL

**IPC** (Inter-Process Communication): **pipes** - one-directional byte stream, ~50 MB/s; **shared memory** - common RAM region via mmap, memory bus speed; **sockets** - universal, works over network, ~10 µs overhead; **message queues** - POSIX mq, asynchronous. Gunicorn and uWSGI choose processes specifically to bypass Python's GIL - each worker independently saturates one core.

Python's GIL prevents threads from running Python code in parallel. How can this be worked around for CPU-bound tasks?

Context Switch and Its Hidden Cost

Linux switches a thread on a timer every 4-15 ms. Direct cost: save 168 bytes of registers, load the next 168 bytes, update the stack pointer. That is ~1-5 µs. The real cost is 10-100x higher - **cache pollution**. After the switch, L1/L2 cache is filled with the previous thread's data. The incoming thread spends the first 50-200 µs working at cache-miss latency instead of cache-hit latency.

Operation	Direct cost	Effective cost
Thread -> Thread (same process)	~1-5 µs	~50 µs with cache warmup
Process -> Process	~5-30 µs	~100-300 µs with TLB flush + cache
TLB miss after switch	~100 ns x N	N = thousands on a cold start
L1 cache miss	~4 ns	vs L1 hit ~1 ns - 4x difference
L2 cache miss	~12 ns	Adds latency to every memory operation

This is why **thread pools** outperform creating a thread per request. Java ExecutorService, Python ThreadPoolExecutor, Go goroutines - all maintain pre-allocated thread pools. LMAX Disruptor (trading system, 6 million transactions/s) goes further: worker threads in a busy-wait loop, zero context switches. Cost: 100% CPU per core even at idle.

A server with 4 cores runs 400 CPU-bound threads. What happens to performance?

Scheduling: CFS and EEVDF

Who decides which thread runs next? The kernel **scheduler**. Linux since 2007 uses **CFS** (Completely Fair Scheduler). The idea is compact: each process has a virtual runtime (vruntime) - a counter of how much CPU it has received. The next process to run is the one with the smallest vruntime. Data structure: red-black tree, O(log n) to find the minimum.

Algorithm	Principle	Used in
Round Robin	Each thread gets a fixed 10-100 ms time slice	Embedded systems, educational OSes
Priority-based	Highest priority runs first	SCHED_FIFO/SCHED_RR in Linux for realtime
CFS (Linux 2.6.23+)	vruntime - whoever worked less runs next	All normal Linux/Android processes
EEVDF (Linux 6.6+)	CFS + earliest eligible virtual deadline	Interactive tasks, low-latency workloads

**CFS vruntime**: after each time slice vruntime(process) += delta_exec * (NICE_0_WEIGHT / weight). A process with nice=-20 gets ~88x more CPU than one with nice=+19. For realtime tasks Linux provides SCHED_FIFO (runs until explicit yield) and SCHED_RR (round-robin among RT processes) with priority over CFS. Audio servers (PulseAudio, PipeWire) use SCHED_RR to guarantee bounded latency.

Threads are always better than processes because they are lighter

Threads are faster to create and share data, but processes are safer (crash isolation) and bypass the GIL in Python. Go goroutines are a compromise: user-space threads with M:N multiplexing onto OS threads.

Threads share memory - fast communication, but race conditions and one crashing thread kills the entire process. Chrome chose processes for security (sandbox). Nginx chose worker processes for stability. Python gunicorn for GIL bypass. The choice depends on requirements: threads for shared-state work, processes for isolation and CPU-bound tasks.

In Linux CFS: process A (nice=0) and process B (nice=10). Who gets more CPU time?

Key Ideas

**Threads** share process memory - nanosecond communication, but race conditions and one crash kills all
**Processes** are MMU-isolated - crash safety and Python GIL bypass, but IPC costs microseconds to milliseconds
**Context switch** costs ~1-30 µs directly, but cache pollution adds an effective 50-300 µs
**CFS** in Linux: red-black tree by vruntime, O(log n), nice=-20 gets ~88x more CPU than nice=+19
**Go goroutines** and **Rust tokio** bypass OS context switches via M:N multiplexing - hundreds of thousands of coroutines on dozens of OS threads
For CPU-bound work: N threads/processes matching core count; for I/O-bound: async/event loop is more efficient

Вопросы для размышления

Why did Go choose goroutines over OS threads? What does a user-space M:N scheduler gain?
If Python had no GIL, would the choice between threading and multiprocessing change?
Why is Nginx with worker_processes=4 and an event loop per process more efficient than 400 threads on 4 cores?

Связанные уроки

par-01 — Motivation for parallelism from the previous lesson
par-03 — Thread coordination via mutex, semaphore, barrier
os-02-processes — Kernel internals of process management
os-03-threads — Thread scheduling and context switch mechanics
os-04-scheduling — Scheduler algorithms: CFS, EEVDF, SCHED_FIFO
arch-14-multicore — Multicore hardware - the substrate threads run on

Parallel Computing

Threads and Processes

Dijkstra, Hoare, and the Birth of Synchronization Theory

**Chrome** - one process per tab and extension: a crash in one tab does not kill the browser
**Python gunicorn** - launches worker processes (not threads) to bypass the GIL: 4 workers = 4 cores fully utilized
**Java Tomcat** - thread pool of 200 threads: one thread per HTTP request, ~50 KB overhead each
**Nginx** - worker processes equal to core count, each in an event loop: no context switches under load
**Go runtime** - 1 goroutine = ~2 KB vs ~50 KB OS thread: one million goroutines genuinely work
**LMAX Disruptor** - ring buffer + busy-wait threads: 6 million transactions/s with zero context switches

Предварительные знания

Why We Need Parallelism

Threads

What threads share	What each thread owns
Address space (heap)	Stack ~8 MB by default
Open files and sockets	CPU registers (RAX, RBX, RSP, RIP)
Global variables	Program counter (PC)
Program code	Local variables
Allocated memory	Thread ID + signal mask

Two threads simultaneously execute shared_counter += 1. What result is possible?

Processes

Property	Threads	Processes
Memory	Shared	Isolated (private)
Creation	~1 µs, ~50 KB	~1 ms, ~MB (fork copies page tables)
Communication	Via shared memory - nanoseconds	IPC: pipe/socket/shm - microseconds to milliseconds
Crash isolation	One thread crashes - whole process dies	One process crashes - others survive
Safety	Race conditions, shared state bugs	MMU isolation by default
GIL (Python)	GIL blocks CPU parallelism	Each process has its own GIL

Python's GIL prevents threads from running Python code in parallel. How can this be worked around for CPU-bound tasks?

Context Switch and Its Hidden Cost

Operation	Direct cost	Effective cost
Thread -> Thread (same process)	~1-5 µs	~50 µs with cache warmup
Process -> Process	~5-30 µs	~100-300 µs with TLB flush + cache
TLB miss after switch	~100 ns x N	N = thousands on a cold start
L1 cache miss	~4 ns	vs L1 hit ~1 ns - 4x difference
L2 cache miss	~12 ns	Adds latency to every memory operation

A server with 4 cores runs 400 CPU-bound threads. What happens to performance?

Scheduling: CFS and EEVDF

Algorithm	Principle	Used in
Round Robin	Each thread gets a fixed 10-100 ms time slice	Embedded systems, educational OSes
Priority-based	Highest priority runs first	SCHED_FIFO/SCHED_RR in Linux for realtime
CFS (Linux 2.6.23+)	vruntime - whoever worked less runs next	All normal Linux/Android processes
EEVDF (Linux 6.6+)	CFS + earliest eligible virtual deadline	Interactive tasks, low-latency workloads

Threads are always better than processes because they are lighter

In Linux CFS: process A (nice=0) and process B (nice=10). Who gets more CPU time?

Key Ideas

**Threads** share process memory - nanosecond communication, but race conditions and one crash kills all
**Processes** are MMU-isolated - crash safety and Python GIL bypass, but IPC costs microseconds to milliseconds
**Context switch** costs ~1-30 µs directly, but cache pollution adds an effective 50-300 µs
**CFS** in Linux: red-black tree by vruntime, O(log n), nice=-20 gets ~88x more CPU than nice=+19
**Go goroutines** and **Rust tokio** bypass OS context switches via M:N multiplexing - hundreds of thousands of coroutines on dozens of OS threads
For CPU-bound work: N threads/processes matching core count; for I/O-bound: async/event loop is more efficient

Вопросы для размышления

Why did Go choose goroutines over OS threads? What does a user-space M:N scheduler gain?
If Python had no GIL, would the choice between threading and multiprocessing change?
Why is Nginx with worker_processes=4 and an event loop per process more efficient than 400 threads on 4 cores?

Связанные уроки

par-01 — Motivation for parallelism from the previous lesson
par-03 — Thread coordination via mutex, semaphore, barrier
os-02-processes — Kernel internals of process management
os-03-threads — Thread scheduling and context switch mechanics
os-04-scheduling — Scheduler algorithms: CFS, EEVDF, SCHED_FIFO
arch-14-multicore — Multicore hardware - the substrate threads run on

Threads and Processes

Dijkstra, Hoare, and the Birth of Synchronization Theory

Предварительные знания

Threads

Processes

Context Switch and Its Hidden Cost

Scheduling: CFS and EEVDF

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Threads and Processes

Dijkstra, Hoare, and the Birth of Synchronization Theory

Предварительные знания

Threads

Processes

Context Switch and Its Hidden Cost

Scheduling: CFS and EEVDF

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки