Parallel Computing

Threads and Processes

Dijkstra, Hoare, and the Birth of Synchronization Theory

1965: Edsger Dijkstra formalizes the semaphore and describes the critical section problem. His paper 'Solution of a Problem in Concurrent Programming Control' is the first rigorous definition of a race condition. 1968: Dijkstra introduces Dining Philosophers - the canonical deadlock example. 1972: Flynn's taxonomy (SISD/SIMD/MIMD) provides the theoretical framework for parallel architectures. 1978: Hoare develops CSP (Communicating Sequential Processes) - an alternative approach using message passing instead of shared memory. Go channels and Erlang actors are direct descendants of this idea.

Node.js handles 100,000 requests per second in a single thread. Java Tomcat by default creates a thread per request and hits a wall at 1000 concurrent connections. Not because one language is faster. Because thread and process architecture determines where the bottleneck lives.

  • **Chrome** - one process per tab and extension: a crash in one tab does not kill the browser
  • **Python gunicorn** - launches worker processes (not threads) to bypass the GIL: 4 workers = 4 cores fully utilized
  • **Java Tomcat** - thread pool of 200 threads: one thread per HTTP request, ~50 KB overhead each
  • **Nginx** - worker processes equal to core count, each in an event loop: no context switches under load
  • **Go runtime** - 1 goroutine = ~2 KB vs ~50 KB OS thread: one million goroutines genuinely work
  • **LMAX Disruptor** - ring buffer + busy-wait threads: 6 million transactions/s with zero context switches

Предварительные знания

  • Why We Need Parallelism

Threads

PyTorch training on 8 GPUs. Each GPU worker is a thread. All threads see the same model weights in memory. Gradients update from multiple threads simultaneously - without synchronization, weights diverge. That is the core insight: **threads** are multiple lines of execution inside one process, sharing the same address space. Speed and danger come from the same source.

What threads shareWhat each thread owns
Address space (heap)Stack ~8 MB by default
Open files and socketsCPU registers (RAX, RBX, RSP, RIP)
Global variablesProgram counter (PC)
Program codeLocal variables
Allocated memoryThread ID + signal mask

A **thread** is the smallest unit of execution scheduled by the OS. Creating a thread costs ~10-50 KB of stack and ~1 microsecond of CPU time. That is why Java Tomcat spawns hundreds of threads under load. The problem is not creation - it is **race conditions**: when two threads read and write the same memory location without synchronization, the result is undefined.

Thousands of threads become a bottleneck - not because of memory, but because of context switching. Linux switches threads every 4-15 ms on a timer. With 1000 threads on 8 cores - 125 threads per core, a switch every 50-100 microseconds. L1/L2 cache is perpetually cold for each incoming thread. This is exactly why Go goroutines and Rust tokio use a user-space M:N scheduler - hundreds of thousands of green threads on a small number of OS threads.

Two threads simultaneously execute shared_counter += 1. What result is possible?

Processes

Chrome keeps a separate process for each tab. A JavaScript page spinning in an infinite loop, consuming a full core - other tabs do not slow down. That is **process isolation**: each process lives in its own address space. The kernel enforces boundaries through the MMU. One process physically cannot read another's memory without explicit IPC.

PropertyThreadsProcesses
MemorySharedIsolated (private)
Creation~1 µs, ~50 KB~1 ms, ~MB (fork copies page tables)
CommunicationVia shared memory - nanosecondsIPC: pipe/socket/shm - microseconds to milliseconds
Crash isolationOne thread crashes - whole process diesOne process crashes - others survive
SafetyRace conditions, shared state bugsMMU isolation by default
GIL (Python)GIL blocks CPU parallelismEach process has its own GIL

**IPC** (Inter-Process Communication): **pipes** - one-directional byte stream, ~50 MB/s; **shared memory** - common RAM region via mmap, memory bus speed; **sockets** - universal, works over network, ~10 µs overhead; **message queues** - POSIX mq, asynchronous. Gunicorn and uWSGI choose processes specifically to bypass Python's GIL - each worker independently saturates one core.

Python's GIL prevents threads from running Python code in parallel. How can this be worked around for CPU-bound tasks?

Context Switch and Its Hidden Cost

Linux switches a thread on a timer every 4-15 ms. Direct cost: save 168 bytes of registers, load the next 168 bytes, update the stack pointer. That is ~1-5 µs. The real cost is 10-100x higher - **cache pollution**. After the switch, L1/L2 cache is filled with the previous thread's data. The incoming thread spends the first 50-200 µs working at cache-miss latency instead of cache-hit latency.

OperationDirect costEffective cost
Thread -> Thread (same process)~1-5 µs~50 µs with cache warmup
Process -> Process~5-30 µs~100-300 µs with TLB flush + cache
TLB miss after switch~100 ns x NN = thousands on a cold start
L1 cache miss~4 nsvs L1 hit ~1 ns - 4x difference
L2 cache miss~12 nsAdds latency to every memory operation

This is why **thread pools** outperform creating a thread per request. Java ExecutorService, Python ThreadPoolExecutor, Go goroutines - all maintain pre-allocated thread pools. LMAX Disruptor (trading system, 6 million transactions/s) goes further: worker threads in a busy-wait loop, zero context switches. Cost: 100% CPU per core even at idle.

A server with 4 cores runs 400 CPU-bound threads. What happens to performance?

Scheduling: CFS and EEVDF

Who decides which thread runs next? The kernel **scheduler**. Linux since 2007 uses **CFS** (Completely Fair Scheduler). The idea is compact: each process has a virtual runtime (vruntime) - a counter of how much CPU it has received. The next process to run is the one with the smallest vruntime. Data structure: red-black tree, O(log n) to find the minimum.

AlgorithmPrincipleUsed in
Round RobinEach thread gets a fixed 10-100 ms time sliceEmbedded systems, educational OSes
Priority-basedHighest priority runs firstSCHED_FIFO/SCHED_RR in Linux for realtime
CFS (Linux 2.6.23+)vruntime - whoever worked less runs nextAll normal Linux/Android processes
EEVDF (Linux 6.6+)CFS + earliest eligible virtual deadlineInteractive tasks, low-latency workloads

**CFS vruntime**: after each time slice vruntime(process) += delta_exec * (NICE_0_WEIGHT / weight). A process with nice=-20 gets ~88x more CPU than one with nice=+19. For realtime tasks Linux provides SCHED_FIFO (runs until explicit yield) and SCHED_RR (round-robin among RT processes) with priority over CFS. Audio servers (PulseAudio, PipeWire) use SCHED_RR to guarantee bounded latency.

Threads are always better than processes because they are lighter

Threads are faster to create and share data, but processes are safer (crash isolation) and bypass the GIL in Python. Go goroutines are a compromise: user-space threads with M:N multiplexing onto OS threads.

Threads share memory - fast communication, but race conditions and one crashing thread kills the entire process. Chrome chose processes for security (sandbox). Nginx chose worker processes for stability. Python gunicorn for GIL bypass. The choice depends on requirements: threads for shared-state work, processes for isolation and CPU-bound tasks.

In Linux CFS: process A (nice=0) and process B (nice=10). Who gets more CPU time?

Key Ideas

  • **Threads** share process memory - nanosecond communication, but race conditions and one crash kills all
  • **Processes** are MMU-isolated - crash safety and Python GIL bypass, but IPC costs microseconds to milliseconds
  • **Context switch** costs ~1-30 µs directly, but cache pollution adds an effective 50-300 µs
  • **CFS** in Linux: red-black tree by vruntime, O(log n), nice=-20 gets ~88x more CPU than nice=+19
  • **Go goroutines** and **Rust tokio** bypass OS context switches via M:N multiplexing - hundreds of thousands of coroutines on dozens of OS threads
  • For CPU-bound work: N threads/processes matching core count; for I/O-bound: async/event loop is more efficient

Related Topics

Threads and processes are the foundation for understanding parallel systems:

  • Synchronization — Mutex, semaphore, barrier - tools for coordinating threads
  • Why We Need Parallelism — Threads and processes implement the parallelism from the previous lesson
  • OS: Threads and Scheduler — How the Linux kernel creates and schedules threads

Вопросы для размышления

  • Why did Go choose goroutines over OS threads? What does a user-space M:N scheduler gain?
  • If Python had no GIL, would the choice between threading and multiprocessing change?
  • Why is Nginx with worker_processes=4 and an event loop per process more efficient than 400 threads on 4 cores?

Связанные уроки

  • par-01 — Motivation for parallelism from the previous lesson
  • par-03 — Thread coordination via mutex, semaphore, barrier
  • os-02-processes — Kernel internals of process management
  • os-03-threads — Thread scheduling and context switch mechanics
  • os-04-scheduling — Scheduler algorithms: CFS, EEVDF, SCHED_FIFO
  • arch-14-multicore — Multicore hardware - the substrate threads run on
Threads and Processes

0

1

Sign In