Parallel Computing

Why We Need Parallelism

NVIDIA H100 - 16,896 CUDA cores. GPT-4 trained across 25,000 A100s simultaneously. PyTorch uses all CPU cores and the full GPU by default. And yet naively written code runs on a single core out of all of them. Here is the catch: parallelism does not automatically speed up arbitrary code. Amdahl's Law is cold and precise: if 5% of code is sequential, the ceiling is 20x - regardless of how many cores exist. This is exactly why PyTorch separates data parallelism and model parallelism. That is not a design preference - it is math.

  • **LLM training:** GPT-4 trained on 25,000 A100s for ~100 days. A single A100 would take 6,800 years. Data parallelism + model parallelism is the only way to fit training into any reasonable time budget
  • **GPU inference:** H100 delivers 79.5 TFLOPS FP32 and 3,958 TFLOPS FP8. That 50x gap comes entirely from parallelism inside one chip - 16,896 cores computing independent tensor slices simultaneously
  • **Web servers:** Nginx holds 10,000+ connections on one core via concurrency - not parallelism. Confusing the two is a real engineering mistake: it costs 10x performance when the wrong model is chosen

Moore's Law and the End of the Clock Speed Race

2005. Intel quietly kills the Pentium 4 Prescott - running at 3.8 GHz and hot as a space heater. The 4+ GHz successor was ready. It was cancelled. Physics said no: processor power scales as the cube of frequency. A 10 GHz chip would burn hundreds of watts and melt anything around it. Since then transistors have kept doubling - but clock speed became a constant.

YearProcessorClock SpeedTransistorsCores
1993Pentium60 MHz3.1M1
2000Pentium 41.5 GHz42M1
2004Pentium 4 Prescott3.8 GHz125M1
2006Core 2 Duo2.4 GHz291M2 ← the pivot!
2024Core i9-14900K6.0 GHz20,000M24

**Moore's Law** (1965): the number of transistors on a chip doubles roughly every ~2 years. The law still holds! But the **corollary** "processors get faster every year" stopped being true. Transistors now go toward **more cores**, not **higher clock speeds**.

Herb Sutter called it "The Free Lunch Is Over" in 2005 - and he was right. A program written in 2000 would speed up on its own by 2004: Intel shipped a faster chip and everything just ran faster. That stopped. Single-threaded code on a Core i9-14900K runs at roughly the same speed as on a Core 2 Quad from 2008. Seventy times more transistors - invisible to single-threaded code.

Herb Sutter: The Free Lunch Is Over (2005)

Herb Sutter's article "The Free Lunch Is Over" became the manifesto of parallel programming. He warned: the era when a program speeds up simply by buying a new processor is over. Developers will have to learn parallelism - or accept performance stagnation.

Why did processor clock speeds stop growing after 2005?

Amdahl's Law

Add more cores - sounds like the solution. Gene Amdahl showed in 1967 why it is not. **The sequential portion of a program is an absolute ceiling.** No amount of cores expands a bottleneck. This is the reason PyTorch separates data parallelism (each GPU holds its own model copy) from model parallelism (model layers split across GPUs): without breaking the sequential bottleneck, no number of H100s closes the gap.

**Sequential bottleneck:** if 5% of the code cannot be parallelized, the maximum speedup is 20x. No matter how many cores are added. This is why optimizing the sequential portion is often more important than adding cores.

**Gustafson's Law** (1988) - a more optimistic view: as the number of processors grows, we also increase the **problem size**. If the parallel portion scales with data, speedup grows linearly: S(p) = p - s·(p-1). This is relevant for big data and scientific computing.

A program has 20% sequential I/O and 80% parallel computation. How many cores are needed for a 4x speedup?

Parallelism vs Concurrency

Two words that get conflated constantly - and the cost is real. An engineer says "let's parallelize this" and reaches for `asyncio`. Gets concurrency. Wonders why the CPU-bound code is not faster. **Parallelism** - multiple things happen simultaneously on separate cores. **Concurrency** - one core intelligently interleaves tasks, not wasting time on I/O waits. Rob Pike nailed it: concurrency is about the structure of the program, parallelism is about execution.

PropertyParallelismConcurrency
EssenceSimultaneous executionManaging multiple tasks
RequiresMultiple cores/processorsCan run on 1 core
GoalSpeed up computationResponsiveness, I/O overlap
ExampleRendering 4 frames simultaneouslyWeb server handling 1000 requests
Analogy4 checkout lanes in a store1 lane with fast switching

Rob Pike (creator of Go): "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." Concurrency is about the **structure** of a program. Parallelism is about **execution**. Concurrency exists without parallelism (async on 1 core) and parallelism exists without concurrency (SIMD instructions).

Node.js is pure concurrency: one thread, event loop, thousands of connections - but a single CPU-bound call freezes everything. NumPy and PyTorch are pure parallelism: BLAS operations across multiple cores via OpenBLAS or MKL, SIMD instructions, zero `async` in the API. Knowing which model fits the problem means choosing the right tool rather than the plausible one.

A Node.js web server handles 10,000 simultaneous HTTP requests on a single core. This is an example of:

Speedup Metrics

"We parallelized it and it got faster" is a feeling, not a measurement. Three metrics turn the feeling into a number. **Speedup** - how many times faster. **Efficiency** - how honestly each core is being used. **Scalability** - what happens when the core count doubles. Without these numbers, it is impossible to say whether the bottleneck is the algorithm, the hardware, or the sequential portion.

Scalability TypeSpeedupExample
Linear (ideal)S(p) = pFully independent tasks
Sublinear (typical)S(p) < pMost real-world programs
Superlinear (rare)S(p) > pData fits in cache after splitting

**Superlinear speedup** (S > p) seems paradoxical, but is possible: when data is split across cores, each chunk fits in L2/L3 cache while the original dataset does not. The cache effect provides an additional boost on top of parallelism.

**Strong scaling** - fixed workload, increasing cores. Hits Amdahl fast. **Weak scaling** - workload and cores grow together. This is what makes supercomputers actually useful: a job on 1,000 GPUs is simply 1,000 times larger than a job on one - each GPU handles its own share. Distributed LLM training works exactly this way.

Don't forget the **overhead** of parallelism: thread creation, synchronization, communication. For small tasks the overhead may exceed the gain, and the parallel version can end up **slower** than single-threaded!

8 cores = 8x faster

Speedup is limited by the sequential portion of the code (Amdahl's Law) and parallelism overhead. 8 cores typically give 3–6x speedup.

Amdahl's Law: 10% sequential code means a 10x ceiling no matter how many cores exist. That is the theoretical maximum with zero overhead. In the real world, thread creation costs ~50 µs each, mutex synchronization adds contention, cache lines bounce between cores in tens of nanoseconds. PyTorch knows this - `torch.compile` and TorchDynamo optimize sequential graph sections specifically, not just throw more cores at the problem.

A program runs in 30s on 4 cores and in 100s on 1 core. What is the parallelization efficiency?

Key Ideas

  • **The power wall ended the free lunch (2005):** clock frequency became a constant, single-threaded code stopped automatically getting faster with new hardware - transistors now go into more cores, not higher speeds
  • **Amdahl's Law is the cold shower:** 5% sequential code means a 20x ceiling no matter how many cores exist. Optimizing the sequential portion beats adding more cores
  • **Parallelism vs concurrency are different tools:** parallelism is multiple cores executing simultaneously (PyTorch, NumPy, BLAS); concurrency is one core intelligently interleaving I/O-bound work (Node.js, asyncio)
  • **Metrics make it real:** Speedup = T₁/Tₚ, Efficiency = S/p. 8 cores at 70% efficiency = 5.6x actual speedup, not 8x - overhead and Amdahl eat the rest

Related Topics

Parallelism is connected to many areas of CS:

  • Threads and Processes — Basic primitives for implementing parallelism
  • Synchronization — Coordinating parallel threads - mutex, semaphore, barrier
  • Operating Systems — The OS manages processes, threads, and scheduling on cores

Вопросы для размышления

  • Why is Amdahl's Law so pessimistic, yet real supercomputers with millions of cores are still useful?
  • Can concurrency speed up a CPU-bound task? Can parallelism speed up an I/O-bound task?
  • Back to the start: if a phone has 8 cores, why do apps still lag?

Связанные уроки

  • par-02 — Threads and processes are the basic primitives through which parallelism is realized in practice.
  • os-01-intro — The OS manages core scheduling; understanding OS processes makes Amdahl's law concrete.
  • opt-01 — Parallelism is a performance optimization tool; Amdahl's law formalizes the trade-off between sequential and parallel portions.
  • alg-01-big-o — Algorithm complexity sets the ceiling for parallelization: an O(n²) algorithm with 50% sequential portion gains nothing from 100 cores.
  • calc-06-derivative-intro — The speedup curve as a function of cores is diminishing marginal returns - the same principle as in calculus and economics.
  • arch-04-cpu
  • os-02-processes
Why We Need Parallelism

0

1

Sign In