Parallel Computing

Why We Need Parallelism

NVIDIA H100 - 16,896 CUDA cores. GPT-4 trained across 25,000 A100s simultaneously. PyTorch uses all CPU cores and the full GPU by default. And yet naively written code runs on a single core out of all of them. Here is the catch: parallelism does not automatically speed up arbitrary code. Amdahl's Law is cold and precise: if 5% of code is sequential, the ceiling is 20x - regardless of how many cores exist. This is exactly why PyTorch separates data parallelism and model parallelism. That is not a design preference - it is math.

**LLM training:** GPT-4 trained on 25,000 A100s for ~100 days. A single A100 would take 6,800 years. Data parallelism + model parallelism is the only way to fit training into any reasonable time budget
**GPU inference:** H100 delivers 79.5 TFLOPS FP32 and 3,958 TFLOPS FP8. That 50x gap comes entirely from parallelism inside one chip - 16,896 cores computing independent tensor slices simultaneously
**Web servers:** Nginx holds 10,000+ connections on one core via concurrency - not parallelism. Confusing the two is a real engineering mistake: it costs 10x performance when the wrong model is chosen

Moore's Law and the End of the Clock Speed Race

2005. Intel quietly kills the Pentium 4 Prescott - running at 3.8 GHz and hot as a space heater. The 4+ GHz successor was ready. It was cancelled. Physics said no: processor power scales as the cube of frequency. A 10 GHz chip would burn hundreds of watts and melt anything around it. Since then transistors have kept doubling - but clock speed became a constant.

Year	Processor	Clock Speed	Transistors	Cores
1993	Pentium	60 MHz	3.1M	1
2000	Pentium 4	1.5 GHz	42M	1
2004	Pentium 4 Prescott	3.8 GHz	125M	1
2006	Core 2 Duo	2.4 GHz	291M	2 ← the pivot!
2024	Core i9-14900K	6.0 GHz	20,000M	24

**Moore's Law** (1965): the number of transistors on a chip doubles roughly every ~2 years. The law still holds! But the **corollary** "processors get faster every year" stopped being true. Transistors now go toward **more cores**, not **higher clock speeds**.

Herb Sutter called it "The Free Lunch Is Over" in 2005 - and he was right. A program written in 2000 would speed up on its own by 2004: Intel shipped a faster chip and everything just ran faster. That stopped. Single-threaded code on a Core i9-14900K runs at roughly the same speed as on a Core 2 Quad from 2008. Seventy times more transistors - invisible to single-threaded code.

Herb Sutter: The Free Lunch Is Over (2005)

Herb Sutter's article "The Free Lunch Is Over" became the manifesto of parallel programming. He warned: the era when a program speeds up simply by buying a new processor is over. Developers will have to learn parallelism - or accept performance stagnation.

Why did processor clock speeds stop growing after 2005?

Amdahl's Law

Add more cores - sounds like the solution. Gene Amdahl showed in 1967 why it is not. **The sequential portion of a program is an absolute ceiling.** No amount of cores expands a bottleneck. This is the reason PyTorch separates data parallelism (each GPU holds its own model copy) from model parallelism (model layers split across GPUs): without breaking the sequential bottleneck, no number of H100s closes the gap.

**Sequential bottleneck:** if 5% of the code cannot be parallelized, the maximum speedup is 20x. No matter how many cores are added. This is why optimizing the sequential portion is often more important than adding cores.

**Gustafson's Law** (1988) - a more optimistic view: as the number of processors grows, we also increase the **problem size**. If the parallel portion scales with data, speedup grows linearly: S(p) = p - s·(p-1). This is relevant for big data and scientific computing.

A program has 20% sequential I/O and 80% parallel computation. How many cores are needed for a 4x speedup?

Parallelism vs Concurrency

Two words that get conflated constantly - and the cost is real. An engineer says "let's parallelize this" and reaches for `asyncio`. Gets concurrency. Wonders why the CPU-bound code is not faster. **Parallelism** - multiple things happen simultaneously on separate cores. **Concurrency** - one core intelligently interleaves tasks, not wasting time on I/O waits. Rob Pike nailed it: concurrency is about the structure of the program, parallelism is about execution.

Property	Parallelism	Concurrency
Essence	Simultaneous execution	Managing multiple tasks
Requires	Multiple cores/processors	Can run on 1 core
Goal	Speed up computation	Responsiveness, I/O overlap
Example	Rendering 4 frames simultaneously	Web server handling 1000 requests
Analogy	4 checkout lanes in a store	1 lane with fast switching

Rob Pike (creator of Go): "Concurrency is about dealing with lots of things at once. Parallelism is about doing lots of things at once." Concurrency is about the **structure** of a program. Parallelism is about **execution**. Concurrency exists without parallelism (async on 1 core) and parallelism exists without concurrency (SIMD instructions).

Node.js is pure concurrency: one thread, event loop, thousands of connections - but a single CPU-bound call freezes everything. NumPy and PyTorch are pure parallelism: BLAS operations across multiple cores via OpenBLAS or MKL, SIMD instructions, zero `async` in the API. Knowing which model fits the problem means choosing the right tool rather than the plausible one.

A Node.js web server handles 10,000 simultaneous HTTP requests on a single core. This is an example of:

Speedup Metrics

"We parallelized it and it got faster" is a feeling, not a measurement. Three metrics turn the feeling into a number. **Speedup** - how many times faster. **Efficiency** - how honestly each core is being used. **Scalability** - what happens when the core count doubles. Without these numbers, it is impossible to say whether the bottleneck is the algorithm, the hardware, or the sequential portion.

Scalability Type	Speedup	Example
Linear (ideal)	S(p) = p	Fully independent tasks
Sublinear (typical)	S(p) < p	Most real-world programs
Superlinear (rare)	S(p) > p	Data fits in cache after splitting

**Superlinear speedup** (S > p) seems paradoxical, but is possible: when data is split across cores, each chunk fits in L2/L3 cache while the original dataset does not. The cache effect provides an additional boost on top of parallelism.

**Strong scaling** - fixed workload, increasing cores. Hits Amdahl fast. **Weak scaling** - workload and cores grow together. This is what makes supercomputers actually useful: a job on 1,000 GPUs is simply 1,000 times larger than a job on one - each GPU handles its own share. Distributed LLM training works exactly this way.

Don't forget the **overhead** of parallelism: thread creation, synchronization, communication. For small tasks the overhead may exceed the gain, and the parallel version can end up **slower** than single-threaded!

8 cores = 8x faster

Speedup is limited by the sequential portion of the code (Amdahl's Law) and parallelism overhead. 8 cores typically give 3–6x speedup.

Amdahl's Law: 10% sequential code means a 10x ceiling no matter how many cores exist. That is the theoretical maximum with zero overhead. In the real world, thread creation costs ~50 µs each, mutex synchronization adds contention, cache lines bounce between cores in tens of nanoseconds. PyTorch knows this - `torch.compile` and TorchDynamo optimize sequential graph sections specifically, not just throw more cores at the problem.

A program runs in 30s on 4 cores and in 100s on 1 core. What is the parallelization efficiency?

Key Ideas

**The power wall ended the free lunch (2005):** clock frequency became a constant, single-threaded code stopped automatically getting faster with new hardware - transistors now go into more cores, not higher speeds
**Amdahl's Law is the cold shower:** 5% sequential code means a 20x ceiling no matter how many cores exist. Optimizing the sequential portion beats adding more cores
**Parallelism vs concurrency are different tools:** parallelism is multiple cores executing simultaneously (PyTorch, NumPy, BLAS); concurrency is one core intelligently interleaving I/O-bound work (Node.js, asyncio)
**Metrics make it real:** Speedup = T₁/Tₚ, Efficiency = S/p. 8 cores at 70% efficiency = 5.6x actual speedup, not 8x - overhead and Amdahl eat the rest

Вопросы для размышления

Why is Amdahl's Law so pessimistic, yet real supercomputers with millions of cores are still useful?
Can concurrency speed up a CPU-bound task? Can parallelism speed up an I/O-bound task?
Back to the start: if a phone has 8 cores, why do apps still lag?

Связанные уроки

par-02 — Threads and processes are the basic primitives through which parallelism is realized in practice.
os-01-intro — The OS manages core scheduling; understanding OS processes makes Amdahl's law concrete.
opt-01 — Parallelism is a performance optimization tool; Amdahl's law formalizes the trade-off between sequential and parallel portions.
alg-01-big-o — Algorithm complexity sets the ceiling for parallelization: an O(n²) algorithm with 50% sequential portion gains nothing from 100 cores.
calc-06-derivative-intro — The speedup curve as a function of cores is diminishing marginal returns - the same principle as in calculus and economics.
arch-04-cpu
os-02-processes

Parallel Computing

Why We Need Parallelism

**LLM training:** GPT-4 trained on 25,000 A100s for ~100 days. A single A100 would take 6,800 years. Data parallelism + model parallelism is the only way to fit training into any reasonable time budget
**GPU inference:** H100 delivers 79.5 TFLOPS FP32 and 3,958 TFLOPS FP8. That 50x gap comes entirely from parallelism inside one chip - 16,896 cores computing independent tensor slices simultaneously
**Web servers:** Nginx holds 10,000+ connections on one core via concurrency - not parallelism. Confusing the two is a real engineering mistake: it costs 10x performance when the wrong model is chosen

Moore's Law and the End of the Clock Speed Race

Year	Processor	Clock Speed	Transistors	Cores
1993	Pentium	60 MHz	3.1M	1
2000	Pentium 4	1.5 GHz	42M	1
2004	Pentium 4 Prescott	3.8 GHz	125M	1
2006	Core 2 Duo	2.4 GHz	291M	2 ← the pivot!
2024	Core i9-14900K	6.0 GHz	20,000M	24

Herb Sutter: The Free Lunch Is Over (2005)

Why did processor clock speeds stop growing after 2005?

Amdahl's Law

A program has 20% sequential I/O and 80% parallel computation. How many cores are needed for a 4x speedup?

Parallelism vs Concurrency

Property	Parallelism	Concurrency
Essence	Simultaneous execution	Managing multiple tasks
Requires	Multiple cores/processors	Can run on 1 core
Goal	Speed up computation	Responsiveness, I/O overlap
Example	Rendering 4 frames simultaneously	Web server handling 1000 requests
Analogy	4 checkout lanes in a store	1 lane with fast switching

A Node.js web server handles 10,000 simultaneous HTTP requests on a single core. This is an example of:

Speedup Metrics

Scalability Type	Speedup	Example
Linear (ideal)	S(p) = p	Fully independent tasks
Sublinear (typical)	S(p) < p	Most real-world programs
Superlinear (rare)	S(p) > p	Data fits in cache after splitting

8 cores = 8x faster

Speedup is limited by the sequential portion of the code (Amdahl's Law) and parallelism overhead. 8 cores typically give 3–6x speedup.

A program runs in 30s on 4 cores and in 100s on 1 core. What is the parallelization efficiency?

Key Ideas

**The power wall ended the free lunch (2005):** clock frequency became a constant, single-threaded code stopped automatically getting faster with new hardware - transistors now go into more cores, not higher speeds
**Amdahl's Law is the cold shower:** 5% sequential code means a 20x ceiling no matter how many cores exist. Optimizing the sequential portion beats adding more cores
**Parallelism vs concurrency are different tools:** parallelism is multiple cores executing simultaneously (PyTorch, NumPy, BLAS); concurrency is one core intelligently interleaving I/O-bound work (Node.js, asyncio)
**Metrics make it real:** Speedup = T₁/Tₚ, Efficiency = S/p. 8 cores at 70% efficiency = 5.6x actual speedup, not 8x - overhead and Amdahl eat the rest

Вопросы для размышления

Why is Amdahl's Law so pessimistic, yet real supercomputers with millions of cores are still useful?
Can concurrency speed up a CPU-bound task? Can parallelism speed up an I/O-bound task?
Back to the start: if a phone has 8 cores, why do apps still lag?

Связанные уроки

par-02 — Threads and processes are the basic primitives through which parallelism is realized in practice.
os-01-intro — The OS manages core scheduling; understanding OS processes makes Amdahl's law concrete.
opt-01 — Parallelism is a performance optimization tool; Amdahl's law formalizes the trade-off between sequential and parallel portions.
alg-01-big-o — Algorithm complexity sets the ceiling for parallelization: an O(n²) algorithm with 50% sequential portion gains nothing from 100 cores.
calc-06-derivative-intro — The speedup curve as a function of cores is diminishing marginal returns - the same principle as in calculus and economics.
arch-04-cpu
os-02-processes

Why We Need Parallelism

Moore's Law and the End of the Clock Speed Race

Herb Sutter: The Free Lunch Is Over (2005)

Amdahl's Law

Parallelism vs Concurrency

Speedup Metrics

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Why We Need Parallelism

Moore's Law and the End of the Clock Speed Race

Herb Sutter: The Free Lunch Is Over (2005)

Amdahl's Law

Parallelism vs Concurrency

Speedup Metrics

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки