Parallel Computing

Message Passing: MPI

1993: 14 months that unified supercomputers

Before MPI, parallel programming resembled cross-platform development without a common SDK - except worse. Cray T3D, Intel Paragon, nCUBE/2: three machines, three incompatible APIs. Code written for one supercomputer could not be moved to another. April 1992, Williamsburg, Virginia: the workshop that led to the MPI Forum. About 60 people from some 40 organizations, 14 months of work. MPI 1.0 was released in May 1994 and was adopted quickly. Thirty years later, MPI is available on essentially every system in the TOP500 list of the fastest computers on the planet. The standard describes an interface, not an implementation - which is why Open MPI, MPICH, Cray MPI, and IBM Spectrum MPI are each tuned to their hardware yet speak the same language.

MPI remains the HPC standard. CERN (LHC computations), NOAA (WRF weather forecasting), NASA (CFD simulations), DeepMind AlphaFold2 - all use MPI or MPI-compatible libraries. NCCL, the NVIDIA library that PyTorch uses for distributed GPU training, implements the same collective communication patterns MPI formalized in 1994.

MPI_Send and MPI_Recv: blocking point-to-point

1993. A conference room in Texas. Representatives of IBM, Intel, Cray, and universities from around the world - about 60 people from some 40 organizations - sit down with one goal: a unified standard for parallel computing in place of the per-vendor dialects of the Cray T3D, Intel Paragon, and nCUBE/2. The MPI Forum worked for 14 months. Thirty years later, the result still runs on the top-500 supercomputers in the world.

The foundation of MPI is the distributed-memory model. Each process lives in isolation: its own address space, no shared heap. The only way to exchange data is explicit message passing. This is not a limitation - it is a design decision. It scales from 4 cores to 4 million (IBM Sequoia reached 1.6M cores).

MPI_Send is a blocking call: it returns only when the send buffer is safe to reuse. Whether that means the data was copied into a system buffer or actually received by the destination depends on the implementation and the message size. For small messages (under an implementation-specific threshold, for example 8 KB): eager protocol - data is copied to a buffer and send returns without waiting for the receiver. For large messages: rendezvous protocol - send blocks until the receiver is ready and acknowledges. This invisible difference produces entire bug classes that never appear with test data but hang in production.
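The eager/rendezvous split can be captured in a few lines. The following is a minimal single-process model in plain C, not MPI code: `EAGER_LIMIT_BYTES`, `choose_protocol`, and `send_can_return` are hypothetical names introduced for illustration, and the 8 KB threshold is an assumption taken from the example above, since real limits vary by library, transport, and runtime settings.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical eager threshold for this model; real MPI libraries
   choose their own limit per transport and configuration. */
#define EAGER_LIMIT_BYTES ((size_t)8 * 1024)

typedef enum { PROTO_EAGER, PROTO_RENDEZVOUS } protocol_t;

/* Pick the protocol the model applies to a message of `bytes` bytes. */
protocol_t choose_protocol(size_t bytes) {
    return bytes < EAGER_LIMIT_BYTES ? PROTO_EAGER : PROTO_RENDEZVOUS;
}

/* True if a modeled blocking send may return.
   Eager: the payload is buffered, so the send completes on its own.
   Rendezvous: the send completes only once a matching receive is posted. */
bool send_can_return(size_t bytes, bool receive_posted) {
    if (choose_protocol(bytes) == PROTO_EAGER) {
        return true;
    }
    return receive_posted;
}
```

Calling `send_can_return(1024, false)` yields true, while a message above the limit with no posted receive yields false. That single branch is the reason a program can pass small tests and stall on large inputs.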

Non-blocking MPI_Isend / MPI_Irecv initiate the operation and return a request handle immediately. Data transfers in the background while the CPU keeps computing; the operation must later be completed with a wait or test call before the buffer is touched again. The classic pattern - overlap computation and communication: while boundary cells go out to neighbor processes, interior cells are already being processed. In climate models like MPAS-Ocean (developed at Los Alamos National Laboratory within the MPAS framework shared with the National Center for Atmospheric Research) this overlap is reported to deliver speedups on the order of 40% at thousands of cores.
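A rough way to reason about the benefit of overlap is a per-step timing model. The sketch below is plain C arithmetic, not MPI code. The function names are hypothetical, and it assumes ideal overlap, meaning the transfer progresses fully in the background while interior cells are computed.

```c
#include <assert.h>

/* One time step with no overlap: exchange the halo, then compute
   interior cells, then compute boundary cells. */
double step_time_serial(double t_comm, double t_interior, double t_boundary) {
    assert(t_comm >= 0 && t_interior >= 0 && t_boundary >= 0);
    return t_comm + t_interior + t_boundary;
}

/* One time step with overlap, following the usual ordering:
   1. start the non-blocking transfers (MPI_Isend / MPI_Irecv),
   2. compute interior cells, which need no remote data,
   3. wait for the transfers to complete (e.g. MPI_Waitall),
   4. compute boundary cells using the received halo.
   Steps 1-3 cost the larger of the two overlapped activities. */
double step_time_overlapped(double t_comm, double t_interior, double t_boundary) {
    assert(t_comm >= 0 && t_interior >= 0 && t_boundary >= 0);
    double hidden = t_comm > t_interior ? t_comm : t_interior;
    return hidden + t_boundary;
}
```

With 4 units of communication, 6 of interior work, and 1 of boundary work, the model gives 11 units without overlap and 7 with it. Once communication exceeds interior work, the step becomes communication-bound and more overlap no longer helps.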

Process 0 calls MPI_Send with a 100 MB buffer. Process 1 has not called MPI_Recv yet. What happens in an implementation with rendezvous protocol?

Bcast, Reduce, Scatter, Gather: tree over chain

Consider computing the sum of one billion numbers across 1000 processors. Each computes its own million - a local partial sum. Then 1000 partial sums must be merged into one global result. Using MPI_Send/Recv: 999 sequential operations. Using MPI_Reduce: a reduction tree with depth $\log_2(1000) \approx 10$. A factor of 100 difference in the number of communication steps.
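The chain-versus-tree comparison can be checked with a small single-process model in plain C. It does not call MPI: the array `partial` stands in for the per-rank partial sums, and the function names are hypothetical. The pairing pattern is a binomial-style tree, one of several shapes an implementation might pick.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential merge: the root receives p-1 partial sums one after another. */
size_t chain_steps(size_t p) {
    assert(p > 0);
    return p - 1;
}

/* Tree merge over partial[0..p-1]. In each round, every rank r that is a
   multiple of 2*stride adds in the value held by rank r+stride. The pairs
   within a round are independent, so a real run would handle them
   concurrently. Returns the number of rounds; the total ends up in
   partial[0]. */
size_t tree_reduce_sum(long long *partial, size_t p) {
    assert(p > 0);
    size_t rounds = 0;
    for (size_t stride = 1; stride < p; stride *= 2) {
        for (size_t r = 0; r + stride < p; r += 2 * stride) {
            partial[r] += partial[r + stride];
        }
        rounds++;
    }
    return rounds;
}
```

For p = 1000 the chain needs 999 steps and the tree finishes in 10 rounds, matching the $\lceil \log_2 1000 \rceil = 10$ estimate above.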

The reduction tree is not just an abstraction - implementations map it onto the machine's network topology. Open MPI implements several variants, among them a binomial tree (suited to latency-bound cases) and a binary tree (bandwidth-bound cases). For GPU clusters, NCCL (NVIDIA) implements Ring-Allreduce - the same algorithm running inside PyTorch DDP. Every training iteration of BERT under DDP includes an Allreduce of gradients across all GPUs during the backward pass. Without MPI-derived semantics under the hood, scalable LLM training would not exist.

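MPI_Allreduce is semantically MPI_Reduce followed by MPI_Bcast, executed as a single optimized operation. The implementation chooses the algorithm (recursive doubling, ring, and others) from the message size and process count. The result is available to all processes, not just root. In ML training this is critical: every GPU needs the averaged gradient to update weights synchronously.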
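The ring variant mentioned above for NCCL can be traced step by step in a single process. The sketch below is a plain C simulation, not MPI or NCCL code. It assumes p ranks each holding p chunks of one double, stored as `data[r * p + c]`. The function name and layout are hypothetical, chosen only to keep the model short.

```c
#include <assert.h>
#include <stdlib.h>

/* Simulated ring allreduce (sum). data[r * p + c] is chunk c on rank r.
   After the call every rank holds the element-wise sum over all ranks. */
void ring_allreduce_sum(double *data, int p) {
    assert(p > 0);
    double *in_flight = malloc((size_t)p * sizeof *in_flight);
    assert(in_flight != NULL);

    /* Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod p
       to its right neighbour, which adds it to its own copy. All sends of a
       step are captured first so they behave as if simultaneous. */
    for (int s = 0; s < p - 1; s++) {
        for (int r = 0; r < p; r++) {
            int c = ((r - s) % p + p) % p;
            in_flight[r] = data[r * p + c];
        }
        for (int r = 0; r < p; r++) {
            int c = ((r - s) % p + p) % p;
            data[((r + 1) % p) * p + c] += in_flight[r];
        }
    }
    /* Rank r now holds the fully reduced chunk (r + 1) mod p. */

    /* Phase 2, allgather: at step s, rank r forwards chunk (r + 1 - s) mod p
       to its right neighbour, which overwrites its stale copy. */
    for (int s = 0; s < p - 1; s++) {
        for (int r = 0; r < p; r++) {
            int c = ((r + 1 - s) % p + p) % p;
            in_flight[r] = data[r * p + c];
        }
        for (int r = 0; r < p; r++) {
            int c = ((r + 1 - s) % p + p) % p;
            data[((r + 1) % p) * p + c] = in_flight[r];
        }
    }
    free(in_flight);
}
```

Each rank sends 2(p - 1) chunks in total, each 1/p of the buffer. Per-rank traffic therefore stays close to twice the buffer size no matter how many ranks join, which is why the ring suits large gradient buffers.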

MPI_Scatter and MPI_Gather are inverses of each other. Scatter: root splits an array into N equal parts and distributes one chunk per process. Gather: collects chunks back. Together they implement fork-join data parallelism - the same pattern that MapReduce, Apache Spark, and Hadoop inherited from MPI in the 2000s.
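The fork-join shape of scatter and gather can be shown with a single-process model in plain C. No MPI is involved: the loop over `rank` stands in for processes that would run concurrently, and the helper names and `MAX_CHUNK` are hypothetical. Squaring each element is an arbitrary placeholder for the per-rank work.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_CHUNK 1024  /* hypothetical cap on the per-rank chunk size */

/* Model of scatter: copy chunk number `rank` of root's array into that
   rank's private buffer. */
void scatter_chunk(const int *root_buf, size_t chunk, size_t rank, int *local) {
    memcpy(local, root_buf + rank * chunk, chunk * sizeof *local);
}

/* Model of gather: copy a rank's private buffer back into slot `rank`. */
void gather_chunk(int *root_buf, size_t chunk, size_t rank, const int *local) {
    memcpy(root_buf + rank * chunk, local, chunk * sizeof *local);
}

/* Fork-join: split root_buf (p * chunk elements) into p equal chunks,
   let each rank square its own chunk, and collect the results in place. */
void scatter_square_gather(int *root_buf, size_t p, size_t chunk) {
    assert(p > 0 && chunk > 0 && chunk <= MAX_CHUNK);
    for (size_t rank = 0; rank < p; rank++) {
        int local[MAX_CHUNK];
        scatter_chunk(root_buf, chunk, rank, local);
        for (size_t i = 0; i < chunk; i++) {
            local[i] *= local[i];
        }
        gather_chunk(root_buf, chunk, rank, local);
    }
}
```

The model assumes the array divides evenly, as MPI_Scatter does. For uneven splits, MPI provides MPI_Scatterv and MPI_Gatherv, which take per-rank counts and displacements.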

MPI_Bcast broadcasts a buffer - a single value or a whole array - from root to all processes. Used when root loads configuration or hyperparameters and must synchronize them before computation starts. In DeepSpeed (Microsoft, the infrastructure behind GPT-4 class model training) a broadcast of the initial weights serves the same purpose: all GPU nodes begin from an identical state.

8 processes. The result of a reduction (sum) must be available to every process, not only root. What is the most efficient approach?

MPI Deadlock: when everyone waits for everyone

Two processes. Both want to exchange data. Both call MPI_Send first. MPI_Send with rendezvous protocol blocks - no one has called MPI_Recv. The system freezes. Permanently. This is not a hypothetical scenario - it is one of the most common causes of hangs in real MPI programs, especially when message sizes are large.

MPI_Sendrecv performs a send and a receive in a single call. The library progresses both halves together, so the send never waits behind the caller's own not-yet-posted receive and a circular wait cannot form. The deadlock prevention is built into the call itself, so one does not need to reason about ordering manually.
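To see what ordering by hand involves, the sketch below models a ring exchange under rendezvous semantics in plain C, with no MPI calls. Every rank sends to its right neighbour and receives from its left, and a send completes only when the destination is currently at its receive. The names and `MAX_RANKS` are hypothetical. Even ranks sending first and odd ranks receiving first is one common manual fix, used here as an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_RANKS 64  /* hypothetical cap for this model */

typedef enum { OP_SEND, OP_RECV } op_t;

/* Operation a rank performs at step 0 or 1. With parity ordering, odd
   ranks receive first; otherwise every rank sends first. */
static op_t op_at(int rank, int step, bool parity_ordered) {
    bool recv_first = parity_ordered && (rank % 2 == 1);
    if (step == 0) {
        return recv_first ? OP_RECV : OP_SEND;
    }
    return recv_first ? OP_SEND : OP_RECV;
}

/* Returns true if all p ranks finish both operations, false if the
   exchange stalls with some rank still blocked. */
bool ring_exchange_completes(int p, bool parity_ordered) {
    assert(p >= 2 && p <= MAX_RANKS);
    int done[MAX_RANKS] = {0};  /* operations completed per rank */
    bool progress = true;
    while (progress) {
        progress = false;
        for (int r = 0; r < p; r++) {
            int right = (r + 1) % p;
            /* A send from r matches only if `right` is at its receive. */
            if (done[r] < 2 && done[right] < 2 &&
                op_at(r, done[r], parity_ordered) == OP_SEND &&
                op_at(right, done[right], parity_ordered) == OP_RECV) {
                done[r]++;
                done[right]++;
                progress = true;
            }
        }
    }
    for (int r = 0; r < p; r++) {
        if (done[r] != 2) {
            return false;
        }
    }
    return true;
}
```

With every rank sending first the model stalls for any p, and with parity ordering it completes. In real code a single MPI_Sendrecv per rank gives the same result without the bookkeeping.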

The insidious case: deadlock appears only for large N due to the eager/rendezvous protocol difference. The program passes tests with small arrays and hangs in production. Detection tools: MUST (Marmot Umpire Scalable Tool, an MPI correctness checker) or Intel Trace Analyzer and Collector - they intercept MPI calls and build a wait-for graph to identify the cycle.
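The wait-for graph idea is simple enough to sketch. The plain C function below is a generic illustration, not how any particular tool is implemented. It assumes each blocked rank waits on exactly one peer, recorded in a hypothetical `waits_for` array holding valid rank indices, with `NOT_WAITING` marking ranks that are not blocked.

```c
#include <assert.h>

#define NOT_WAITING (-1)  /* marker for a rank that is not blocked */

/* waits_for[r] is the rank that r is blocked on, or NOT_WAITING.
   Returns the lowest rank that lies on a wait cycle, or NOT_WAITING
   if following the edges never leads back to the starting rank. */
int find_cycle_member(const int *waits_for, int n) {
    assert(n > 0);
    for (int start = 0; start < n; start++) {
        int cur = waits_for[start];
        /* A cycle through `start` has at most n edges. */
        for (int hops = 0; hops < n && cur != NOT_WAITING; hops++) {
            if (cur == start) {
                return start;
            }
            cur = waits_for[cur];
        }
    }
    return NOT_WAITING;
}
```

Two ranks blocked in a send to each other give `waits_for = {1, 0}`, and the function reports rank 0 as part of the cycle. A chain such as `{1, 2, NOT_WAITING}` has no cycle: rank 2 is still running, so the others are merely waiting.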

Collective operations - MPI_Bcast, MPI_Reduce, MPI_Barrier - do not deadlock as long as all processes in the communicator call them. The rule is simple: a collective operation must be called by every process in the group. If even one process skips MPI_Barrier, the rest wait forever. Conditional collective calls (`if (rank == 0) MPI_Barrier(...)`) are a classic antipattern.

MPI_Barrier is a synchronization point: all processes wait until the last one arrives. The distributed-memory analog of pthread_barrier_wait. Useful for separating computation phases, but every barrier is gated by the slowest process. Too many barriers destroy scalability.
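Both barrier properties, that everyone must arrive and that the slowest arrival sets the pace, fit in a small plain C model. It makes no MPI calls, and the function names and input arrays are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* A barrier releases only if every rank in the communicator calls it.
   calls_barrier[r] records whether rank r reaches the call at all. */
bool barrier_releases(const bool *calls_barrier, int p) {
    assert(p > 0);
    for (int r = 0; r < p; r++) {
        if (!calls_barrier[r]) {
            return false;  /* the ranks that did call wait forever */
        }
    }
    return true;
}

/* When all ranks do call, everyone leaves at the latest arrival time. */
double barrier_release_time(const double *arrival, int p) {
    assert(p > 0);
    double latest = arrival[0];
    for (int r = 1; r < p; r++) {
        if (arrival[r] > latest) {
            latest = arrival[r];
        }
    }
    return latest;
}
```

With arrival times of 1.0, 1.1, 1.0, and 3.5 seconds, every rank leaves at 3.5. Three ranks sit idle for about 2.5 seconds each, and that cost recurs at every barrier in the loop.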

Code: `if (rank % 2 == 0) MPI_Barrier(MPI_COMM_WORLD);`. The program runs on 4 processes. What happens?

Summary

MPI: message passing standard for distributed-memory systems. 30 years, top-500 supercomputers.
MPI_Send/Recv: blocking point-to-point. Small messages - eager (buffered). Large - rendezvous (receiver handshake required).
MPI_Isend/Irecv: non-blocking. Overlap communication and computation - the key to scalability.
Collective operations (Bcast, Reduce, Scatter, Gather, Allreduce): tree/ring topology. Ring-Allreduce lives in PyTorch DDP.
MPI deadlock: circular blocking send. Fix - MPI_Sendrecv or ordered operations.
Collective calls must be invoked on ALL processes in the communicator - no exceptions, no conditions.

Connections to other topics

MPI is the foundation of HPC and distributed computing. These topics extend the picture.

Actor Model (Erlang, Akka) — High-level abstraction of message passing with the same process isolation principles
CSP and Channels (Go) — Alternative formalism for message passing with synchronous channel semantics
Consensus: Paxos and Raft — Distributed protocols built on top of message passing primitives
Deadlock, Livelock, Starvation — MPI deadlock is the distributed-memory instance of the same root causes

Questions for reflection

PyTorch DDP uses Ring-Allreduce for gradient synchronization - an algorithm from the MPI world. Training LLaMA-3 on 2048 GPUs requires an Allreduce over the gradients of all parameters (70B values in float16 = 140 GB) on every backward pass. What tradeoffs arise between synchronization frequency (every batch vs gradient accumulation over N steps) and model convergence quality?

Related lessons

par-05 — Shared memory is the counterpart and contrast to message passing
par-04 — MPI deadlock is a distributed-memory instance of classic deadlock
par-07 — Actor Model is built on the same message passing isolation principles
ds-01-intro — MPI is the foundation for understanding distributed systems
ds-03-consensus — Paxos/Raft use the same collective communication patterns
dist-03-fallacies
net-53-distributed-intro