Scientific Computing
Parallel Scientific Computing
Buying twice as many servers to get twice the speed? Not guaranteed. AlphaFold 2 from DeepMind achieved 80% GPU efficiency on 128 TPUs. Most scientific codes degrade to 30-40% at scale. The difference is in sound decomposition patterns, load balancing, and communication design.
- **Climate model CESM:** 3D domain decomposition of atmosphere + ocean, 10,000+ MPI tasks, weak-scaling to ~100K cores
- **DistributedDataParallel (PyTorch):** gradient AllReduce via NCCL, ring-allreduce algorithm for minimal communication overhead
- **FLASH (astrophysics code):** Adaptive Mesh Refinement with dynamic balancing via Zoltan, strong-scaling to 300K cores
Domain Decomposition
The task: simulate Earth's atmosphere on a 1000x1000x100 cell grid. No single process has enough memory or time. **Domain Decomposition** cuts the spatial domain into pieces and assigns each piece to its own process. Physics within each piece is independent; interaction only occurs at boundaries.
**Ghost cells (halo cells):** each process stores 1-2 layers of cells from its neighbors. This allows computing finite differences at boundaries without extra neighbor queries during computation. Ghost cell exchange happens once per simulation step.
A 1000x1000 grid is split among 4 processes using 1D (strip) decomposition. How many boundary cells must be exchanged between P0 and P1 per step?
Load Balancing in HPC
Equal spatial slices do not mean equal work. In an N-body galaxy simulation: thousands of stars cluster at the center, the periphery is nearly empty. If the domain is split uniformly by coordinates, central processes work 100x longer than peripheral ones. All that waiting time is wasted. **Load balancing** distributes computational load, not data volume.
8 processes run a task. 7 finish in 10 seconds, one finishes in 30 seconds. What is the total parallel wall-clock time?
Communication Patterns: Scatter/Gather
Communication in MPI programs is not just send/recv. Different patterns have collective operations optimized at the network level: reduction trees, butterfly algorithms for AllReduce, pipelining. Knowing the patterns means choosing the right MPI call instead of building it manually.
**Non-blocking communication:** `MPI_Isend`/`MPI_Irecv` start a transfer and return immediately. While data travels over the network, the process computes the interior. `MPI_Wait` blocks until completion. This overlap of computation and communication is the key to high efficiency.
Which MPI operation best computes the sum of local residuals to check convergence, when all processes need the result?
Strong vs Weak Scaling
Adding 100 more servers - does the program run 100x faster? Almost never. Amdahl's Law explains why: the sequential portion of code places an absolute ceiling on speedup. There are two ways to evaluate a parallel program: **strong scaling** (same problem, more processes) and **weak scaling** (problem grows proportionally with process count).
**Roofline Model:** a diagnostic tool for bottleneck analysis. X-axis: arithmetic intensity (FLOP/byte), Y-axis: performance (GFLOPS). A program's working point is either memory-bound (left of the roof) or compute-bound (right). Memory-bound code benefits from better cache usage; compute-bound code benefits from SIMD/GPU.
A program has 5% sequential code (Amdahl's Law). What is the maximum achievable speedup regardless of processor count?
Parallel Scientific Computing Patterns
- **Domain Decomposition:** partition the spatial domain + ghost cells for boundary exchange; 2D/3D decomposition beats 1D by reducing the communication-to-computation ratio
- **Load Balancing:** equal volume != equal work; static partition for uniform problems, work stealing for irregular ones
- **Collective Communications:** Scatter/Gather/AllReduce are network-optimized; non-blocking Isend/Irecv overlaps communication with computation
- **Scaling Laws:** strong scaling - faster same task (Amdahl ceiling = 1/s); weak scaling - bigger task in same time (Gustafson); target efficiency >= 70%
Related Topics
Parallel patterns are applied together with HPC tools and numerical methods.
- High-Performance Computing: MPI, OpenMP, GPU — The foundational tools for implementing these patterns
- Monte Carlo Methods — Embarrassingly parallel task: ideal strong scaling with zero communication
- Numerical Methods: Differential Equations — PDE on grids - the primary use case for domain decomposition
Вопросы для размышления
- Why is 2D domain decomposition more efficient than 1D for a 2D grid, even though both are correct?
- How does non-blocking MPI communication change the structure of an iterative PDE solver loop?
- Under what condition is weak scaling the more appropriate metric than strong scaling for evaluating an HPC program?