Scientific Computing

Parallel Scientific Computing

Buying twice as many servers to get twice the speed? Not guaranteed. AlphaFold 2 from DeepMind achieved 80% GPU efficiency on 128 TPUs. Most scientific codes degrade to 30-40% at scale. The difference is in sound decomposition patterns, load balancing, and communication design.

**Climate model CESM:** 3D domain decomposition of atmosphere + ocean, 10,000+ MPI tasks, weak-scaling to ~100K cores
**DistributedDataParallel (PyTorch):** gradient AllReduce via NCCL, ring-allreduce algorithm for minimal communication overhead
**FLASH (astrophysics code):** Adaptive Mesh Refinement with dynamic balancing via Zoltan, strong-scaling to 300K cores

Domain Decomposition

The task: simulate Earth's atmosphere on a 1000x1000x100 cell grid. No single process has enough memory or time. **Domain Decomposition** cuts the spatial domain into pieces and assigns each piece to its own process. Physics within each piece is independent; interaction only occurs at boundaries.

**Ghost cells (halo cells):** each process stores 1-2 layers of cells from its neighbors. This allows computing finite differences at boundaries without extra neighbor queries during computation. Ghost cell exchange happens once per simulation step.

A 1000x1000 grid is split among 4 processes using 1D (strip) decomposition. How many boundary cells must be exchanged between P0 and P1 per step?

Load Balancing in HPC

Equal spatial slices do not mean equal work. In an N-body galaxy simulation: thousands of stars cluster at the center, the periphery is nearly empty. If the domain is split uniformly by coordinates, central processes work 100x longer than peripheral ones. All that waiting time is wasted. **Load balancing** distributes computational load, not data volume.

8 processes run a task. 7 finish in 10 seconds, one finishes in 30 seconds. What is the total parallel wall-clock time?

Communication Patterns: Scatter/Gather

Communication in MPI programs is not just send/recv. Different patterns have collective operations optimized at the network level: reduction trees, butterfly algorithms for AllReduce, pipelining. Knowing the patterns means choosing the right MPI call instead of building it manually.

**Non-blocking communication:** `MPI_Isend`/`MPI_Irecv` start a transfer and return immediately. While data travels over the network, the process computes the interior. `MPI_Wait` blocks until completion. This overlap of computation and communication is the key to high efficiency.

Which MPI operation best computes the sum of local residuals to check convergence, when all processes need the result?

Strong vs Weak Scaling

Adding 100 more servers - does the program run 100x faster? Almost never. Amdahl's Law explains why: the sequential portion of code places an absolute ceiling on speedup. There are two ways to evaluate a parallel program: **strong scaling** (same problem, more processes) and **weak scaling** (problem grows proportionally with process count).

**Roofline Model:** a diagnostic tool for bottleneck analysis. X-axis: arithmetic intensity (FLOP/byte), Y-axis: performance (GFLOPS). A program's working point is either memory-bound (left of the roof) or compute-bound (right). Memory-bound code benefits from better cache usage; compute-bound code benefits from SIMD/GPU.

A program has 5% sequential code (Amdahl's Law). What is the maximum achievable speedup regardless of processor count?