Parallel Computing
Distributed Shared Memory
Google TPU Pod v4 - 4096 TPU chips with 256GB HBM each - forms a single address space via the ICI interconnect. Training GPT-4 on such a cluster requires exactly the DSM abstraction: a tensor is split, its parts live on different chips, but the code sees a single array.
- **PyTorch DDP** - Distributed Data Parallel: each GPU holds a model copy, AllReduce synchronizes gradients after backward - eventual consistency in action
- **NUMA-aware ML** - TensorFlow and PyTorch have NUMA-aware allocators: data is placed close to the GPU/CPU that will process it
- **Apache Spark** - RDD (Resilient Distributed Dataset) is a DSM abstraction for big data: the programmer sees a single collection, the system manages partitions across the cluster
Consistency Models: A Spectrum of Guarantees
A consistency model is a contract between the programmer and the memory system. It defines the order in which write operations become visible to other processors. Stronger guarantees mean a slower system. This is a fundamental trade-off.
Sequential Consistency (Lamport, 1979) is the most widely used strong model. The result of parallel execution matches some sequential interleaving of operations. Java volatile and C++ std::atomic with memory_order_seq_cst provide SC guarantees.
PyTorch distributed training (DDP) uses eventual consistency for gradient aggregation via AllReduce. There are no strict consistency guarantees - and that is fine: the forward pass for the next iteration waits for AllReduce to finish. The consistency model is chosen to match the task semantics.
What does Sequential Consistency guarantee?
Distributed Shared Memory: A Unified Address Space
DSM is a system that gives the programmer the illusion of shared memory, even though data is physically distributed across different machines. A process accesses an address without knowing whether the data is local or across the network. It is a software illusion layered on distributed hardware.
Page-based DSM (IVY, TreadMarks): the unit of sharing is a memory page (usually 4KB). On a miss - page fault triggers a network request to the page owner, then data transfer. False sharing: two processes use different variables on the same page and constantly interfere with each other.
NUMA is modern mainstream: all server x86 and ARM CPUs are NUMA-aware. numactl and taskset allow pinning processes to NUMA nodes. Poor NUMA locality is a common cause of unexpected performance drops in ML training on multi-socket servers.
What is false sharing in the context of DSM?
Replication: Reliability Through Redundancy
Replication in DSM serves two purposes: fault tolerance (data survives node failure) and read throughput (multiple nodes can read simultaneously). But every write must propagate to all replicas - and this is where the CAP theorem applies.
Primary-backup: one node (primary) accepts writes and replicates to backups. If the primary fails, a backup becomes the new primary. Redis Sentinel and PostgreSQL streaming replication use this model. Simple scheme, but the primary is a write bottleneck.
Cassandra and DynamoDB use quorum replication without a single leader. Replication factor R, write quorum W, read quorum Q. When W + Q > R, overlap is guaranteed - at least one node in the read quorum has seen the latest write. Typical configuration: R=3, W=2, Q=2.
Minimum read quorum in Cassandra with R=5, W=3?
Coherence Protocols: MSI, MESI, MOESI
Cache coherence is the problem of keeping caches of different CPUs consistent. Without a coherence protocol: CPU 0 reads X=0 and caches it. CPU 1 writes X=1. CPU 0 reads X and gets stale 0 from its cache. This breaks program correctness.
The MESI protocol: a cache line has 4 states. Modified (dirty, only mine), Exclusive (clean, only mine), Shared (clean, held by multiple), Invalid (stale). On write: transition to Modified + send Invalidate to everyone with a Shared copy.
Directory-based coherence (vs snooping): with many CPUs, broadcasting every Invalidate is expensive. A directory tracks which nodes hold a copy of each cache line. Invalidate is sent only to those nodes. AMD EPYC and Intel Xeon Scalable use directory-based coherence for cross-NUMA traffic.
Distributed Shared Memory delivers the same performance as local memory
DSM is a software abstraction over a slow network; latency between nodes is 10-100x higher than local RAM
Hiding the network behind an abstraction is convenient for programming, but it does not change physics. Locality-aware programming (matching data placement to computation) is critical for DSM and NUMA system performance
What happens to a cache line in the Shared state when a write is attempted?
Related Topics
DSM connects memory architecture with distributed systems:
- Memory Ordering and Barriers — Physical foundation of consistency models
- MapReduce and Spark — Alternative approach: computation locality instead of memory sharing
- SIMD and Vectorization — NUMA-aware SIMD - the next level of optimization
Key Ideas
- **Consistency models** are contracts about write visibility order; stronger = slower, weaker = more scalable
- **DSM** provides the illusion of unified memory over distributed hardware; false sharing is a hidden performance killer at fine granularity
- **Replication** solves fault tolerance and read throughput; the CAP theorem limits what can be guaranteed simultaneously
- **MESI protocol** ensures cache coherence: Invalidate on write to Shared guarantees everyone sees the new value
Вопросы для размышления
- Why did Google move from MapReduce toward more memory-sharing approaches in newer systems (Dataflow, Flume)?
- How does the choice of consistency model affect the horizontal scaling potential of a distributed system?
- MESI works well for a small number of CPUs. Why are directory-based protocols necessary at hundreds of nodes?
Связанные уроки
- par-05 — Shared Memory is the foundation; DSM generalizes it to the network level
- par-10 — Memory ordering and barriers are the foundation of consistency models
- par-14 — MapReduce addresses a different aspect: computation locality, not memory
- par-16 — SIMD and NUMA are the physical implementation of memory hierarchy
- par-09 — Lock-free data structures implement part of consistency requirements
- os-07-memory
- dist-03-fallacies