Parallel Computing

Distributed Shared Memory

Google TPU Pod v4 - 4096 TPU chips with 256GB HBM each - forms a single address space via the ICI interconnect. Training GPT-4 on such a cluster requires exactly the DSM abstraction: a tensor is split, its parts live on different chips, but the code sees a single array.

**PyTorch DDP** - Distributed Data Parallel: each GPU holds a model copy, AllReduce synchronizes gradients after backward - eventual consistency in action
**NUMA-aware ML** - TensorFlow and PyTorch have NUMA-aware allocators: data is placed close to the GPU/CPU that will process it
**Apache Spark** - RDD (Resilient Distributed Dataset) is a DSM abstraction for big data: the programmer sees a single collection, the system manages partitions across the cluster

Consistency Models: A Spectrum of Guarantees

A consistency model is a contract between the programmer and the memory system. It defines the order in which write operations become visible to other processors. Stronger guarantees mean a slower system. This is a fundamental trade-off.

Sequential Consistency (Lamport, 1979) is the most widely used strong model. The result of parallel execution matches some sequential interleaving of operations. Java volatile and C++ std::atomic with memory_order_seq_cst provide SC guarantees.

PyTorch distributed training (DDP) uses eventual consistency for gradient aggregation via AllReduce. There are no strict consistency guarantees - and that is fine: the forward pass for the next iteration waits for AllReduce to finish. The consistency model is chosen to match the task semantics.

What does Sequential Consistency guarantee?

Distributed Shared Memory: A Unified Address Space

DSM is a system that gives the programmer the illusion of shared memory, even though data is physically distributed across different machines. A process accesses an address without knowing whether the data is local or across the network. It is a software illusion layered on distributed hardware.

Page-based DSM (IVY, TreadMarks): the unit of sharing is a memory page (usually 4KB). On a miss - page fault triggers a network request to the page owner, then data transfer. False sharing: two processes use different variables on the same page and constantly interfere with each other.

NUMA is modern mainstream: all server x86 and ARM CPUs are NUMA-aware. numactl and taskset allow pinning processes to NUMA nodes. Poor NUMA locality is a common cause of unexpected performance drops in ML training on multi-socket servers.

What is false sharing in the context of DSM?

Replication: Reliability Through Redundancy

Replication in DSM serves two purposes: fault tolerance (data survives node failure) and read throughput (multiple nodes can read simultaneously). But every write must propagate to all replicas - and this is where the CAP theorem applies.

Primary-backup: one node (primary) accepts writes and replicates to backups. If the primary fails, a backup becomes the new primary. Redis Sentinel and PostgreSQL streaming replication use this model. Simple scheme, but the primary is a write bottleneck.

Cassandra and DynamoDB use quorum replication without a single leader. Replication factor R, write quorum W, read quorum Q. When W + Q > R, overlap is guaranteed - at least one node in the read quorum has seen the latest write. Typical configuration: R=3, W=2, Q=2.

Minimum read quorum in Cassandra with R=5, W=3?

Coherence Protocols: MSI, MESI, MOESI

Cache coherence is the problem of keeping caches of different CPUs consistent. Without a coherence protocol: CPU 0 reads X=0 and caches it. CPU 1 writes X=1. CPU 0 reads X and gets stale 0 from its cache. This breaks program correctness.

The MESI protocol: a cache line has 4 states. Modified (dirty, only mine), Exclusive (clean, only mine), Shared (clean, held by multiple), Invalid (stale). On write: transition to Modified + send Invalidate to everyone with a Shared copy.

Directory-based coherence (vs snooping): with many CPUs, broadcasting every Invalidate is expensive. A directory tracks which nodes hold a copy of each cache line. Invalidate is sent only to those nodes. AMD EPYC and Intel Xeon Scalable use directory-based coherence for cross-NUMA traffic.

Distributed Shared Memory delivers the same performance as local memory

DSM is a software abstraction over a slow network; latency between nodes is 10-100x higher than local RAM

Hiding the network behind an abstraction is convenient for programming, but it does not change physics. Locality-aware programming (matching data placement to computation) is critical for DSM and NUMA system performance

What happens to a cache line in the Shared state when a write is attempted?

Key Ideas

**Consistency models** are contracts about write visibility order; stronger = slower, weaker = more scalable
**DSM** provides the illusion of unified memory over distributed hardware; false sharing is a hidden performance killer at fine granularity
**Replication** solves fault tolerance and read throughput; the CAP theorem limits what can be guaranteed simultaneously
**MESI protocol** ensures cache coherence: Invalidate on write to Shared guarantees everyone sees the new value

Вопросы для размышления

Why did Google move from MapReduce toward more memory-sharing approaches in newer systems (Dataflow, Flume)?
How does the choice of consistency model affect the horizontal scaling potential of a distributed system?
MESI works well for a small number of CPUs. Why are directory-based protocols necessary at hundreds of nodes?

Связанные уроки

par-05 — Shared Memory is the foundation; DSM generalizes it to the network level
par-10 — Memory ordering and barriers are the foundation of consistency models
par-14 — MapReduce addresses a different aspect: computation locality, not memory
par-16 — SIMD and NUMA are the physical implementation of memory hierarchy
par-09 — Lock-free data structures implement part of consistency requirements
os-07-memory
dist-03-fallacies

Parallel Computing

Distributed Shared Memory

**PyTorch DDP** - Distributed Data Parallel: each GPU holds a model copy, AllReduce synchronizes gradients after backward - eventual consistency in action
**NUMA-aware ML** - TensorFlow and PyTorch have NUMA-aware allocators: data is placed close to the GPU/CPU that will process it
**Apache Spark** - RDD (Resilient Distributed Dataset) is a DSM abstraction for big data: the programmer sees a single collection, the system manages partitions across the cluster

Consistency Models: A Spectrum of Guarantees

What does Sequential Consistency guarantee?

Distributed Shared Memory: A Unified Address Space

What is false sharing in the context of DSM?

Replication: Reliability Through Redundancy

Minimum read quorum in Cassandra with R=5, W=3?

Coherence Protocols: MSI, MESI, MOESI

Distributed Shared Memory delivers the same performance as local memory

DSM is a software abstraction over a slow network; latency between nodes is 10-100x higher than local RAM

What happens to a cache line in the Shared state when a write is attempted?

Key Ideas

**Consistency models** are contracts about write visibility order; stronger = slower, weaker = more scalable
**DSM** provides the illusion of unified memory over distributed hardware; false sharing is a hidden performance killer at fine granularity
**Replication** solves fault tolerance and read throughput; the CAP theorem limits what can be guaranteed simultaneously
**MESI protocol** ensures cache coherence: Invalidate on write to Shared guarantees everyone sees the new value

Вопросы для размышления

Why did Google move from MapReduce toward more memory-sharing approaches in newer systems (Dataflow, Flume)?
How does the choice of consistency model affect the horizontal scaling potential of a distributed system?
MESI works well for a small number of CPUs. Why are directory-based protocols necessary at hundreds of nodes?

Связанные уроки

par-05 — Shared Memory is the foundation; DSM generalizes it to the network level
par-10 — Memory ordering and barriers are the foundation of consistency models
par-14 — MapReduce addresses a different aspect: computation locality, not memory
par-16 — SIMD and NUMA are the physical implementation of memory hierarchy
par-09 — Lock-free data structures implement part of consistency requirements
os-07-memory
dist-03-fallacies

Distributed Shared Memory

Consistency Models: A Spectrum of Guarantees

Distributed Shared Memory: A Unified Address Space

Replication: Reliability Through Redundancy

Coherence Protocols: MSI, MESI, MOESI

Related Topics

Key Ideas

Вопросы для размышления

Связанные уроки

Distributed Shared Memory

Consistency Models: A Spectrum of Guarantees

Distributed Shared Memory: A Unified Address Space

Replication: Reliability Through Redundancy

Coherence Protocols: MSI, MESI, MOESI

Related Topics

Key Ideas

Вопросы для размышления

Связанные уроки