Programming Language Theory

Memory Models and Execution Order: SC, TSO, and Weak Consistency

June 1, 2012. Knight Capital Group. A new trading system was deployed with a memory ordering bug in a multithreaded hot path. In 45 minutes, the firm executed 4 million erroneous trades and lost USD 440 million. The bug was not a logic error. The algorithm was correct under Sequential Consistency. The hardware was not running SC.

  • **LMAX Disruptor** (financial exchange): processes 6 million transactions per second using acquire-release fences instead of locks. A single misplaced memory_order_relaxed corrupts the ordering guarantees and produces incorrect trade execution.
  • **Linux kernel** implements RCU (Read-Copy-Update) using carefully placed memory barriers for each architecture. The ARM and x86 barrier sequences differ because TSO and Relaxed have different cost profiles.
  • **PyTorch distributed training** (DDP): gradient all-reduce uses NCCL, which inserts CUDA memory fences between compute kernels. Removing them produces silently diverged model weights across workers - the loss converges to different values on each GPU.

Sequential Consistency: The Illusion Programmers Want

1979. Leslie Lamport defines Sequential Consistency (SC): the result of any execution is the same as if all operations were executed in some sequential order, and the operations of each processor appear in that sequence in the order they were issued. One clean sentence. The entire history of concurrency bugs is a footnote to that sentence.

SC is what every programmer implicitly assumes when reading multithreaded code. Thread A writes x=1, thread B reads x - SC says B sees 1 if the write happened before the read in wall-clock time. The catch: modern CPUs do not guarantee SC. Not even close. They guarantee something weaker, and the gap between the two is where memory corruption, lost updates, and broken locks live.

Java's volatile and C++'s std::atomic<T> (with memory_order_seq_cst) restore SC semantics for specific variables at the cost of hardware fences. PyTorch's distributed primitives rely on SC semantics for gradient synchronization - breaking it causes silent model divergence across training replicas.

Under Sequential Consistency, which outcome is impossible for the store-load test (Thread 1: x=1; r1=y; / Thread 2: y=1; r2=x;)?

TSO and Relaxed: What Hardware Actually Does

x86 uses Total Store Order (TSO). The key hardware device: a per-core store buffer. When a core writes, the value goes into the store buffer first - not directly to the shared cache. Reads bypass the store buffer and go straight to cache. Result: a core can read a stale value from cache while its own pending write sits in the buffer. Store-Load reordering appears.

ARM and POWER use Relaxed models - the most aggressive. All four reorderings are possible: Store-Load, Store-Store, Load-Load, Load-Store. The hardware has explicit message-passing semantics internally: each core has its own view of memory and must explicitly synchronize. This is why porting lock-free code from x86 to ARM has historically been a source of crashes.

Go's runtime uses platform-specific fence sequences in its goroutine scheduler. When a Go channel send/receive triggers a context switch, the scheduler inserts the right fence for the current architecture. The Go memory model (revised 2022) follows the DRF-SC guarantee: data-race-free programs behave as if SC. Races produce undefined behavior - same as C++.

x86 TSO (Total Store Order) allows which reordering that SC does not?

Language Memory Models: Happens-Before, Atomics, and Data Races

Java 5 (2004) and C++11 (2011) introduced formal language-level memory models. The core abstraction: happens-before. Operation A happens-before B if: A and B are in the same thread and A precedes B in program order, OR A is a volatile/atomic write and B is a read of the same variable that observes A's value. Happens-before is transitive. If A hb B and B hb C, then A hb C.

A data race: two threads access the same memory location concurrently, at least one writes, and no happens-before relation orders them. Under Java and C++: a data race produces undefined behavior. Not 'might give wrong result' - undefined. The compiler is free to assume races do not exist, and this assumption powers optimizations that would otherwise be incorrect.

Acquire-release semantics are the minimum needed for correct mutexes, channels, and lock-free data structures. A mutex lock is an acquire; unlock is a release. All writes done before unlock are visible to the next thread that acquires. This is the guarantee your mutex gives. Everything else is undefined.

LMAX Disruptor (financial exchange, processes 6 million transactions/second) uses careful acquire-release ordering between producers and consumers instead of locks. The Java implementation uses Unsafe.putOrderedLong (release semantics) for writes and a volatile read for the sequence number. Removing a single fence drops throughput by 40% and produces incorrect results. Memory models are not academic - they are the difference between a working exchange and a corrupt one.

Using volatile in C++ makes concurrent access safe

C++ volatile prevents compiler reordering of accesses but provides no hardware fences and no atomic guarantees. It is meant for memory-mapped I/O, not for threading. Use std::atomic instead.

Java volatile does provide seq_cst semantics - making the C++ confusion worse. In C++, volatile and atomic are completely orthogonal. A C++ volatile variable can still be torn (non-atomic read-modify-write) and the CPU can still reorder accesses around it.

What is a data race according to the C++11 memory model?

Related Topics

Memory models connect hardware architecture, language design, and formal semantics:

  • Hoare Logic — Single-threaded correctness foundation that memory models extend to concurrent programs
  • Cache Coherence — Hardware substrate - MESI protocol - that determines what orderings are possible
  • Vector Clocks — Distributed-systems analogue of happens-before for ordering events across nodes
  • CSP Concurrency — Alternative model that avoids shared state and memory model complexity

Итоги

  • **Sequential Consistency** (Lamport 1979): all operations appear in one global order matching each thread's program order. Hardware does not guarantee this.
  • **TSO (x86)**: store buffers introduce Store-Load reordering only. Store-Store, Load-Load, Load-Store are preserved. MFENCE restores SC at a cost.
  • **Relaxed (ARM/POWER)**: all four reorderings possible. Explicit DMB/DSB fences required. Porting lock-free code from x86 to ARM without adding fences is a crash waiting to happen.
  • **Happens-before**: the language-level abstraction. Volatile/atomic writes and reads, mutex lock/unlock, and thread start/join create happens-before edges. A data race = concurrent conflicting accesses with no happens-before. Under C++/Java: data race = undefined behavior.
  • **Acquire-Release**: the minimum fence pair for correct synchronization. Cheaper than seq_cst; sufficient for mutexes, channels, and most lock-free structures.

Вопросы для размышления

  • The Java memory model (JMM) allows 'causality violations' in its formal spec - a write can appear to cause itself. Is this a theoretical curiosity or does it affect real JVM implementations?
  • Rust's type system prevents data races at compile time via ownership and Send/Sync traits. Does this make the Rust memory model simpler than C++? What guarantees remain the programmer's responsibility?
  • DRF-SC (Data-Race-Free programs execute as SC) is the implicit contract most programmers rely on. What happens when lock-free algorithms are correct under SC but broken under TSO - and how does one systematically find such bugs?

Связанные уроки

  • plt-16-axiomatic-semantics — Hoare Logic reasons about single-threaded state; memory models generalize this to concurrent execution
  • plt-23-csp — CSP message-passing is one answer to the pitfalls exposed by weak memory models
  • arch-14-multicore — Cache coherence protocols are the hardware substrate that memory models must account for
  • dist-05-vector-clocks — Vector clocks and happens-before are the distributed-systems version of the same ordering problem
  • ds-04-clocks — Lamport clocks formalize event ordering - the same idea as memory-model happens-before
  • dist-03-fallacies
Memory Models and Execution Order: SC, TSO, and Weak Consistency

0

1

Sign In