Real-Time Backend

Message Ordering

In a distributed system there is no time machine. An event that 'happened first' on one node can arrive later at another. Lamport clocks, vector clocks, and causal ordering are the tools for reasoning about time where shared time does not exist.

  • Cassandra uses Lamport timestamps (client-side microseconds) for last-write-wins conflict resolution. Clock skew between clients can lead to losing fresh data
  • DynamoDB (the original Dynamo) stores a per-key vector clock and returns concurrent versions to the client for semantic reconciliation instead of auto-resolving conflicts
  • Kafka guarantees FIFO ordering within a partition via a single partition leader, with no cross-partition ordering. A deliberate trade-off for scalability
  • Slack maintains per-channel sequence numbers to order messages and thread replies, guaranteeing causal order: the reply is always delivered after the original
  • Google Spanner uses TrueTime with atomic clocks and GPS to guarantee external consistency without vector clocks, at the cost of about 7ms wait per commit

Ordering Problem

A distributed system has no shared clock. Each node lives on its own time, and two messages sent 'simultaneously' can arrive in opposite order on different nodes. This isn't a bug in a specific implementation. It is a physical property of systems without shared memory.

Slack: a deletion arrives before the original message

User A types 'hello' and then deletes it. User B receives 'delete' before the message itself: the network delivered packets out of order. Slack maintains a per-channel sequence number so the client can buffer and reorder events before rendering.

There are three levels of ordering guarantees, each more expensive than the last:

  1. **FIFO order**: messages from one sender arrive in send order. The cheapest level.
  2. **Causal order**: if event B causally depends on A, then B is never delivered before A. Mid-tier.
  3. **Total order**: every node sees every message in the same order. Requires consensus, expensive.

Kafka gives FIFO within a partition for free. The partition key locks in the order for one producer. Between partitions there is no order. This is a deliberate trade-off: total order across the entire topic would kill horizontal scaling.

A Kafka topic has 3 partitions. The producer sends m1 to P0, then m2 to P1. The consumer reads both partitions. Which statement is true?

Lamport Clocks

Leslie Lamport proposed a simple idea in 1978: instead of physical time, use logical counters. Each node holds a monotonically growing counter. On send, the counter is incremented and embedded in the message. On receive, the counter becomes max(local, received) + 1.

Key property: if A happened-before B (A → B), then timestamp(A) < timestamp(B). The converse is not true: equal or increasing timestamps do not imply causal connection.

Cassandra: last-write-wins via Lamport timestamps

Cassandra uses client-supplied timestamps (microseconds) as a Lamport clock for conflict resolution. If two nodes received different values for one key, the write with the larger timestamp wins. The catch: client clock skew. If client A's clock is 1 second fast, its 'stale' write overwrites client B's fresh write.

Google Spanner solves clock skew differently: the TrueTime API returns an interval [earliest, latest] with bounded error < 7ms (atomic clocks + GPS). Spanner waits until intervals stop overlapping before committing a transaction. This delivers external consistency, which Lamport clocks cannot.

  • **Plus**: minimal overhead, one uint64 per message
  • **Plus**: partial order is deterministic and simple to implement
  • **Minus**: concurrent events are indistinguishable. Two events with different timestamps can still be parallel
  • **Minus**: causality cannot be inferred from timestamps alone without extra context

Node A sent a message with Lamport timestamp 10. Node B receives it with a local counter at 15. What is B's counter after receive?

Vector Clocks

Lamport clocks can't tell 'A happened before B' from 'A and B are independent'. Vector clocks fix that: each node keeps a vector of counters, one per node in the system. This makes causality detectable and identifies concurrent events.

DynamoDB: vector clock per-key versioning

DynamoDB (in the original Amazon Dynamo paper) uses vector clocks to track object versions. Each update to a key adds a (server, counter) entry to the version vector. On conflict (two incomparable vectors, both concurrent) DynamoDB returns both versions to the client. The client must resolve the conflict and write back the merged version. This is semantic reconciliation: the system does not impose a resolution policy.

The downside of vector clocks: size grows with node count. The Amazon Dynamo paper describes an optimization, clock truncation: limit the vector to the N most recent entries and drop the rest. This can lose causality info, but works well in practice for most access patterns.

  • Lamport Clock — One uint64. Gives partial order, but doesn't distinguish concurrent from causal. O(1) overhead per message.
  • Vector Clock — A vector of N uint64 (N = node count). Precisely identifies causality and concurrent events. O(N) overhead per message.

Node A's vector clock: {A:3, B:2, C:1}. Node B's vector clock: {A:2, B:3, C:1}. What is the relation between these events?

Causal Order

Causal ordering is the guarantee that if event B causally depends on A (the user saw A before sending B), then every node receives A before B. It is weaker than total order (doesn't require identical order for independent events) and stronger than FIFO (tracks dependencies across senders).

Slack: threading via causal order

When a user replies in a thread, Slack attaches the original message's sequence number to the reply. The server guarantees that the thread reply is only delivered after the parent message. Without this, the client could show a reply with no original, a classic artifact of broken causal order.

Causal broadcast via vector clocks: a message m with vector clock VC(m) from node i is delivered to node j only when:

  1. VC(m)[i] = VC(j)[i] + 1, this is the next message from i
  2. VC(m)[k] <= VC(j)[k] for all k != i, every dependency is already delivered

Causal consistency is one of the points in the PACELC theorem. Cassandra offers eventual consistency by default and tunable consistency per request. Causal consistency in Cassandra is achieved via lightweight transactions (LWT) on Paxos. Expensive, but sometimes necessary.

Causal order and total order are the same thing, total order just costs more

They are qualitatively different guarantees. Causal order only orders causally-related events. Total order requires global agreement on order even for independent events.

Under causal order, two independent events can be delivered in different order to different nodes, and that is fine when there is no causal link between them. Total order forbids this: every node must see the same order for every message. Kafka achieves total order within a partition via a single leader broker; Apache ZooKeeper uses ZAB (Zookeeper Atomic Broadcast) for total order of writes.

A system uses causal broadcast. Node C received message m2 with VC={A:1, B:1} from B but has not yet received m1 with VC={A:1, B:0} from A. What does C do?

Key ideas

  • Physical time is unreliable in distributed systems. Clock skew, network delays, and NTP error make 'simultaneity' undefined
  • Lamport clocks deliver partial order with O(1) overhead: if A → B, then TS(A) < TS(B). The converse does not hold. Causality cannot be inferred from timestamps alone
  • Vector clocks precisely identify causality and concurrent events at O(N) overhead. Each node holds a vector of counters sized to the node count
  • Causal broadcast buffers messages until every dependency is delivered. This gives causal order without total-order overhead
  • Choice of guarantee is a trade-off: FIFO is cheap, causal order needs buffering and vector clocks, total order requires consensus (Paxos/Raft)

Related topics

Message ordering intersects with several core distributed-systems topics:

  • Consensus (Paxos/Raft) — Total order requires consensus. Raft log replication guarantees identical record order across all replicas
  • Eventual Consistency — Eventually consistent systems (Cassandra, DynamoDB) use Lamport timestamps or vector clocks to resolve conflicts without a coordinator
  • Event Sourcing — An event log needs total order for deterministic state reconstruction, which is why event-sourced systems often use a single writer or sequence numbers

Вопросы для размышления

  • Cassandra lets clients set the timestamp manually. What abuse scenarios does this open up, and how can they be mitigated?
  • Picture a chat app. What level of ordering guarantee do you need: FIFO, causal, or total? Does the answer change for direct messages vs group chats?
  • DynamoDB returns conflicting versions to the client. Name three different semantic reconciliation strategies and the situations where each fits.

Связанные уроки

  • db-04-cap
Message Ordering

0

1

Sign In