Chat architecture

WhatsApp delivers 100 billion messages a day. Telegram supports groups of up to 200,000 people. Discord serves 19 million servers. What do their architectures have in common?

**WhatsApp** (100B messages/day) uses a deterministic channel key `min(id):max(id)` - no DB lookup on the first message
**Slack** introduced threads in 2017 and added `threadId` plus `parentId` - two fields instead of one to allow future nesting
**Discord** shards fan-out by server size: under 1,000 members - direct broadcast, larger - a Kafka pipeline
**Telegram** mega-groups (200k members) switch to a pull model when the chat opens instead of pushing to everyone

1:1 chat architecture

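A 1:1 chat is the simplest topology, but it raises a non-trivial question: how do you identify the channel between two users? With N users there are up to N*(N-1)/2 possible pairs, so pre-creating a room per pair is wasteful, and creating one on the first message normally requires a lookup to check whether it already exists. WhatsApp and Telegram use a deterministic key instead: `min(userId_A, userId_B):max(userId_A, userId_B)`. Both sides compute the same key without asking the database.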
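A minimal sketch of the deterministic key, assuming numeric user IDs that are compared as integers (the function name `dm_channel_id` is hypothetical, not taken from any of the products mentioned):

```python
def dm_channel_id(user_a: int, user_b: int) -> str:
    """Build the same channel key no matter who sends the first message."""
    # Sorting puts the smaller ID first, which is min(a, b):max(a, b).
    low, high = sorted((user_a, user_b))
    return f"{low}:{high}"


# Both participants derive an identical key.
print(dm_channel_id(42, 7))  # 7:42
print(dm_channel_id(7, 42))  # 7:42
```

If IDs are strings (UUIDs, for example), the same idea works as long as every service uses the same comparison rule.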

Telegram processes 15 billion messages per day. The key to its scalability is sharding by `channelId`: all messages for a single conversation live on the same shard, which guarantees data locality when paginating history.
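Routing by `channelId` can be sketched as a stable hash taken modulo the shard count. This is an illustration under assumptions, not Telegram's actual scheme: `NUM_SHARDS` and `shard_for` are hypothetical names, and a production system would more likely use consistent hashing or a lookup table so that resizing the cluster does not move most channels.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical cluster size


def shard_for(channel_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a channel ID to a shard index that is stable across processes.

    Python's built-in hash() is salted per process for strings, so a
    digest is used to get the same answer on every server.
    """
    digest = hashlib.sha256(channel_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every message of conversation "7:42" lands on the same shard,
# so paginating its history touches a single shard.
assert shard_for("7:42") == shard_for("7:42")
```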

Storing a channel lazily (on the first message) saves storage: a user with 1,000 contacts has up to 1,000 potential DM channels but actually uses 20-30. Facebook Messenger works exactly that way.
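Lazy creation pairs naturally with the deterministic key: the channel row is written only when the first message arrives. The sketch below uses an in-memory SQLite database purely for illustration; the table layout and the `send_dm` helper are assumptions, not any product's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE channels (id TEXT PRIMARY KEY)")
conn.execute(
    """
    CREATE TABLE messages (
        id         INTEGER PRIMARY KEY,
        channel_id TEXT    NOT NULL REFERENCES channels(id),
        sender_id  INTEGER NOT NULL,
        body       TEXT    NOT NULL
    )
    """
)


def send_dm(sender_id: int, recipient_id: int, body: str) -> str:
    """Store a message, creating the channel row only if it is missing."""
    low, high = sorted((sender_id, recipient_id))
    channel_id = f"{low}:{high}"
    with conn:  # one transaction: channel upsert plus message insert
        # No-op when the channel already exists, so no prior SELECT is needed.
        conn.execute("INSERT OR IGNORE INTO channels (id) VALUES (?)", (channel_id,))
        conn.execute(
            "INSERT INTO messages (channel_id, sender_id, body) VALUES (?, ?, ?)",
            (channel_id, sender_id, body),
        )
    return channel_id


send_dm(42, 7, "hi")
send_dm(7, 42, "hello")  # reuses the channel created by the first message
```

Contacts that never exchange a message never get a row in `channels`.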

Users A (id=5) and B (id=3) start a conversation. The deterministic channel ID is...

Group chats: scaling delivery

Group chats break the simple 1:1 model. A Telegram group can hold 200,000 members. When a new message arrives it has to be delivered to every online member - this is the fan-out problem.

Discord uses a hybrid approach: up to 1,000 members - direct fan-out through internal pub/sub; above that - asynchronous queues via Elixir/Phoenix Channels. Across 19 million servers Discord processes 4 million messages per minute.

**Small groups (< 100)**: direct fan-out via WebSocket broadcast
**Medium groups (100-10k)**: Redis Pub/Sub with sharding across servers
**Large groups (> 10k)**: Kafka fan-out with batch delivery by workers
**Telegram mega-groups (> 100k)**: push delivery to every member is not guaranteed; only online users receive the push, and everyone else pulls new messages when they open the chat
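The tiers above amount to a dispatch on member count. A rough sketch, using the thresholds from the list; the enum and function names are hypothetical, and real systems would tune the cut-offs against measured load:

```python
from enum import Enum


class FanOut(Enum):
    DIRECT_WEBSOCKET = "direct WebSocket broadcast"
    REDIS_PUBSUB = "Redis Pub/Sub, sharded across servers"
    KAFKA_BATCH = "Kafka fan-out with batch workers"
    PUSH_ONLINE_PULL_ON_OPEN = "push to online users, pull on open"


def choose_fan_out(member_count: int) -> FanOut:
    """Pick a delivery strategy from the group size."""
    if member_count < 100:
        return FanOut.DIRECT_WEBSOCKET
    if member_count <= 10_000:
        return FanOut.REDIS_PUBSUB
    if member_count <= 100_000:
        return FanOut.KAFKA_BATCH
    return FanOut.PUSH_ONLINE_PULL_ON_OPEN


print(choose_fan_out(25).value)       # direct WebSocket broadcast
print(choose_fan_out(200_000).value)  # push to online users, pull on open
```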

A Discord server with 50,000 members receives a new message. Why is direct fan-out over WebSocket not a fit?

Threads: nested conversations

Threads are replies tied to a specific message, forming a nested conversation. Slack introduced them in 2017, which significantly complicated the data model: every message can now be the root of a tree.

Slack stores 10+ billion messages. The key decision: `threadId` and `parentId` are two different fields. `threadId` points to the thread root (for group-by); `parentId` points to the immediate parent (for future nested-reply support). Slack itself uses only one level of nesting.

Denormalizing `replyCount` and `lastReplyAt` is critical for performance. Without them, every channel-list render would require an aggregating query across all thread messages.
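One way to keep `replyCount` and `lastReplyAt` current is to update the root row in the same transaction that inserts the reply. The sketch below uses in-memory SQLite and snake_case column names; the schema, the helper names, and the choice to leave `thread_id` empty on messages that are not replies are all assumptions for illustration, not Slack's actual model.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE messages (
        id            INTEGER PRIMARY KEY,
        channel_id    TEXT    NOT NULL,
        thread_id     INTEGER REFERENCES messages(id),  -- thread root
        parent_id     INTEGER REFERENCES messages(id),  -- immediate parent
        body          TEXT    NOT NULL,
        reply_count   INTEGER NOT NULL DEFAULT 0,       -- denormalized
        last_reply_at TEXT                              -- denormalized
    )
    """
)


def post_message(channel_id: str, body: str) -> int:
    """Insert a top-level message and return its ID."""
    with conn:
        cur = conn.execute(
            "INSERT INTO messages (channel_id, body) VALUES (?, ?)",
            (channel_id, body),
        )
    return cur.lastrowid


def post_reply(root_id: int, body: str, sent_at: str) -> int:
    """Insert a reply and bump the root's counters in one transaction."""
    with conn:
        row = conn.execute(
            "SELECT channel_id FROM messages WHERE id = ?", (root_id,)
        ).fetchone()
        if row is None:
            raise ValueError(f"no such root message: {root_id}")
        # With one level of nesting, the parent is the thread root itself.
        cur = conn.execute(
            "INSERT INTO messages (channel_id, thread_id, parent_id, body) "
            "VALUES (?, ?, ?, ?)",
            (row[0], root_id, root_id, body),
        )
        conn.execute(
            "UPDATE messages SET reply_count = reply_count + 1, "
            "last_reply_at = ? WHERE id = ?",
            (sent_at, root_id),
        )
    return cur.lastrowid
```

Rendering a root message then reads its reply count and last-reply time from the root row alone, with no aggregate over the thread.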

When loading channel messages in a Slack-style app you only want to show root messages (not thread replies). Which WHERE clause is correct?

Reactions: real-time counters

Emoji reactions look simple, but they create hotspots: a popular Slack message can pick up hundreds of reactions per minute. The naive approach (UPDATE a counter on every reaction) makes every writer contend for the same row, so concurrent reactions are serialized and throughput collapses.

Slack uses a CRDT-like approach for reactions: every add or remove is a separate entry in an append-only log. The final counter is computed at read time. This handles concurrent reactions without locks.
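The append-only idea can be illustrated with a tiny in-process log. This is a sketch of the general technique only; the `ReactionEvent` type and helper names are hypothetical, and it makes no claim about how Slack stores or replays its events.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionEvent:
    message_id: int
    user_id: int
    emoji: str
    added: bool  # True = reaction added, False = reaction removed


log: list[ReactionEvent] = []


def record(event: ReactionEvent) -> None:
    """Writers only append; nothing is updated in place."""
    log.append(event)


def reaction_counts(message_id: int) -> Counter:
    """Replay the log at read time to derive the current counters."""
    active: set[tuple[int, str]] = set()
    for event in log:
        if event.message_id != message_id:
            continue
        key = (event.user_id, event.emoji)
        if event.added:
            active.add(key)       # a duplicate add is a no-op
        else:
            active.discard(key)   # removing a missing reaction is a no-op
    return Counter(emoji for _, emoji in active)
```

Because adds and removes are idempotent per `(user, emoji)` pair, a retried or duplicated event does not inflate the count.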

Store `(messageId, userId, emoji)` as a unique row - one user cannot place the same reaction twice
Denormalize counters into `message.reactionSummary` JSONB - read from a single row
Use atomic UPDATE with JSONB functions to avoid race conditions
Throttle reaction broadcasts: if 10 reactions arrive within 100 ms, send a single batch
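The last item in the list, throttled broadcasts, can be sketched as a small batcher that merges reactions arriving inside one window. The class below is hypothetical and deliberately simplified: it closes a window only when the next reaction arrives, whereas a real service would also flush from a timer. The clock is injectable so the behaviour can be checked without sleeping.

```python
import time
from collections import Counter
from typing import Callable, Dict, Optional, Tuple

Batch = Dict[Tuple[int, str], int]  # (message_id, emoji) -> reactions in window


class ReactionBatcher:
    def __init__(self, window_s: float = 0.1,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self.pending: Counter = Counter()
        self.window_start: Optional[float] = None

    def add(self, message_id: int, emoji: str) -> Optional[Batch]:
        """Buffer one reaction; return the previous batch if its window closed."""
        now = self.clock()
        flushed = None
        if self.window_start is not None and now - self.window_start >= self.window_s:
            flushed = self.flush()
        if self.window_start is None:
            self.window_start = now  # first reaction opens a new window
        self.pending[(message_id, emoji)] += 1
        return flushed

    def flush(self) -> Batch:
        """Return everything buffered so far and reset the window."""
        batch = dict(self.pending)
        self.pending.clear()
        self.window_start = None
        return batch
```

Ten reactions within 100 ms therefore produce one outgoing payload with a count of 10 instead of ten separate broadcasts.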

Misconception: reactions are just counters, so INCREMENT/DECREMENT on a single column is enough.

Reality: reactions need a separate table with a uniqueness constraint on `(messageId, userId, emoji)`; the counter is a read-side denormalization.

Without a separate table you cannot (1) show who reacted, (2) guarantee that one user does not place the same reaction twice, or (3) atomically undo a reaction.
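Those three capabilities fall out of a table keyed on the triple. A sketch with in-memory SQLite, assuming snake_case column names and hypothetical helper functions; the denormalized summary column is left out to keep the example short:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reactions (
        message_id INTEGER NOT NULL,
        user_id    INTEGER NOT NULL,
        emoji      TEXT    NOT NULL,
        PRIMARY KEY (message_id, user_id, emoji)  -- one reaction per user per emoji
    )
    """
)


def add_reaction(message_id: int, user_id: int, emoji: str) -> bool:
    """Return True if the reaction was new, False if it already existed."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO reactions VALUES (?, ?, ?)",
            (message_id, user_id, emoji),
        )
    return cur.rowcount == 1


def remove_reaction(message_id: int, user_id: int, emoji: str) -> bool:
    """Undo a reaction with a single DELETE; True if a row was removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM reactions WHERE message_id = ? AND user_id = ? AND emoji = ?",
            (message_id, user_id, emoji),
        )
    return cur.rowcount == 1


def who_reacted(message_id: int, emoji: str) -> list[int]:
    """List the users behind a counter, which a bare counter cannot do."""
    rows = conn.execute(
        "SELECT user_id FROM reactions WHERE message_id = ? AND emoji = ? "
        "ORDER BY user_id",
        (message_id, emoji),
    )
    return [row[0] for row in rows]
```

The boolean return values tell the caller whether the denormalized counter needs to change at all, which keeps duplicate requests from skewing it.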

Why use a separate `(messageId, userId, emoji)` table for reactions instead of a simple counter on the message?

Takeaways

**Deterministic channel ID**: `min(A,B):max(A,B)` - no need to create a channel up front, the key is computed on the fly
**Fan-out strategy depends on size**: small groups - direct broadcast, large - Kafka plus workers
**Reactions = separate table**: the counter is denormalized; the source of truth is `(messageId, userId, emoji)`

Questions for reflection

How does the architecture change when you add support for forwarding messages between channels?
Telegram channels (not groups) have millions of subscribers. How does their delivery differ from group chats?
A message with 1M views attracts 10k reactions per minute. How do you protect the database?

Related lessons

sd-14-twitter