Real-Time Backend

Notification Fan-out

Kylie Jenner (398M followers) posts a photo on Instagram. Within 30 seconds the system has to queue nearly 400 million notifications, while the API still responds instantly. How does this work without collapsing?

  • **Twitter/X:** hybrid fan-out - push for regular users (<10k followers), pull for celebrities. That is how it handles ~500M tweets per day without overloading storage.
  • **Meta (Facebook):** the Titan system processes >1B notifications per day via Kafka + Cassandra with multi-level batching. Each layer of the stack batches independently: events, DB writes, push tokens.
  • **Apple APNs:** accepts at most 300 requests/sec per token. Without throttling, large apps hit the rate limit immediately and lose deliverability for everyone, not just active users.
  • **SendGrid:** caps new IPs at 200 emails/day at the start. Violating the warm-up protocol lands you on provider spam lists, and recovering IP reputation takes weeks.

Fan-out to Millions of Recipients

When an Instagram celebrity with 50M followers posts a photo, the system has to create a notification for each of them. This is called **fan-out**: one event multiplied into millions of tasks. The naive approach, a synchronous loop inside the post request, does not scale: 50M INSERTs take hours, and the user is left waiting on the API the whole time.

Twitter ran into a fan-out problem with celebrities in 2013. A Barack Obama post triggered ~8M notifications, and the system lagged for minutes. The fix: a hybrid - fan-out on write for regular users, fan-out on read for celebrities (>10k followers).
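The hybrid rule can be sketched in a few lines. This is a minimal in-memory illustration, not Twitter's implementation: `HybridFeed`, `choose_fanout`, and the data layout are hypothetical names, and only the 10k threshold comes from the text.

```python
from collections import defaultdict

CELEBRITY_THRESHOLD = 10_000  # follower cutoff quoted in the text


def choose_fanout(follower_count: int, threshold: int = CELEBRITY_THRESHOLD) -> str:
    """Return 'write' (push) for regular authors, 'read' (pull) for celebrities."""
    return "read" if follower_count > threshold else "write"


class HybridFeed:
    """Hypothetical in-memory model of hybrid fan-out."""

    def __init__(self, followers: dict):
        self.followers = followers                 # author -> set of follower ids
        self.following = defaultdict(set)          # follower -> set of authors
        for author, fans in followers.items():
            for fan in fans:
                self.following[fan].add(author)
        self.inboxes = defaultdict(list)           # filled at write time (push)
        self.celebrity_posts = defaultdict(list)   # merged at read time (pull)

    def publish(self, author: str, post: str, threshold: int = CELEBRITY_THRESHOLD) -> None:
        fans = self.followers.get(author, set())
        if choose_fanout(len(fans), threshold) == "write":
            # Regular author: copy the post into every follower's inbox now.
            for fan in fans:
                self.inboxes[fan].append((author, post))
        else:
            # Celebrity: store once, followers pull it when they read.
            self.celebrity_posts[author].append(post)

    def fetch(self, user: str) -> list:
        items = list(self.inboxes[user])
        for author in self.following[user]:
            items.extend((author, p) for p in self.celebrity_posts.get(author, []))
        return items
```

Under this rule a celebrity post costs one write regardless of audience size, and the per-follower work moves to read time, where it is spread across the followers who actually open the app.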

The key insight is asynchrony. The post call returns 200 OK instantly, and fan-out runs in the background through a message queue (Kafka, SQS, RabbitMQ). The user does not wait for notifications to reach all 50M followers.
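A rough sketch of that split, using Python's standard `queue.Queue` as a stand-in for Kafka, SQS, or RabbitMQ. The names `create_post`, `fanout_worker`, and `delivered` are hypothetical and exist only for illustration.

```python
import queue
import threading

fanout_queue = queue.Queue()   # stand-in for a real message broker
delivered = []                 # (follower, text) pairs, for demonstration only


def create_post(author: str, text: str) -> dict:
    """API handler: enqueue a single fan-out event and return immediately."""
    fanout_queue.put({"author": author, "text": text})
    return {"status": 200}


def fanout_worker(followers: dict) -> None:
    """Background worker: expands each event into per-follower notifications."""
    while True:
        event = fanout_queue.get()
        if event is None:              # sentinel value: stop the worker
            fanout_queue.task_done()
            break
        for follower in followers.get(event["author"], []):
            delivered.append((follower, event["text"]))
        fanout_queue.task_done()
```

A worker is started once with `threading.Thread(target=fanout_worker, args=(followers,), daemon=True).start()`. The handler's cost is one `put`, independent of how many followers the author has.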

A user has 20 million followers. Which fan-out strategy do you pick for their posts?

Batch Processing of Notifications

When fan-out spawns millions of tasks, workers have to process them in batches. Otherwise the per-job overhead eats all the throughput. Batch processing groups identical operations to reduce latency and cost.

Batch size is a trade-off. Too small (10) and the per-call overhead dominates. Too large (100k) and one failure loses the whole batch, plus the transaction holds a lock too long. In practice: 500-1000 for DB inserts, 100-500 for push services (the FCM limit is 500 tokens per call).

Meta (Facebook) processes ~1B notifications per day. Their Titan system batches at every layer: grouping events before writing to Cassandra, grouping push tokens before sending to APNs/FCM, grouping email addresses before handing off to SendGrid.

FCM (Firebase Cloud Messaging) accepts a maximum of 500 tokens per multicast call. You need to push to 50,000 users. How many FCM calls do you need?
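The arithmetic is a ceiling division, and the batching itself is a slice loop. A minimal sketch, assuming the 500-token limit quoted above; `chunked` and `calls_needed` are hypothetical helpers, and no real FCM client is called.

```python
from math import ceil
from typing import Iterator, Sequence

FCM_MAX_TOKENS = 500  # per-multicast limit quoted in the text


def calls_needed(recipients: int, size: int = FCM_MAX_TOKENS) -> int:
    """Number of multicast calls required to cover all recipients."""
    return ceil(recipients / size)


def chunked(tokens: Sequence[str], size: int = FCM_MAX_TOKENS) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` tokens."""
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]
```

For 50,000 recipients, `calls_needed(50_000)` returns 100. Each slice from `chunked` would be passed to one multicast call, so a failed call only requires retrying that slice.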

Priority Queue for Notifications

Not all notifications are equally urgent. A payment push at 3 AM is critical, a marketing blast is not. A priority queue guarantees delivery of important messages even when the system is overloaded, by pushing low-priority tasks back.

Slack uses three priority levels: critical (direct messages), normal (channel notifications), low (email digests). During a delivery system incident, Slack intentionally sheds lower-priority traffic to guarantee delivery of the higher-priority levels. Users notice the digest delay but do not lose important alerts.

In practice, priority queues are implemented either as separate physical queues (different Kafka topics, different SQS queues) or via a Redis Sorted Set score, where score is `priority * 1e13 - timestamp`, a larger `priority` means more urgent, and workers read from the highest score down. That gives processing by priority, with FIFO inside each priority, as long as the millisecond timestamp stays below 1e13.
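The score formula can be checked without Redis. The sketch below is an in-memory stand-in built on `heapq`, assuming millisecond timestamps, a larger priority number for more urgent messages, and consumers that pop the highest score first (in Redis that would correspond to `ZPOPMAX`). `PriorityNotificationQueue` is a hypothetical name.

```python
import heapq

PRIORITY_WEIGHT = 1e13  # from the formula in the text; larger than any ms timestamp today


def score(priority: int, timestamp_ms: int) -> float:
    """Higher priority wins; within one priority, the older timestamp scores higher."""
    return priority * PRIORITY_WEIGHT - timestamp_ms


class PriorityNotificationQueue:
    """In-memory stand-in for a sorted set consumed from the highest score."""

    def __init__(self):
        self._heap = []

    def push(self, notification: str, priority: int, timestamp_ms: int) -> None:
        # heapq is a min-heap, so the score is negated to pop the maximum first.
        heapq.heappush(self._heap, (-score(priority, timestamp_ms), notification))

    def pop(self) -> str:
        return heapq.heappop(self._heap)[1]

    def __len__(self) -> int:
        return len(self._heap)
```

Because the timestamp is subtracted, two messages of equal priority come out oldest first, which is the FIFO property described above.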

The system is overloaded. Workers cannot keep up with the notification stream. Which notification should be processed first?

Notification Throttling

Throttling is the deliberate capping of notification send rate. It is not a bug, it is a feature: protection against spam, compliance with provider limits (APNs, FCM, SendGrid), and, most importantly, respect for the user. Sending someone 50 pushes per second is a quick way to get the app uninstalled.

SendGrid auto-throttles senders: if an IP sends >1000 emails/min without warm-up, mail goes to spam or gets blocked. Proper IP ramp-up: start at 200 emails/day, double every 2 days. Amazon SES has a default limit of 14 emails/sec, which can be raised via a support request.
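The ramp-up rule is simple enough to compute directly. This is only an arithmetic sketch of "start at 200, double every 2 days" as stated above, not a SendGrid API; both function names are hypothetical, and day 0 is assumed to be the first sending day.

```python
def warmup_daily_limit(day: int, start: int = 200, double_every_days: int = 2) -> int:
    """Daily send cap on a given day of warm-up (day 0 is the first day)."""
    if day < 0:
        raise ValueError("day must be non-negative")
    return start * 2 ** (day // double_every_days)


def days_until(target_per_day: int, start: int = 200, double_every_days: int = 2) -> int:
    """First day on which the cap reaches at least target_per_day."""
    day = 0
    while warmup_daily_limit(day, start, double_every_days) < target_per_day:
        day += double_every_days
    return day
```

Under these assumptions the cap is 200 on days 0 and 1, 400 on days 2 and 3, and first reaches 100,000 emails/day on day 18.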

A common misconception is that throttling means losing data: a notification not sent right away is lost.

In reality, throttling is flow control. Delayed or aggregated notifications are not lost - they are either deferred to the next window or collapsed into one message with a cumulative counter.

User experience matters more than instantaneous delivery. 200 separate pushes is worse than one '200 likes on your post'. Throttling is about delivery quality, not data loss.
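Collapsing events into one message per window might look like the sketch below. It assumes like events arrive as `(user_id, post_id, timestamp_seconds)` tuples and uses a fixed one-hour window; `aggregate_likes` and the message shape are hypothetical.

```python
from collections import Counter


def aggregate_likes(events, window_seconds: int = 3600) -> list:
    """Collapse like events into one message per (user, post, time window)."""
    counts = Counter()
    for user_id, post_id, ts in events:
        window = int(ts // window_seconds)      # fixed, non-overlapping windows
        counts[(user_id, post_id, window)] += 1

    messages = []
    for (user_id, post_id, _window), n in sorted(counts.items()):
        text = "1 like on your post" if n == 1 else f"{n} likes on your post"
        messages.append({"user_id": user_id, "post_id": post_id, "text": text})
    return messages
```

No event is discarded here: every like is counted, and only the number of pushes shrinks, from one per like to one per window.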

A user got 200 likes on a post in an hour. What is the right way to throttle notifications?

Summary

  • **Fan-out strategy depends on audience**: push (on write) for regular users, pull (on read) for celebrities with millions of followers - the hybrid pattern is used by Instagram and Twitter.
  • **Batching cuts load by orders of magnitude**: 50,000 push notifications = 100 FCM calls at 500 tokens each instead of 50,000 separate HTTP requests.
  • **Throttling is user-friendly**: aggregating '200 likes' beats 200 separate pushes, and rate limiting protects IP reputation and prevents provider blocks.

Related Topics

Notification fan-out intersects with several core patterns from distributed systems:

  • Message Queues (Kafka, SQS) — The foundation of asynchronous fan-out - without queues, publishing would block until every notification finishes
  • Rate Limiting — Notification throttling is a specialized form of rate limiting, applied both to incoming events and to outgoing provider calls
  • Activity Feeds — Notification fan-out and activity feed fan-out solve a similar problem with different methods - covered in the next lesson

Questions for Reflection

  • An app suddenly goes viral: one post collected 10M likes in an hour. Which parts of the notification system break first, and why?
  • How do you pick the right batch size for bulk INSERT in PostgreSQL? Which factors drive the optimal value?
  • A user complains: 'I get like notifications 2 hours late.' What could have gone wrong in the priority queue system?

Related Lessons

  • sd-09-message-queue