Real-Time Backend
Design: Notification Platform
One Facebook notification service outage in 2021 delayed 500 million messages by 40 minutes. How do you design a system that has no right to fall over?
- Slack sends more than 500M notifications a day across Push, Email and In-App, all going through a single Notification Platform with centralized routing and deduplication
- Uber saves millions of dollars a year by routing non-critical notifications from SMS to Push. A unified platform lets the channel change without touching product services
- Airbnb spotted APNs delivery rate dropping from 88% to 71% in 10 minutes thanks to real-time ClickHouse analytics and rolled back the change before mass user impact
- Facebook split the notification queue into three priorities: critical security events get delivered in under 1 sec even when the low-priority queue is jammed with millions of social alerts
Notification platform architecture
A Notification Platform is not just 'send an email'. It is a distributed system that takes in events from dozens of services every second, routes them across channels, deduplicates, rate-limits and guarantees delivery. Slack handles more than **500 million notifications a day**, and none of them can be lost or duplicated.
Key components
- **Event Ingestion Layer**: takes events from product services through Kafka/Pub-Sub, buffers load spikes
- **Notification Service**: stateless workers that apply business logic: who, which channel, when
- **Channel Adapters**: isolated adapters per channel (Push, Email, SMS, In-App), each with its own retry strategy
- **Preference Store**: a Redis cache on top of PostgreSQL that holds user settings per channel
- **Dedup Cache**: a Redis SET with TTL that prevents the same notification from being sent twice on retry
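The Dedup Cache can be sketched as a "first writer wins" check on an idempotency key. The snippet below is a minimal illustration, not the platform's actual code: an in-memory dictionary stands in for Redis (where the same check is usually a single `SET key 1 NX EX ttl` command), and the key layout in `dedup_key` is a hypothetical choice.

```python
import time
from typing import Callable, Dict


class DedupCache:
    """In-memory stand-in for a Redis key with TTL (SET key 1 NX EX ttl)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def first_seen(self, key: str) -> bool:
        """Return True only the first time a key is seen within the TTL window."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate inside the window: skip the send
        self._expires_at[key] = now + self._ttl
        return True


def dedup_key(event_id: str, user_id: str, channel: str) -> str:
    # Hypothetical key layout; any stable idempotency key works.
    return f"notif:{event_id}:{user_id}:{channel}"
```

A worker would call `first_seen(dedup_key(...))` before handing the notification to a channel adapter and drop the message when it returns False. The TTL should be longer than the longest retry window, otherwise a late redelivery slips through.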
Airbnb moved to a dedicated Notification Service in 2019. Before that, each of 200+ microservices called SendGrid on its own. The result: 8 different templates for 'Booking confirmed', no unified analytics, no way to disable a channel globally. Centralization removed all of this in one architectural iteration.
Why does a Notification Platform need a separate Dedup Cache if Kafka guarantees at-least-once delivery?
Multi-channel delivery
Each channel has its own guarantees, limits and cost. A push notification through APNs costs $0, while an SMS through Twilio costs $0.0075 per message. Uber spends **$15M to $20M a year** on SMS verification alone, so every channel has to be isolated and optimized on its own.
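To get a feel for the scale behind these numbers, here is a small worked calculation. It assumes the whole annual spend is billed at the single $0.0075 price quoted above, which the source does not state (real SMS pricing varies by country), so the result is an order-of-magnitude estimate only. The function names are illustrative.

```python
SMS_PRICE_USD = 0.0075   # per message, figure quoted in the text
PUSH_PRICE_USD = 0.0     # push is free at the provider level


def messages_for_spend(annual_spend_usd: float, price_per_message: float) -> float:
    """How many messages a given budget buys at a flat per-message price."""
    return annual_spend_usd / price_per_message


def savings_from_rerouting(messages_moved: float,
                           from_price: float = SMS_PRICE_USD,
                           to_price: float = PUSH_PRICE_USD) -> float:
    """Money saved by moving a number of messages to a cheaper channel."""
    return messages_moved * (from_price - to_price)
```

Under that assumption, $15M to $20M a year corresponds to roughly 2.0 to 2.7 billion SMS, and every 100 million messages moved from SMS to Push saves about $750,000.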
Channel characteristics
- **Push (APNs/FCM)**: latency under 1 sec, delivery rate around 85% (offline devices), free, fits real-time alerts
- **Email (SendGrid/SES)**: latency 1 to 60 sec, delivery rate around 98%, cost $0.0001/email, fits digests and transactional messages
- **SMS (Twilio/Vonage)**: latency under 5 sec, delivery rate around 99%, $0.0075/SMS, reserved for critical messages (OTP, safety alerts)
- **In-App**: delivery rate 100% (user is online), zero cost, but only works when the app is open
- **WebPush**: works in the browser when the tab is closed, delivery rate around 60%, free
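Because each adapter is isolated, each can carry its own retry policy. The sketch below is one hypothetical way to express that as data. The attempt counts (SMS: 1, Push: up to 3 with exponential backoff) come from this article; the 1-second base delay and the factor of 2 are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class RetryPolicy:
    attempts: int               # total tries, including the first one
    base_delay_s: float = 0.0   # wait before the first retry (assumed value)
    factor: float = 2.0         # multiplier applied to each further wait

    def delays(self) -> List[float]:
        """Waits between consecutive attempts; has attempts - 1 entries."""
        return [self.base_delay_s * self.factor ** i
                for i in range(self.attempts - 1)]


# Per-channel policies; only channels whose values appear in the text.
RETRY_POLICIES: Dict[str, RetryPolicy] = {
    "sms": RetryPolicy(attempts=1),                      # provider retries itself
    "push": RetryPolicy(attempts=3, base_delay_s=1.0),   # exponential backoff
}
```

With these values the Push adapter waits 1 s and then 2 s between its three attempts, while the SMS adapter never retries on its own and so never pays twice for the same message.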
Slack uses a **fanout strategy**: one 'new message in channel' event explodes into N notifications for every participant, each routed to the right channel adapter. With 10 million users online at the same time that produces billions of fanout operations an hour, which is why fanout is broken out into a separate, horizontally scalable service.
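The fanout step itself is simple; the difficulty is volume. A minimal sketch, assuming a preference lookup is already available as a dictionary and that the sender is skipped (both are assumptions for illustration, not Slack's documented behavior):

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator


@dataclass(frozen=True)
class Notification:
    event_id: str
    user_id: str
    channel: str


def fan_out(event_id: str,
            member_ids: Iterable[str],
            sender_id: str,
            preferred_channel: Dict[str, str],
            default_channel: str = "push") -> Iterator[Notification]:
    """Expand one channel event into one notification per recipient."""
    for user_id in member_ids:
        if user_id == sender_id:
            continue  # assumption: the author is not notified about their own message
        channel = preferred_channel.get(user_id, default_channel)
        yield Notification(event_id, user_id, channel)
```

Writing it as a generator keeps memory flat for large channels: each `Notification` can be published to the matching adapter's queue as soon as it is produced.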
Why does the SMS adapter use a retryPolicy with attempts=1 instead of exponential backoff like Push?
Prioritization and rate limiting
Without prioritization a push about a 'photo like' clogs the queue and delays a 'transaction declined' alert. Facebook split notifications into **three classes**: critical (payments, security), high (direct messages), low (social activity). Under peak load the low class is throttled to zero.
Token Bucket for per-user rate limiting
Priority queues are implemented through **separate Kafka topics** per priority, not through one topic with a priority field. Kafka does not support per-message priority inside a topic: the consumer reads each partition's offsets sequentially. Three topics are used: `notif.critical`, `notif.high`, `notif.low`. Consumers on the critical topic get more worker threads.
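The effect of separate topics can be shown with a toy model. In the sketch below, in-memory deques stand in for the three Kafka topics, and a single strict-priority `poll` stands in for "critical consumers get more worker threads". That polling rule is a simplification chosen for illustration; a real deployment would run independent consumer groups per topic.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

# Topic names from the text, ordered from highest to lowest priority.
TOPICS = ("notif.critical", "notif.high", "notif.low")


def topic_for(priority: str) -> str:
    """Map a priority class to its dedicated topic."""
    topic = f"notif.{priority}"
    if topic not in TOPICS:
        raise ValueError(f"unknown priority: {priority!r}")
    return topic


class PriorityRouter:
    """Toy router: one FIFO queue per topic, drained highest priority first."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[str]] = {t: deque() for t in TOPICS}

    def publish(self, priority: str, payload: str) -> None:
        self._queues[topic_for(priority)].append(payload)

    def poll(self) -> Optional[Tuple[str, str]]:
        for topic in TOPICS:
            queue = self._queues[topic]
            if queue:
                return topic, queue.popleft()
        return None  # nothing pending in any topic
```

Because each class has its own queue, a critical message published after a backlog of low-priority ones is still read first. With a single shared queue it would wait behind everything published before it.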
- **Global per-channel rate limit**: APNs 300 req/s per DeviceToken, FCM 600 req/s per app, protection from provider blocking
- **Per-user per-channel limit**: no more than 5 SMS/hour, 20 push/hour, protects the user from spam
- **Burst allowance**: critical events can use burst capacity (2x the limit for 60 seconds) during security incidents
- **Quiet hours**: low and high notifications are not sent from 23:00 to 08:00 in the user's local time zone; critical notifications bypass this rule
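The per-user per-channel limit above maps naturally onto a token bucket. The following is a minimal single-process sketch: the hourly limits (5 SMS, 20 push) come from the list, while the class names and the in-memory storage are assumptions. A production version would keep bucket state in Redis and update it atomically.

```python
import time
from typing import Callable, Dict, Tuple


class TokenBucket:
    """Classic token bucket: refills continuously, spends one token per send."""

    def __init__(self, capacity: float, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity          # start full
        self._updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        elapsed = now - self._updated
        self._tokens = min(self.capacity,
                           self._tokens + elapsed * self.refill_per_second)
        self._updated = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class PerUserLimiter:
    """One bucket per (user, channel) pair, created lazily."""

    LIMITS_PER_HOUR = {"sms": 5, "push": 20}   # values from the text

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._buckets: Dict[Tuple[str, str], TokenBucket] = {}

    def allow(self, user_id: str, channel: str) -> bool:
        key = (user_id, channel)
        bucket = self._buckets.get(key)
        if bucket is None:
            limit = self.LIMITS_PER_HOUR[channel]
            bucket = TokenBucket(limit, limit / 3600.0, self._clock)
            self._buckets[key] = bucket
        return bucket.allow()
```

One property worth knowing: a full bucket plus an hour of refill can admit up to twice the nominal limit within a single hour. If the limit must be strict, a sliding-window counter is the usual alternative.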
Why use three separate Kafka topics for priority routing instead of a single topic with a `priority: 'high'` field?
Delivery analytics
Sending a notification is half the job. Knowing what happened next is the other half. Notification analytics answers: what percentage of push gets opened, which channel has the highest delivery rate, for which user segments does SMS work better than email. Without this you cannot optimize either cost or UX.
Key metrics
- **Delivery Rate**: percentage of notifications confirmed by the provider (APNs returned 200, not 'device delivered'). Norm: Push 85 to 90%, Email 95 to 98%, SMS 99%+
- **Open Rate**: percentage of notifications the user clicked. Norm: Push 5 to 10%, Email 20 to 30% (varies by category)
- **Opt-out Rate**: percentage of users who turned off a channel after N notifications. Growth above 2%/month signals spam
- **P99 Delivery Latency**: time from event to delivery. Critical: under 1 sec, High: under 5 sec, Low: under 60 sec
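These metrics reduce to a few ratios and one percentile over lifecycle events. The sketch below assumes status counts are already aggregated and that open rate is measured against delivered notifications; the source does not fix the denominator, so treat that as an assumption. The percentile uses the nearest-rank method.

```python
from typing import Dict, Sequence


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def funnel_rates(status_counts: Dict[str, int]) -> Dict[str, float]:
    """Delivery and open rates from per-status counts."""
    sent = status_counts.get("sent", 0)
    delivered = status_counts.get("delivered", 0)
    opened = status_counts.get("opened", 0)
    return {
        "delivery_rate": _ratio(delivered, sent),
        "open_rate": _ratio(opened, delivered),   # assumed denominator
    }


def p99(latencies_s: Sequence[float]) -> float:
    """99th percentile by nearest rank: smallest value covering 99% of samples."""
    if not latencies_s:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_s)
    rank = -(-99 * len(ordered) // 100)   # ceil(0.99 * n) in integer math
    return ordered[rank - 1]
```

For example, 1,000 sent, 880 delivered and 44 opened gives a delivery rate of 0.88 and an open rate of 0.05.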
Airbnb builds analytics on **ClickHouse**. Every lifecycle event of a notification (queued, sent, delivered, opened, failed) is written to an append-only table. The dashboard shows delivery rate in real time. When Apple tightened APNs rules in 2022, Airbnb saw delivery rate drop from 88% to 71% in 10 minutes and rolled back the change before mass user impact.
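A drop like 88% to 71% is easy to catch automatically once the rate is computed per time window. The check below is a hypothetical sketch: the 5-point threshold and the minimum-traffic guard are illustrative assumptions, not values from Airbnb.

```python
from typing import Tuple

Window = Tuple[int, int]   # (sent, delivered) counted over one time window


def window_rate(window: Window) -> float:
    sent, delivered = window
    return delivered / sent if sent else 0.0


def should_alert(previous: Window, current: Window,
                 max_drop: float = 0.05, min_sent: int = 1000) -> bool:
    """Alert when delivery rate fell by more than max_drop (absolute)."""
    if current[0] < min_sent:
        return False   # too little traffic to trust the ratio
    return window_rate(previous) - window_rate(current) > max_drop
```

With two consecutive 10-minute windows of 10,000 sends each, a fall from 8,800 to 7,100 delivered triggers the alert, while a fall to 8,700 does not.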
Misconception: It is enough to log the fact of sending. If the provider returned 200, the user received it
Reality: A 200 from APNs/FCM means 'accepted into the queue', not 'delivered to the device'. Delivery rate and open rate are separate metrics that require callback mechanisms and client-side tracking
APNs returns 200 even if the device is off or the app is uninstalled. Real delivery is confirmed only by the feedback service or a read receipt from the client. Facebook discovered 12% of users have stale push tokens. Without analytics they would be sending millions of 'successful' notifications into the void.
Why use ClickHouse for notification analytics rather than the same PostgreSQL that holds the preference store?
Takeaways
- Notification Platform = Event Ingestion + stateless Notification Service + isolated Channel Adapters + Preference Store + Dedup Cache. Centralization removes duplicated logic and gives unified analytics
- Each channel (Push, Email, SMS, In-App) lives in a separate adapter with its own retry strategy: SMS uses attempts=1 (the provider retries on its own), Push uses exponential backoff up to 3 attempts
- Prioritization runs through separate Kafka topics per priority, not through a field in one topic. Kafka's per-partition sequential order does not let you skip ahead to an important message in the middle of the queue
- Analytics is built on event sourcing: every state (queued, sent, delivered, opened, failed) is written to ClickHouse. Delivery rate is not the same as sent rate. Real delivery is confirmed only by feedback/read receipt
Related topics
Notification Platform leans on several core distributed-systems patterns:
- Message Queues & Kafka — The Event Ingestion Layer and priority queues are built on Kafka. Understanding partitions, offsets and consumer groups is a required base
- Rate Limiting (Token Bucket) — Per-user per-channel rate limiting is implemented through Token Bucket in Redis. The same algorithm is used in API Gateway and DDoS protection
- Idempotency & Deduplication — Dedup Cache guards against duplicates under Kafka's at-least-once delivery. The idempotency key pattern applies to every distributed system with retries
Questions for reflection
- How does the architecture change if you need to support scheduled notifications (send in 24 hours)? Which component gets added and how does it interact with Kafka?
- Under peak load (Black Friday at an e-commerce site) the low-priority queue grows faster than it drains. Which strategies help: shed load, adaptive throttling or autoscaling workers? What are the trade-offs of each?
- A user complains: 'I get the same notification twice'. Dedup Cache is in place. What are the three most likely causes of duplication and how do you diagnose them?