Real-Time Backend

Design: Notification Platform

One Facebook notification service outage in 2021 delayed 500 million messages by 40 minutes. How do you design a system that has no right to fall over?

Slack sends more than 500M notifications a day across Push, Email and In-App, all going through a single Notification Platform with centralized routing and deduplication
Uber saves millions of dollars a year by routing non-critical notifications from SMS to Push. A unified platform lets the channel change without touching product services
Airbnb spotted APNs delivery rate dropping from 88% to 71% in 10 minutes thanks to real-time ClickHouse analytics and rolled back the change before mass user impact
Facebook split the notification queue into three priorities: critical security events are delivered in under 1 sec even when the low-priority queue is jammed with millions of social alerts

Notification platform architecture

A Notification Platform is not just 'send an email'. It is a distributed system that takes in events from dozens of services every second, routes them across channels, deduplicates, rate-limits and guarantees delivery. Slack handles more than **500 million notifications a day**, and none of them can be lost or duplicated.

Key components

**Event Ingestion Layer**: takes events from product services through Kafka/Pub-Sub, buffers load spikes
**Notification Service**: stateless workers that apply business logic: who, which channel, when
**Channel Adapters**: isolated adapters per channel (Push, Email, SMS, In-App), each with its own retry strategy
**Preference Store**: a Redis cache on top of PostgreSQL that holds user settings per channel
**Dedup Cache**: Redis keys written with `SET ... NX` and a TTL, so the same notification is not sent twice on retry
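
The Dedup Cache check can be sketched as follows. This is a minimal in-memory stand-in, not a Redis client: in a real deployment the same check would typically be a single atomic set-if-absent write with an expiry. The `DedupCache` class, the `dedup_key` layout and the TTL are assumptions made for illustration.

```python
import time
from typing import Callable, Dict


class DedupCache:
    """In-memory stand-in for a Redis-backed dedup cache (illustrative only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def first_seen(self, key: str) -> bool:
        """Return True only for the first caller within the TTL window.

        Mirrors the semantics of an atomic "set if absent, with expiry" write.
        """
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # duplicate: the key is still live
        self._expires_at[key] = now + self._ttl
        return True


def dedup_key(event_id: str, user_id: str, channel: str) -> str:
    # Hypothetical key layout: one entry per (event, recipient, channel).
    return f"dedup:{event_id}:{user_id}:{channel}"
```

A worker would call `first_seen(dedup_key(...))` before handing the notification to a channel adapter and skip the send when it returns `False`. The TTL only needs to cover the window in which a retry or a Kafka redelivery can plausibly arrive.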

Airbnb moved to a dedicated Notification Service in 2019. Before that, each of 200+ microservices called SendGrid on its own. The result: 8 different templates for 'Booking confirmed', no unified analytics, no way to disable a channel globally. Centralization removed all of this in one architectural iteration.

Why does a Notification Platform need a separate Dedup Cache if Kafka guarantees at-least-once delivery?

Multi-channel delivery

Each channel has its own guarantees, limits and cost. A push notification through APNs costs $0, while an SMS through Twilio costs $0.0075 per message. Uber spends **$15M to $20M a year** on SMS verification alone, so every channel has to be isolated and optimized on its own.
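
A back-of-envelope calculation shows what these figures imply. It assumes a flat $0.0075 per message, which ignores per-country pricing and volume discounts; the function names are hypothetical.

```python
SMS_PRICE_USD = 0.0075  # per-message SMS price quoted in the text


def sms_volume(annual_spend_usd: float, price_usd: float = SMS_PRICE_USD) -> int:
    """Messages per year implied by an annual spend at a flat per-message price."""
    return round(annual_spend_usd / price_usd)


def savings_usd(messages_moved: int, price_usd: float = SMS_PRICE_USD) -> float:
    """Spend avoided when SMS messages are rerouted to a free channel such as push."""
    return messages_moved * price_usd
```

Under that assumption, `sms_volume(15_000_000)` comes to about 2 billion messages a year, and moving 100 million non-critical messages from SMS to push saves roughly $750,000.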

Channel characteristics

**Push (APNs/FCM)**: latency under 1 sec, delivery rate around 85% (offline devices), free, fits real-time alerts
**Email (SendGrid/SES)**: latency 1 to 60 sec, delivery rate around 98%, cost $0.0001/email, fits digests and transactional messages
**SMS (Twilio/Vonage)**: latency under 5 sec, delivery rate around 99%, $0.0075/SMS, only for critical (OTP, safety alerts)
**In-App**: delivery rate 100% (user is online), zero cost, but only works when the app is open
**WebPush**: works in the browser when the tab is closed, delivery rate around 60%, free
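
Because each adapter is isolated, each can carry its own retry policy. The sketch below is one possible shape for that. The Push and SMS attempt counts follow this lesson (exponential backoff up to 3 attempts for Push, `attempts=1` for SMS); the Email entry, the delay values and all names are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RetryPolicy:
    attempts: int              # total tries, including the first one
    base_delay_s: float = 0.0  # wait before the second try
    factor: float = 2.0        # multiplier applied to each later wait

    def delays(self) -> List[float]:
        """Waits between consecutive attempts (exponential backoff)."""
        return [self.base_delay_s * self.factor ** i
                for i in range(self.attempts - 1)]


# Push and SMS attempt counts follow the lesson; Email and all delays are assumed.
POLICIES = {
    "push": RetryPolicy(attempts=3, base_delay_s=0.5),
    "sms": RetryPolicy(attempts=1),
    "email": RetryPolicy(attempts=3, base_delay_s=2.0),
}


def send_with_retry(send: Callable[[], bool], policy: RetryPolicy,
                    sleep: Callable[[float], None]) -> bool:
    """Call `send` until it succeeds or the policy's attempts are exhausted."""
    delays = policy.delays()
    for attempt in range(policy.attempts):
        if send():
            return True
        if attempt < len(delays):
            sleep(delays[attempt])  # back off before the next try
    return False
```

Passing `sleep` in as a parameter keeps the helper testable. In production it would be `time.sleep` or, more likely, a delayed re-enqueue so the worker is not blocked.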

Slack uses a **fanout strategy**: one 'new message in channel' event explodes into N notifications for every participant, each routed to the right channel adapter. With 10 million users online at the same time that produces billions of fanout operations an hour, which is why fanout is broken out into a separate, horizontally scalable service.
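
The fanout step can be sketched in a few lines. This is an illustration under simplifying assumptions, not Slack's implementation: the `Notification` record, the per-user `preferred_channel` map and the rule that the sender is skipped are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List


@dataclass(frozen=True)
class Notification:
    event_id: str
    user_id: str
    channel: str


def fan_out(event_id: str, sender_id: str, member_ids: Iterable[str],
            preferred_channel: Dict[str, str],
            default_channel: str = "push") -> Iterator[Notification]:
    """Expand one 'new message' event into one notification per recipient."""
    for user_id in member_ids:
        if user_id == sender_id:
            continue  # the author is not notified about their own message
        channel = preferred_channel.get(user_id, default_channel)
        yield Notification(event_id, user_id, channel)


def group_by_channel(notifications: Iterable[Notification]) -> Dict[str, List[Notification]]:
    """Bucket notifications so each batch can go to one channel adapter."""
    batches: Dict[str, List[Notification]] = {}
    for notification in notifications:
        batches.setdefault(notification.channel, []).append(notification)
    return batches
```

Writing `fan_out` as a generator keeps memory flat for very large channels: the service can stream notifications to the adapters instead of materializing all N at once.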

Why does the SMS adapter use a retryPolicy with attempts=1 instead of exponential backoff like Push?

Prioritization and rate limiting

Without prioritization a push about a 'photo like' clogs the queue and delays a 'transaction declined' alert. Facebook split notifications into **three classes**: critical (payments, security), high (direct messages), low (social activity). Under peak load the low class is throttled to zero.

Token Bucket for per-user rate limiting

Priority queues are implemented through **separate Kafka topics** per priority, not through one topic with a priority field. Kafka does not support per-message priority: within a partition, the consumer reads offsets sequentially. There are three topics: `notif.critical`, `notif.high` and `notif.low`. Consumers on the critical topic get more worker threads.
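
The routing idea can be illustrated without a broker. In the sketch below three in-memory queues stand in for the three topics, and a single strict-priority `poll` stands in for the separate consumer pools. The `PriorityRouter` class and the `shed_low` switch are assumptions, and no Kafka client API is used.

```python
from collections import deque
from typing import Deque, Dict, Optional, Tuple

TOPICS = ("notif.critical", "notif.high", "notif.low")  # highest priority first
TOPIC_BY_PRIORITY = {"critical": TOPICS[0], "high": TOPICS[1], "low": TOPICS[2]}


class PriorityRouter:
    """In-memory stand-in for three Kafka topics (no real broker involved)."""

    def __init__(self) -> None:
        self.queues: Dict[str, Deque[str]] = {topic: deque() for topic in TOPICS}

    def publish(self, priority: str, message: str) -> str:
        """Append the message to the topic for its priority class."""
        topic = TOPIC_BY_PRIORITY[priority]
        self.queues[topic].append(message)
        return topic

    def poll(self, shed_low: bool = False) -> Optional[Tuple[str, str]]:
        """Return the next (topic, message), preferring higher priorities.

        With shed_low=True the low-priority topic is skipped entirely,
        which models throttling the low class to zero under peak load.
        """
        for topic in TOPICS:
            if shed_low and topic == "notif.low":
                continue
            if self.queues[topic]:
                return topic, self.queues[topic].popleft()
        return None
```

Each queue is still strictly FIFO, just like a partition. What changes is that a critical message never has to wait behind low-priority messages, because it never shares a queue with them.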

**Global per-channel rate limit**: APNs 300 req/s per DeviceToken, FCM 600 req/s per app; protects against blocking by the provider
**Per-user per-channel limit**: no more than 5 SMS/hour, 20 push/hour, protects the user from spam
**Burst allowance**: critical events can use burst capacity (2x the limit for 60 seconds) during security incidents
**Quiet hours**: low/high notifications are not sent from 23:00 to 08:00 in the user's local time zone, critical bypasses this
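
A per-user limit such as "5 SMS/hour" maps naturally onto a token bucket. The following is a minimal single-process sketch; a real platform would keep the bucket state in a shared store such as Redis. The class name, the injected `clock` and the helper `per_user_sms_bucket` are assumptions.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at `refill_per_s`."""

    def __init__(self, capacity: float, refill_per_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self._clock = clock
        self._tokens = capacity  # start full so the first burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Spend `cost` tokens if available; otherwise reject without blocking."""
        now = self._clock()
        elapsed = now - self._last
        self._last = now
        # Refill for the elapsed time, never exceeding the capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_per_s)
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


def per_user_sms_bucket(clock: Callable[[], float] = time.monotonic) -> TokenBucket:
    # 5 SMS/hour from the list above: up to 5 at once, then one token every 12 minutes.
    return TokenBucket(capacity=5, refill_per_s=5 / 3600, clock=clock)
```

A token bucket bounds the long-run rate but permits short bursts up to its capacity, so a user who has been idle can receive five messages back to back. A burst allowance for critical events could be modelled as a temporarily larger `capacity`.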

Why use three separate Kafka topics for priority routing instead of a single topic with a `priority: 'high'` field?

Delivery analytics

Sending a notification is half the job. Knowing what happened next is the other half. Notification analytics answers: what percentage of push gets opened, which channel has the highest delivery rate, for which user segments does SMS work better than email. Without this you cannot optimize either cost or UX.

Key metrics

**Delivery Rate**: percentage of notifications confirmed by the provider (APNs returned 200, not 'device delivered'). Norm: Push 85 to 90%, Email 95 to 98%, SMS 99%+
**Open Rate**: percentage of notifications the user clicked. Norm: Push 5 to 10%, Email 20 to 30% (varies by category)
**Opt-out Rate**: percentage of users who turned off a channel after N notifications. Growth above 2%/month signals spam
**P99 Delivery Latency**: time from event to delivery. Critical: under 1 sec, High: under 5 sec, Low: under 60 sec
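
These metrics can be computed from raw lifecycle events roughly as follows. The sketch makes two assumptions: each event is a `(notification_id, state, latency)` tuple, and open rate is measured against delivered notifications, a denominator the definition above leaves open. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

# One row per lifecycle event: (notification_id, state, seconds_since_event_created)
Event = Tuple[str, str, float]


def rates(events: Iterable[Event]) -> Dict[str, float]:
    """Delivery rate = delivered / sent; open rate = opened / delivered."""
    ids: Dict[str, Set[str]] = {"sent": set(), "delivered": set(), "opened": set()}
    for notification_id, state, _ in events:
        if state in ids:
            ids[state].add(notification_id)  # sets make repeated events harmless
    sent = len(ids["sent"])
    delivered = len(ids["delivered"])
    opened = len(ids["opened"])
    return {
        "delivery_rate": delivered / sent if sent else 0.0,
        "open_rate": opened / delivered if delivered else 0.0,
    }


def p99(latencies_s: List[float]) -> float:
    """Nearest-rank 99th percentile of delivery latencies."""
    if not latencies_s:
        raise ValueError("no latencies")
    ordered = sorted(latencies_s)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n), 1-based
    return ordered[rank - 1]
```

Counting distinct notification ids rather than rows keeps the rates stable when the same lifecycle event is written more than once.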

Airbnb builds analytics on **ClickHouse**. Every lifecycle event of a notification (queued, sent, delivered, opened, failed) is written to an append-only table. The dashboard shows delivery rate in real time. When Apple tightened APNs rules in 2022, Airbnb saw delivery rate drop from 88% to 71% in 10 minutes and rolled back the change before mass user impact.
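
The kind of check that catches such a drop can be sketched as a sliding-window monitor. This is an illustration of the idea, not Airbnb's pipeline, and it does not query ClickHouse: samples are fed in directly, and the window length, baseline and tolerated drop are assumed values.

```python
from collections import deque
from typing import Deque, Optional, Tuple


class DeliveryRateMonitor:
    """Tracks delivery rate over a sliding time window and flags sharp drops."""

    def __init__(self, window_s: float, baseline: float, max_drop: float) -> None:
        self.window_s = window_s
        self.baseline = baseline  # expected delivery rate, e.g. 0.88
        self.max_drop = max_drop  # tolerated absolute drop, e.g. 0.10
        self._samples: Deque[Tuple[float, bool]] = deque()
        self._delivered = 0

    def record(self, ts: float, delivered: bool) -> None:
        """Add one outcome and evict samples older than the window."""
        self._samples.append((ts, delivered))
        self._delivered += int(delivered)
        while self._samples and self._samples[0][0] <= ts - self.window_s:
            _, was_delivered = self._samples.popleft()
            self._delivered -= int(was_delivered)

    def rate(self) -> Optional[float]:
        """Delivery rate inside the current window, or None if it is empty."""
        if not self._samples:
            return None
        return self._delivered / len(self._samples)

    def alert(self) -> bool:
        """True when the windowed rate falls below baseline minus max_drop."""
        current = self.rate()
        return current is not None and current < self.baseline - self.max_drop
```

With `baseline=0.88` and `max_drop=0.10`, a windowed rate of 0.71 triggers the alert while normal fluctuation around 0.88 does not.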

Common misconception: it is enough to log the fact of sending. If the provider returned 200, the user received it.

In reality, a 200 from APNs/FCM means 'accepted into the queue', not 'delivered to the device'. Delivery rate and open rate are separate metrics that require callback mechanisms and client-side tracking.

APNs returns 200 even if the device is off, and can keep returning it for a while after the app is uninstalled, until it starts rejecting the token as unregistered (HTTP 410). Real delivery is confirmed only by provider feedback or a read receipt from the client. Facebook discovered that 12% of users have stale push tokens. Without analytics they would be sending millions of 'successful' notifications into the void.
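
One way to keep 'sent' and 'delivered' apart is to model the lifecycle as explicit states, where a provider 200 can only ever move a notification to `sent`. The sketch below assumes the five lifecycle states named in this lesson. The transition table, the `LifecycleTracker` class and the client receipt callback are hypothetical.

```python
from enum import Enum
from typing import Dict, Set


class State(Enum):
    QUEUED = "queued"
    SENT = "sent"            # provider accepted the request (HTTP 200)
    DELIVERED = "delivered"  # confirmed by the client, not by the provider
    OPENED = "opened"
    FAILED = "failed"


# Allowed forward transitions in the lifecycle (an assumed, simplified model).
_NEXT: Dict[State, Set[State]] = {
    State.QUEUED: {State.SENT, State.FAILED},
    State.SENT: {State.DELIVERED, State.FAILED},
    State.DELIVERED: {State.OPENED},
    State.OPENED: set(),
    State.FAILED: set(),
}


class LifecycleTracker:
    def __init__(self) -> None:
        self.states: Dict[str, State] = {}

    def enqueue(self, notification_id: str) -> None:
        self.states[notification_id] = State.QUEUED

    def _move(self, notification_id: str, target: State) -> bool:
        current = self.states.get(notification_id)
        if current is None or target not in _NEXT[current]:
            return False  # unknown id or out-of-order event: ignore it
        self.states[notification_id] = target
        return True

    def on_provider_response(self, notification_id: str, status: int) -> bool:
        # A 200 only means "accepted by the provider", so it maps to SENT.
        target = State.SENT if status == 200 else State.FAILED
        return self._move(notification_id, target)

    def on_client_receipt(self, notification_id: str) -> bool:
        # Hypothetical callback fired by the app when the notification arrives.
        return self._move(notification_id, State.DELIVERED)

    def on_open(self, notification_id: str) -> bool:
        return self._move(notification_id, State.OPENED)
```

Notifications that stay in `sent` and never reach `delivered` are the ones worth investigating: a growing share of them for one user or device is a hint that the push token has gone stale.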

Why use ClickHouse for notification analytics rather than the same PostgreSQL that holds the preference store?

Takeaways

Notification Platform = Event Ingestion + stateless Notification Service + isolated Channel Adapters + Preference Store + Dedup Cache. Centralization removes duplicated logic and gives unified analytics
Each channel (Push, Email, SMS, In-App) lives in a separate adapter with its own retry strategy: SMS uses attempts=1 (the provider retries on its own), Push uses exponential backoff up to 3 attempts
Prioritization runs through separate Kafka topics per priority, not through a field in one topic. Kafka's per-partition sequential order does not let you skip ahead to an important message in the middle of the queue
Analytics is built on event sourcing: every state (queued, sent, delivered, opened, failed) is written to ClickHouse. Delivery rate is not the same as sent rate. Real delivery is confirmed only by feedback/read receipt

Questions for reflection

How does the architecture change if you need to support scheduled notifications (send in 24 hours)? Which component gets added and how does it interact with Kafka?
Under peak load (Black Friday at an e-commerce site) the low-priority queue grows faster than it drains. Which strategies help: shed load, adaptive throttling or autoscaling workers? What are the trade-offs of each?
A user complains: 'I get the same notification twice'. Dedup Cache is in place. What are the three most likely causes of duplication and how do you diagnose them?

Related lessons

sd-01-intro
