Orchestration
A user does not receive the SMS with their payment confirmation code. They retry three times and the code never arrives, so the bank loses the transaction. The cause: the push channel was blocked by the system, and the SMS provider had queued the message behind a marketing blast. Orchestration is what prevents this.
- **PagerDuty:** builds a 4-level escalation chain (push -> email -> SMS -> phone call) with timeouts of 30 sec, 5 min, and 5 min between levels. Handles >100M notifications per month for DevOps teams worldwide.
- **Airbnb:** stores channel preferences in PostgreSQL, caches them in Redis (TTL 5 min). At 150M users, direct DB calls per notification would create unmanageable load.
- **Duolingo:** personalizes push delivery time based on historical user activity. Open rate is up 3x vs a fixed schedule.
- **Stripe:** for payment notifications uses push + email simultaneously without fallback - both channels matter. SMS only for 2FA OTP. The split by category saves $2-5M per year in SMS costs.
Multi-channel Delivery
Modern apps deliver notifications through several channels in parallel: push, email, SMS, in-app, Slack, WhatsApp. The challenge is not to use every channel but to pick the right channel for each event and each user. Sending an SMS about a like is irritating. Failing to send an SMS about a critical payment error costs you the user's trust.
Twilio processes >1B SMS per month. Their internal stats: SMS in India costs $0.014, in the US $0.0079, in Germany $0.08. Channel choice directly drives product unit economics. Duolingo moved from SMS notifications to push and saved $2M per year.
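To make the unit-economics point concrete, the sketch below multiplies the per-message prices quoted above by a monthly volume. The volume of one million messages and the name `monthly_sms_cost` are illustrative assumptions, not figures or APIs from Twilio.

```python
from decimal import Decimal

# Per-message SMS prices in USD, taken from the figures quoted in the text.
SMS_PRICE_USD = {
    "US": Decimal("0.0079"),
    "IN": Decimal("0.014"),
    "DE": Decimal("0.08"),
}


def monthly_sms_cost(country: str, messages_per_month: int) -> Decimal:
    """Return the monthly SMS spend in USD for one country."""
    return SMS_PRICE_USD[country] * messages_per_month


if __name__ == "__main__":
    # Hypothetical volume: one million messages per month in each country.
    for country in SMS_PRICE_USD:
        print(country, monthly_sms_cost(country, 1_000_000))
```

At that assumed volume the same traffic costs about $7,900 in the US and $80,000 in Germany, roughly a tenfold gap for an identical message.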
The orchestration layer decides on the channel based on: event type (transactional vs marketing), user preferences, current status (online/offline), and channel engagement history (have they opened an email in the last 30 days).
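The four inputs listed above can be combined in a small routing function. This is a minimal sketch, not any particular vendor's logic: the `UserState` shape, the `DEFAULT_ORDER` table, and the demotion rules are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class UserState:
    # Channels the user allows, keyed by event type (hypothetical shape).
    allowed: dict[str, set[str]]
    online: bool = False
    opened_email_last_30_days: bool = False


# Hypothetical default channel order per event type; a real product would tune this.
DEFAULT_ORDER = {
    "transactional": ["push", "email", "sms"],
    "marketing": ["email", "push"],
}


def pick_channels(event_type: str, user: UserState) -> list[str]:
    """Return candidate channels in priority order for one event and one user."""
    # Start from the event-type default and keep only channels the user allows.
    candidates = [
        ch for ch in DEFAULT_ORDER.get(event_type, [])
        if ch in user.allowed.get(event_type, set())
    ]
    # An offline user is unlikely to see a push right away, so demote it.
    if not user.online and "push" in candidates:
        candidates.remove("push")
        candidates.append("push")
    # Demote email for users who have not opened one in the last 30 days.
    if not user.opened_email_last_30_days and "email" in candidates:
        candidates.remove("email")
        candidates.append("email")
    return candidates
```

Returning an ordered list rather than a single channel keeps the decision separate from delivery: whatever sends the message can walk the list in order.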
A user just made a $500 payment and is waiting for confirmation. Which channel do you pick first?
User Channel Preferences
Preferences are not just 'turn email on/off'. They are a two-dimensional matrix: notification type x channel. A user might want payment notifications via email, mentions via push, and marketing nowhere at all. Storing and applying those settings at scale is non-trivial.
Airbnb stores notification preferences in PostgreSQL but caches them in Redis with a 5-minute TTL. At 150M users, a direct DB lookup per notification would generate 10-20k QPS from this one read path alone. The cache cuts that load by 95%+ at the cost of a short delay before a settings change takes effect.
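The caching pattern described here is a read-through cache with a TTL. The sketch below uses an in-memory dictionary as a stand-in for Redis so that it runs without dependencies; `PreferenceCache`, `load_from_db`, and the injectable `clock` are hypothetical names, and the 300-second default mirrors the 5-minute TTL mentioned above.

```python
import time
from typing import Callable

Preferences = dict[str, set[str]]  # notification type -> enabled channels


class PreferenceCache:
    """Read-through cache with a TTL and explicit invalidation."""

    def __init__(self, load_from_db: Callable[[str], Preferences],
                 ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._load = load_from_db
        self._ttl = ttl_seconds
        self._clock = clock
        # user_id -> (expiry timestamp, cached preferences)
        self._entries: dict[str, tuple[float, Preferences]] = {}

    def get(self, user_id: str) -> Preferences:
        entry = self._entries.get(user_id)
        if entry is not None and self._clock() < entry[0]:
            return entry[1]  # fresh cache hit, no DB call
        prefs = self._load(user_id)  # miss or expired: go to the DB
        self._entries[user_id] = (self._clock() + self._ttl, prefs)
        return prefs

    def invalidate(self, user_id: str) -> None:
        """Drop the cached entry; call this when the user saves new settings."""
        self._entries.pop(user_id, None)
```

Without the `invalidate` call, a settings change can stay invisible for up to one full TTL, which is the trade-off the paragraph above describes.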
A user changed their settings to disable marketing emails. Two minutes later they get a marketing email. What most likely happened?
Quiet Hours and Time Zones
Quiet hours are a window during which the system holds non-urgent notifications. The main challenge: users live in different time zones, and 'do not disturb between 22:00 and 8:00' maps to different UTC ranges for each of them. A push at 3 AM local time is a guaranteed uninstall.
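A quiet-hours check has to convert the current UTC instant into the user's local time and handle a window that wraps past midnight. The following is a minimal sketch under two assumptions: the window is treated as half-open (start inclusive, end exclusive), and `in_quiet_hours` is a hypothetical helper name.

```python
from datetime import datetime, time, timedelta, timezone, tzinfo


def in_quiet_hours(now_utc: datetime, user_tz: tzinfo,
                   start: time = time(22, 0), end: time = time(8, 0)) -> bool:
    """True if the user's local time falls in [start, end).

    now_utc must be timezone-aware; the end bound is exclusive by assumption.
    """
    local = now_utc.astimezone(user_tz).time()
    if start <= end:
        # Window inside a single day, e.g. 13:00-15:00.
        return start <= local < end
    # Window wraps past midnight, e.g. 22:00-08:00.
    return local >= start or local < end


if __name__ == "__main__":
    tokyo = timezone(timedelta(hours=9))  # fixed UTC+9 offset for the demo
    three_am_local = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
    print(in_quiet_hours(three_am_local, tokyo))
```

The demo uses a fixed offset to stay dependency-free. For zones with daylight saving time, passing `zoneinfo.ZoneInfo` objects (Python 3.9+) instead would pick up the correct offset for each date.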
Duolingo analyzed the data: notifications sent during a user's 'prime time' (usually 19:00-21:00 local) have a 3x higher open rate than those sent outside that window. Personalizing delivery time based on historical activity is the next step after baseline quiet hours.
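The source does not describe Duolingo's actual model, so the sketch below shows only the simplest possible heuristic for personalized timing: send at the local hour in which the user has historically opened the most notifications. The function name `best_send_hour` and the default of 19:00 (the start of the window cited above) are assumptions.

```python
from collections import Counter
from datetime import datetime


def best_send_hour(open_times_local: list[datetime], default_hour: int = 19) -> int:
    """Return the local hour (0-23) in which the user opened the most notifications."""
    if not open_times_local:
        return default_hour  # no history yet: fall back to the default hour
    counts = Counter(t.hour for t in open_times_local)
    # Ties resolve to the earliest hour so the result is deterministic.
    return min(counts, key=lambda h: (-counts[h], h))
```

A production system would likely weight recent activity more heavily and still clamp the result to fall outside the user's quiet hours.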
A user in Tokyo (UTC+9) sets quiet hours 22:00-08:00. At 23:00 UTC (08:00 Tokyo) a marketing notification arrives. What should the system do?
Fallback Chain
A fallback chain is a sequence of backup channels. If push is not delivered (user offline, token expired, push disabled), the system moves to the next channel. Without fallback, critical notifications are lost on any primary-channel failure.
FCM returns 'UNREGISTERED' when the user uninstalled the app or reinstalled without preserving the token. Per Firebase data, ~5-15% of active tokens in a database become invalid within 30 days. The system must auto-clean stale tokens on these signals. Otherwise they clutter the DB and skew delivery metrics.
Common misconception: a fallback chain means sending the notification through every channel at once for reliability.
In reality, a fallback chain is a sequence: the next channel is used only if the previous one fails. Sending in parallel through every channel is spam that annoys the user.
The goal is to deliver the notification at least once via any available channel, not to bombard the user across all channels at once. PagerDuty waits 30 seconds after a push before sending an email.
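The sequential behavior, together with stale-token cleanup, can be sketched as follows. This is a simplified illustration: `Result`, `send_with_fallback`, and `drop_token` are hypothetical names, and each sender is assumed to report its outcome synchronously, whereas a real system would wait for a delivery timeout (such as the 30 seconds mentioned above) before moving on.

```python
from enum import Enum
from typing import Callable, Optional


class Result(Enum):
    DELIVERED = "delivered"
    FAILED = "failed"              # transient or unknown failure
    UNREGISTERED = "unregistered"  # token is permanently invalid


Sender = Callable[[str, str], Result]  # (user_id, message) -> Result


def send_with_fallback(user_id: str, message: str,
                       chain: list[tuple[str, Sender]],
                       drop_token: Callable[[str, str], None]) -> Optional[str]:
    """Try channels in order; return the name of the first that delivers, else None."""
    for name, send in chain:
        result = send(user_id, message)
        if result is Result.DELIVERED:
            return name  # stop here: later channels are never contacted
        if result is Result.UNREGISTERED:
            drop_token(user_id, name)  # clean up the stale token, then continue
    return None  # chain exhausted; the caller may route the event to a dead letter queue
```

Because the loop returns on the first success, a user who receives the push never also gets the email or SMS for the same event.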
FCM returned 'UNREGISTERED' for a user's push token. What should the fallback chain do?
Summary
- **The channel is set by the event type**: transactional events go to push + email, 2FA codes to SMS, marketing to email only. SMS cost (about $0.008-0.08 per message, depending on the country) makes it inappropriate for anything but critical messages.
- **Preferences are cached**: direct DB lookups per notification are impractical at millions of users. Redis with a 5-min TTL cuts load by 95%+ but requires instant invalidation when settings change.
- **The fallback chain is sequential, not parallel**: push -> email -> SMS means moving to the next channel only when the previous one fails, not sending through every channel at once.
Related Topics
Notification orchestration relies on several adjacent areas:
- Notification Fan-out — Fan-out (the previous lesson) generates tasks for the orchestrator - millions of events to route across channels
- Caching (Redis) — Preferences and quiet hours live in Redis for fast reads, avoiding a hit to the primary DB on every notification
- Dead Letter Queue — When the fallback chain exhausts every channel, the event lands in the DLQ for later analysis and manual handling
Questions for Reflection
- A user is traveling in a country without internet. Push does not reach them, they do not check email. They need a 2FA code to sign into the corporate system. What should the fallback chain look like?
- How do you pick the optimal TTL for the channel preferences cache? What happens if the TTL is too short? Too long?
- Product wants to add WhatsApp as a notification channel. What changes does the orchestrator need, and which edge cases come up?