Build Meta's Notification Delivery System
scalability distributed-systems performance
System Design Deep Dive
Meta’s Notification Delivery System
Routing billions of alerts across push, in-app, SMS, and email without overwhelming users or drowning in a viral spike
Think of a postal sorting office that handles 10 billion letters a day, each letter addressed to a single person, but every person has a different preference: some want overnight delivery, some refuse mail before 9am, some have banned certain categories of sender entirely. Now imagine that once a month a single viral event - a celebrity post, an election result, a product launch - causes the sorting office to receive ten times its usual volume in a single hour. That is the core problem of building a notification system at Meta’s scale.
Meta’s notification platform routes alerts across four channels: Apple Push Notification service (APNs) for iOS, Firebase Cloud Messaging (FCM) for Android, SMS via telecom gateways, and email. At baseline, the system delivers roughly 10 billion notifications per day - about 115,000 per second. During a viral event, that rate spikes to 10x in minutes: 1.15 million per second. Each notification must respect the recipient’s channel preferences, do-not-disturb windows, and frequency caps - and it must do so with sub-second end-to-end latency for high-priority alerts.
The tensions are real. Fan-out workers must push to millions of users quickly, but doing so naively overwhelms downstream channel providers. Preference lookups must be fast (under 5ms) or they become the bottleneck. Deduplication must prevent the same notification from firing three times due to retry storms, but the dedup window must be short enough not to suppress legitimate repeat events. None of these are independently hard - the challenge is making all of them work together at scale.
Requirements and Constraints
Functional Requirements
- Accept notification events from any internal service (activity feed, payment system, auth service)
- Route each notification to one or more delivery channels based on user preferences and notification type
- Respect per-user channel opt-outs, do-not-disturb windows, and per-channel frequency caps
- Deduplicate notifications so retried events do not produce duplicate alerts
- Track delivery status per notification (sent, delivered, read, failed)
- Support at least 4 channels: push (APNs/FCM), in-app, SMS, email
Non-Functional Requirements
- Baseline throughput: 115,000 notifications/second
- Spike throughput: 1,150,000 notifications/second (10x for up to 1 hour)
- End-to-end latency for P0 notifications: p99 under 500ms
- Preference lookup latency: p99 under 5ms
- Deduplication window: 30 seconds (same event+user key suppressed within window)
- Delivery receipt recording: within 10 seconds of send
- System availability: 99.99% (52 minutes downtime/year)
Constraints
- APNs and FCM impose their own rate limits per app certificate; SMS gateways charge per message and throttle at 100/second per Twilio number
- Push channels are best-effort: APNs/FCM may silently drop notifications for offline devices
- Email has the highest deliverability complexity: SPF/DKIM, bounce handling, spam classification
- Users can change preferences at any time; the preference store must reflect changes within 5 seconds
High-Level Architecture
The system splits into three logical layers: the notification core (ingestion, routing, fan-out), the channel adapters (per-provider delivery), and the feedback layer (delivery receipts, retry, analytics).
Notification Service is the entry point. Internal services publish notification events to a Kafka topic. The notification service consumes these events, applies an initial deduplication check against Redis, and writes the notification record to a persistent store (Cassandra) before handing off to the routing engine. Writing to Cassandra before dispatching is critical - it ensures no notification is lost even if the routing tier crashes.
The Routing Engine looks up user preferences from a Redis-backed preference store, evaluates do-not-disturb windows and frequency caps, and emits one task per channel per user to the appropriate priority queue. Priority queues are implemented in Kafka with separate topics per priority tier (P0 through P3).
Fan-out Workers consume from the priority queues and call channel adapters. For viral events generating millions of notifications for a single notification type, fan-out workers are the horizontal scaling lever - more consumers on the Kafka topic means proportionally more throughput.
Channel Adapters are thin wrappers over external provider APIs. They translate internal notification payloads into provider-specific formats and handle provider-specific error codes. Each adapter reports delivery outcomes back to the feedback layer via an internal event.
The decision to use Kafka as both the ingestion buffer and the priority queue backbone is deliberate. Kafka gives you consumer-lag metrics for free, which is exactly what you need to trigger autoscaling. When consumer lag on the P1 topic crosses 50,000 messages, the autoscaler adds worker pods. Lag falling below 5,000 triggers scale-in. Without Kafka’s visible lag, you would need a separate queue-depth monitoring system.
Notification Routing Engine
The routing engine is the brain of the system. It answers one question for every incoming notification: which channels should deliver this alert to this user, right now?
Every notification event carries a type (like, comment, security alert, marketing), a priority (P0 through P3), and a recipient user ID. The routing engine evaluates four conditions in sequence before dispatching to any channel.
First, it performs a deduplication check. If the same event ID has been seen within the last 30 seconds (checked via Redis SETNX), the notification is dropped silently. This handles the common case of retry storms from upstream services.
Second, it loads the user’s preference record from Redis. The preference record is a compact structure encoding channel opt-ins as a bitmask, per-channel frequency cap counters, and a do-not-disturb schedule stored as a pair of Unix epoch offsets (DND start, DND end, with a repeating weekday bitmask). A preference cache miss falls back to the preference database (MySQL) and re-populates the cache with a 60-second TTL.
Third, it evaluates the DND window. If the current time falls within the user’s DND schedule, P2 and P3 notifications are held in a delayed queue and re-enqueued at the DND end time. P0 (security, OTP) and P1 (social) notifications bypass the DND window.
Fourth, it evaluates frequency caps. Each user has a rolling counter per channel stored in Redis with a 24-hour sliding window. Push notifications cap at 10 per day for marketing types. SMS caps at 3 per week. Email caps at 2 per week for non-transactional notifications. If a cap is reached, the notification is dropped with a counter increment.
Frequency caps must use Redis atomic increment operations (INCR + EXPIRE) rather than read-modify-write patterns. Under high concurrency, a read-then-write approach allows multiple workers to each read a count of 9, each decide “under cap”, and each increment to 10 - sending 3 notifications when only 1 should go through. The Redis INCR command is atomic and solves this in a single round trip.
Preference Store
The preference store holds the per-user configuration that governs routing decisions. It is the most read-heavy component in the system - every notification dispatched requires at least one preference lookup.
The preference store is a two-tier architecture. The hot tier is a Redis cluster with consistent hashing across 16 shards. Each user’s preference record is stored under the key pref:{user_id} as a Redis hash. The hash fields include: channels (bitmask of enabled channels), dnd_start (HH:MM in UTC), dnd_end (HH:MM in UTC), dnd_days (weekday bitmask), and per-channel opt-out flags for notification types.
The cold tier is a MySQL table (user_notification_preferences) that is the source of truth. The routing engine reads from Redis first. On a cache miss, it fetches from MySQL and writes back to Redis with a 300-second TTL. Preference updates (when a user changes settings in the app) write to MySQL first, then invalidate the Redis key. The preference update service also pushes the new value directly to Redis to avoid a cold-miss window.
CREATE TABLE user_notification_preferences (
user_id BIGINT NOT NULL,
channel ENUM('push', 'inapp', 'sms', 'email') NOT NULL,
notification_type VARCHAR(64) NOT NULL,
enabled TINYINT(1) NOT NULL DEFAULT 1,
dnd_start_utc TIME,
dnd_end_utc TIME,
dnd_weekdays TINYINT UNSIGNED DEFAULT 127,
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (user_id, channel, notification_type),
INDEX idx_updated (updated_at)
) ENGINE=InnoDB;
Meta’s actual preference store is backed by TAO (The Associations and Objects), the same graph database used for the social graph. TAO provides sub-millisecond local reads via a layered cache architecture. For teams without TAO, a Redis cluster with careful key design and a MySQL fallback achieves similar performance characteristics at smaller scale.
Fan-Out Workers
Fan-out workers are stateless consumer processes that read notification tasks from priority queues and call channel adapters. Each worker manages a pool of goroutines (or async threads) that handle concurrent adapter calls.
The fan-out worker pool is partitioned by user ID modulo the number of Kafka partitions. A notification for user ID 1234567 always routes to the same Kafka partition, which means it always lands on the same worker process. This is important for ordering guarantees: if two notifications arrive for the same user in sequence, they are dispatched in order rather than racing across different workers.
Each worker maintains a connection pool to each channel adapter. Adapter calls are asynchronous - the worker does not block waiting for APNs to respond. Instead, it writes a “send attempt” record to the delivery tracker and the adapter responds via a callback when the send completes or fails.
import asyncio
from dataclasses import dataclass
from typing import Optional
import time
PRIORITY_QUEUES = {0: "notif-p0", 1: "notif-p1", 2: "notif-p2", 3: "notif-p3"}
MAX_CONCURRENT_PER_WORKER = 256
@dataclass
class NotificationTask:
notification_id: str
user_id: int
channel: str
priority: int
payload: dict
attempt: int = 0
created_at: float = 0.0
class FanoutWorker:
def __init__(self, adapters: dict, delivery_tracker, dedup_store):
self.adapters = adapters
self.tracker = delivery_tracker
self.dedup = dedup_store
self.semaphore = asyncio.Semaphore(MAX_CONCURRENT_PER_WORKER)
async def process(self, task: NotificationTask) -> None:
async with self.semaphore:
dedup_key = f"sent:{task.notification_id}:{task.channel}"
if not await self.dedup.set_if_absent(dedup_key, ttl=60):
return # already dispatched by another worker (duplicate)
adapter = self.adapters.get(task.channel)
if adapter is None:
return
await self.tracker.record_attempt(
task.notification_id, task.channel, task.attempt
)
try:
result = await adapter.send(task.user_id, task.payload)
await self.tracker.record_outcome(
task.notification_id, task.channel, "sent", result.provider_id
)
except Exception as exc:
await self.tracker.record_outcome(
task.notification_id, task.channel, "failed", error=str(exc)
)
if task.attempt < 4:
delay = 2 ** task.attempt # 1s, 2s, 4s, 8s
await self._requeue(task, delay)
async def _requeue(self, task: NotificationTask, delay_seconds: float) -> None:
task.attempt += 1
task.created_at = time.time() + delay_seconds
queue = PRIORITY_QUEUES[task.priority]
# Re-enqueue to delayed queue implementation (Kafka with scheduled delivery)
await self.dedup.remove(f"sent:{task.notification_id}:{task.channel}")
The semaphore limit on concurrent adapter calls per worker is critical. Without it, a slow APNs response during a spike causes the worker’s goroutine pool to fill entirely with blocked APNs calls, starving FCM and SMS deliveries that could succeed instantly. Always use separate connection pools and concurrency limits per channel adapter, not a single shared pool.
Channel Adapters
Each delivery channel has different SLA characteristics, rate limits, and error semantics. Channel adapters normalize these differences so the fan-out worker sees a uniform interface.
APNs (Apple Push Notification service): APNs uses HTTP/2 with persistent connections. Meta maintains a pool of 50 persistent HTTP/2 connections to APNs per region. Each connection supports up to 1,500 concurrent streams, giving a theoretical throughput of 75,000 pushes/second per region. APNs returns a 410 Gone status when a device token is invalid (device uninstalled the app), which triggers a token cleanup job. APNs does not provide real-time delivery receipts - “sent” to APNs means delivered to Apple’s servers, not the device.
FCM (Firebase Cloud Messaging): FCM uses a REST API with token-based authentication refreshed hourly. FCM supports batch send (up to 500 tokens per request), which is used for broadcast notifications. For personalized notifications (different payloads per user), single-send is used. FCM’s InvalidRegistration error code similarly triggers token cleanup.
SMS via Twilio: Twilio enforces rate limits of 100 messages/second per long code number and 400/second per short code. For burst traffic, the adapter distributes send requests across a pool of 20 short codes. SMS is used only for P0 (OTP, security) and P1 (account alerts) notifications due to per-message cost.
Email via Amazon SES: SES enforces sending limits based on account reputation. Transactional email is separated from marketing email into different sending identities (different From addresses and separate SES accounts) to protect the transactional reputation score. Email delivery receipts use SES SNS callbacks for bounces and complaints.
Apple APNs requires HTTP/2 and uses JWT tokens (not device tokens) for server authentication since 2021, rotating every hour. Firebase FCM deprecated the legacy HTTP API in mid-2024 in favor of the v1 API using OAuth 2.0 service account credentials. Any production adapter must handle credential rotation transparently without dropping in-flight sends.
Deduplication Layer
Deduplication prevents the same notification from being delivered multiple times due to upstream retries, consumer group rebalancing, or at-least-once delivery semantics in Kafka.
The dedup layer operates at two points in the pipeline. The first check is at ingestion time: when the notification service receives an event from an upstream service, it computes a dedup key from the event source, event type, actor ID, and target user ID. It attempts a Redis SETNX on dedup:ingest:{hash} with a 30-second TTL. If SETNX returns 0 (key exists), the event is a duplicate and is dropped at the cheapest possible point.
The second check is at dispatch time: before the fan-out worker calls a channel adapter, it sets dedup:send:{notification_id}:{channel} with a 60-second TTL. This guards against two workers racing to send the same notification after a Kafka partition rebalance gives both workers the same unconsumed message.
import hashlib
import redis.asyncio as aioredis
class DeduplicationStore:
def __init__(self, redis_client: aioredis.Redis):
self.redis = redis_client
def _ingest_key(self, event_source: str, event_type: str,
actor_id: int, user_id: int) -> str:
raw = f"{event_source}:{event_type}:{actor_id}:{user_id}"
digest = hashlib.blake2b(raw.encode(), digest_size=16).hexdigest()
return f"dedup:ingest:{digest}"
async def check_and_mark_ingest(self, event_source: str, event_type: str,
actor_id: int, user_id: int,
ttl_seconds: int = 30) -> bool:
"""Returns True if this event is new (should be processed)."""
key = self._ingest_key(event_source, event_type, actor_id, user_id)
result = await self.redis.set(key, "1", nx=True, ex=ttl_seconds)
return result is not None # None means key already existed
async def set_if_absent(self, key: str, ttl: int = 60) -> bool:
"""Returns True if key was newly set (caller should proceed)."""
result = await self.redis.set(key, "1", nx=True, ex=ttl)
return result is not None
async def remove(self, key: str) -> None:
await self.redis.delete(key)
The dedup window size involves a trade-off. A 30-second window prevents duplicates from retried upstream events but means that if a user legitimately receives two “someone liked your post” notifications within 30 seconds (two different people liked the same post), only the first one delivers. For high-velocity events like likes on viral content, this is acceptable. For security alerts (OTP), the dedup key includes a timestamp bucket at 10-second granularity, so OTP notifications are never suppressed if the user requests a new code.
Use a two-level dedup strategy: coarse dedup at ingestion (event hash, 30s TTL) to stop upstream storms early, and fine dedup at dispatch (notification ID per channel, 60s TTL) to handle worker failover. The ingestion-level dedup is cheap (a single Redis key per event) and eliminates the vast majority of duplicates before any preference lookup or database write occurs.
Delivery Confirmation and Feedback
Knowing that a notification was sent to APNs is not the same as knowing it reached the user’s device. The feedback layer tracks notifications through their full lifecycle.
The delivery tracker records four states for each notification-channel pair: pending, sent (submitted to provider), delivered (provider confirmed device receipt where available), and failed. Each state transition is written to a Cassandra table partitioned by (user_id, date) for efficient per-user notification history queries, with a secondary index on notification_id for lookup by ID.
APNs and FCM both support delivery receipts via different mechanisms. APNs does not push receipts; instead, the client app calls a receipt validation API on first open after receiving a notification, and the app SDK reports this back via an internal event. FCM’s Data API (for Android) provides true delivery receipts when devices are online. SMS delivery receipts come via Twilio’s webhook callback (DLR - delivery receipt) within 30 seconds for domestic carriers and up to 72 hours for international.
The retry strategy uses exponential backoff: first retry at 2 seconds, second at 4 seconds, third at 8 seconds, fourth at 16 seconds. After 4 failed attempts, the notification moves to a dead letter topic in Kafka. A separate DLQ consumer aggregates failures and fires an alert if the DLQ depth exceeds 1,000 messages per minute, which signals a systematic adapter failure.
Cassandra is the right choice for the notification history table because the access pattern is almost exclusively: “get all notifications for user X in the last 7 days.” Partitioning by (user_id, date) makes this a single-partition scan. The write pattern (continuous appends with rare updates for state transitions) plays to Cassandra’s strengths.
Spike Handling
When a viral event - a celebrity posting, a major sports result, a breaking news event - causes 10x traffic in seconds, the system must absorb the surge without degrading quality for high-priority notifications.
The Kafka intake layer is the surge absorber. Kafka’s append-only log can absorb millions of messages per second without back-pressuring producers. The key metric is consumer lag - the gap between the last produced offset and the last consumed offset on each priority partition.
The autoscaler monitors consumer lag via a Prometheus metric (kafka_consumer_lag_sum) aggregated by topic. When P1 lag crosses 50,000 messages, the Kubernetes HPA adds fan-out worker pods. Each new pod spins up in under 3 seconds (pre-warmed container images) and immediately starts consuming. The maximum pod count is capped at 50 per priority tier to prevent overwhelming channel adapter rate limits.
Load shedding is the safety valve when scaling alone cannot keep up. At P2/P3 lag above 100,000 messages, the routing engine starts dropping P3 (digest, marketing) notifications entirely rather than queuing them. This is preferable to letting the queue grow unboundedly. P3 notifications that are shed are recorded with a dropped_surge status - the user never receives them, which is acceptable for low-priority content.
Channel adapter rate limits are enforced at the worker level. Each worker maintains a token bucket per channel. APNs tokens refill at 60,000 per second per HTTP/2 connection. If the bucket is empty, the worker parks the task in a 100ms sleep-retry loop rather than returning it to Kafka, which would cause redelivery overhead.
import asyncio
import time
class TokenBucketRateLimiter:
def __init__(self, rate_per_second: float, burst_capacity: float):
self.rate = rate_per_second
self.capacity = burst_capacity
self.tokens = burst_capacity
self.last_refill = time.monotonic()
self._lock = asyncio.Lock()
async def acquire(self, tokens: float = 1.0) -> None:
async with self._lock:
now = time.monotonic()
elapsed = now - self.last_refill
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.rate
)
self.last_refill = now
if self.tokens >= tokens:
self.tokens -= tokens
return
# Not enough tokens - wait for refill
wait_time = (tokens - self.tokens) / self.rate
await asyncio.sleep(wait_time)
await self.acquire(tokens)
# Per-adapter rate limiters
ADAPTER_RATE_LIMITS = {
"apns": TokenBucketRateLimiter(rate_per_second=60_000, burst_capacity=5_000),
"fcm": TokenBucketRateLimiter(rate_per_second=30_000, burst_capacity=3_000),
"sms": TokenBucketRateLimiter(rate_per_second=400, burst_capacity=200),
"email": TokenBucketRateLimiter(rate_per_second=1_000, burst_capacity=500),
}
Twitter (now X) described their notification system as encountering 50x spikes during major live events like the Super Bowl and World Cup finals. Their approach was to pre-provision capacity based on the event calendar and use staged rollout - notifications for the same event were sent in batches of 10% of recipients over 5 minutes rather than all at once, converting a single spike into a manageable ramp.
Data Model
The core tables for the notification system cover: notification records, preference store, and delivery receipts.
-- Core notification record (written at ingestion, before routing)
CREATE TABLE notifications (
notification_id UUID NOT NULL,
user_id BIGINT NOT NULL,
event_source VARCHAR(64) NOT NULL,
notification_type VARCHAR(64) NOT NULL,
priority TINYINT NOT NULL DEFAULT 2,
payload JSON NOT NULL,
dedup_key VARCHAR(128) NOT NULL,
created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
expires_at TIMESTAMP,
PRIMARY KEY (notification_id),
UNIQUE KEY uq_dedup (dedup_key),
INDEX idx_user_created (user_id, created_at)
) ENGINE=InnoDB;
-- Delivery state per notification per channel
CREATE TABLE notification_deliveries (
notification_id UUID NOT NULL,
channel ENUM('push_apns','push_fcm','inapp','sms','email') NOT NULL,
status ENUM('pending','sent','delivered','failed','dropped') NOT NULL DEFAULT 'pending',
attempt_count TINYINT NOT NULL DEFAULT 0,
provider_message_id VARCHAR(256),
sent_at TIMESTAMP,
delivered_at TIMESTAMP,
failed_reason VARCHAR(512),
updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (notification_id, channel),
INDEX idx_status_updated (status, updated_at)
) ENGINE=InnoDB;
-- Per-user device token registry
CREATE TABLE user_device_tokens (
user_id BIGINT NOT NULL,
device_id VARCHAR(128) NOT NULL,
channel ENUM('push_apns','push_fcm') NOT NULL,
token VARCHAR(512) NOT NULL,
platform_version VARCHAR(32),
registered_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
last_seen_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
is_active TINYINT(1) NOT NULL DEFAULT 1,
PRIMARY KEY (user_id, device_id, channel),
UNIQUE KEY uq_token (token),
INDEX idx_active_user (is_active, user_id)
) ENGINE=InnoDB;
-- Frequency cap counters (also stored in Redis for hot path)
CREATE TABLE notification_frequency_caps (
user_id BIGINT NOT NULL,
channel ENUM('push_apns','push_fcm','sms','email') NOT NULL,
notification_type VARCHAR(64) NOT NULL,
window_start TIMESTAMP NOT NULL,
send_count INT NOT NULL DEFAULT 0,
PRIMARY KEY (user_id, channel, notification_type, window_start)
) ENGINE=InnoDB;
Key Algorithms and Protocols
Preference Evaluation runs on every incoming notification and must complete under 5ms. The algorithm checks four conditions in sequence, short-circuiting on the first reason to drop or delay.
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime, time
import time as timem
@dataclass
class UserPreferences:
channel_flags: int # bitmask: bit0=push, bit1=inapp, bit2=sms, bit3=email
type_opt_outs: set # set of notification types opted out
dnd_start: Optional[time] # UTC time
dnd_end: Optional[time] # UTC time
dnd_weekdays: int # bitmask: bit0=Mon, bit6=Sun
push_daily_count: int
sms_weekly_count: int
email_weekly_count: int
PUSH_DAILY_CAP = 10
SMS_WEEKLY_CAP = 3
EMAIL_WEEKLY_CAP = 2
CHANNEL_BIT = {"push": 0, "inapp": 1, "sms": 2, "email": 3}
def evaluate_channel(
prefs: UserPreferences,
channel: str,
notification_type: str,
priority: int,
now_utc: datetime
) -> str:
"""Returns 'send', 'drop', or 'delay'."""
# 1. Channel opt-in check
bit = CHANNEL_BIT.get(channel, -1)
if bit == -1 or not (prefs.channel_flags & (1 << bit)):
return "drop"
# 2. Type opt-out check
if notification_type in prefs.type_opt_outs:
return "drop"
# 3. DND window check (P0/P1 bypass)
if priority >= 2 and prefs.dnd_start and prefs.dnd_end:
weekday_bit = 1 << now_utc.weekday()
if prefs.dnd_weekdays & weekday_bit:
current_t = now_utc.time()
if prefs.dnd_start <= current_t <= prefs.dnd_end:
return "delay"
# 4. Frequency cap check (P0 bypasses)
if priority > 0:
if channel == "push" and prefs.push_daily_count >= PUSH_DAILY_CAP:
return "drop"
if channel == "sms" and prefs.sms_weekly_count >= SMS_WEEKLY_CAP:
return "drop"
if channel == "email" and prefs.email_weekly_count >= EMAIL_WEEKLY_CAP:
return "drop"
return "send"
Deduplication with Redis uses a two-key scheme with atomic SET NX operations.
import hashlib
import redis
def make_ingest_dedup_key(event_source: str, event_type: str,
actor_id: int, user_id: int) -> str:
raw = f"{event_source}|{event_type}|{actor_id}|{user_id}"
digest = hashlib.sha256(raw.encode()).hexdigest()[:24]
return f"d:i:{digest}"
def check_and_mark(r: redis.Redis, key: str, ttl: int) -> bool:
"""Returns True if this is a new (non-duplicate) event."""
# SET key "1" NX EX ttl - atomic, returns None if key already exists
result = r.set(key, "1", nx=True, ex=ttl)
return result is True
# Usage at ingestion
dedup_key = make_ingest_dedup_key("activity", "like", actor_id=9876, user_id=1234)
is_new = check_and_mark(redis_client, dedup_key, ttl=30)
if not is_new:
return # duplicate, drop early
Scaling and Performance
Back-of-envelope estimates anchor the architecture decisions.
Daily volume: 10 billion notifications / 86,400 seconds = 115,740/second baseline.
Viral spike: 10x for 1 hour = 1,157,400/second peak.
Preference store reads: every notification requires 1 Redis lookup. At 115K/sec baseline, that is 115K Redis reads/second. A single Redis shard handles ~100K reads/second, so we need 2 shards at baseline. At 10x spike, we need 12 shards with headroom.
Fan-out workers: each worker handles 256 concurrent adapter calls with an average adapter latency of 200ms. That is 256 / 0.2 = 1,280 completions/second per worker. At 1.15M/sec peak, we need 1,150,000 / 1,280 = ~900 workers. With 50 workers per Kubernetes pod (threading), that is 18 pods at peak.
Cassandra delivery records: each notification generates 2-3 channel delivery records. At 10B notifications/day, that is 25B writes/day = ~290K writes/second. A 6-node Cassandra cluster with replication factor 3 handles 300K writes/second comfortably.
Redis memory for dedup: each dedup key is ~50 bytes key + 1 byte value + overhead = ~100 bytes. At 1.15M events/second with a 30-second TTL, the active key set is 1,150,000 x 30 = 34.5M keys x 100 bytes = 3.45 GB for the ingest dedup tier. A single 8 GB Redis instance handles this.
Kafka storage: at 1.15M events/second, each event ~1KB = 1.15 GB/second. With a 1-hour retention on the priority queues, that is 4.14 TB of Kafka storage. Use log compaction with short retention (1 hour) and S3 archival for audit trails.
Failure Modes and Recovery
| Failure Scenario | Detection | Impact | Recovery |
|---|---|---|---|
| Redis preference cache total failure | Cache miss rate spikes to 100%; p99 latency exceeds 50ms SLO | All preference lookups fall back to MySQL; 20x latency increase, risk of MySQL overload | Pre-warmed MySQL read replicas with connection pooling; circuit breaker triggers “allow all channels” fallback to keep notifications flowing |
| APNs rate limit or outage | HTTP 429 / 5xx from APNs; delivery tracker reports 0 sent in 60s window | iOS push notifications stopped; users with Android/email unaffected | Exponential backoff retry queue for APNs tasks; failover to in-app channel for same notifications if user is online |
| Kafka broker leader election | Consumer lag spikes; fan-out worker group rebalance | 15-30 second gap in processing; potential duplicate delivery after rebalance | Dedup keys at dispatch level absorb duplicates; Kafka controller typically elects new leader within 30 seconds |
| Worker pod crash during delivery | Delivery tracker shows large cohort stuck in pending state for 60s+ | Notifications stuck in in-flight state; will not retry automatically | Sweeper job runs every 60 seconds; re-enqueues any task stuck in pending for more than 90 seconds |
| SMS gateway (Twilio) outage | DLR callbacks stop; SMS delivery rate drops to 0 | P0/P1 SMS not delivered; OTP flow broken for SMS-only users | Fallback to secondary SMS gateway (Vonage) for P0 notifications; alert on-call for P1 failures |
| Preference store MySQL outage | Cache misses return errors; fallback fails | Routing engine cannot evaluate preferences | Stale-on-error strategy: serve last-cached preference value for up to 5 minutes before degrading to “send all channels” mode |
Comparison of Approaches
Fan-out on Write vs Fan-out on Read
Fan-out on write (the approach above) pre-computes which notifications to send at event time. This makes the send path fast but means the fan-out workers must process 10x events during a spike. Fan-out on read would compute routing at delivery time - simpler at write time but requires reading preferences for millions of users on each notification poll. For a notification system where users expect near-real-time delivery, fan-out on write is the right default.
Push vs. Pull notification model for devices
The above architecture uses a push model: the server initiates delivery to the device via APNs/FCM. The alternative is a pull model where the app polls for new notifications. Pull is simpler and avoids maintaining a persistent connection, but introduces latency equal to the poll interval. For social notifications where 1-5 second latency is acceptable, a hybrid is common: push for high-priority, poll every 60 seconds as a fallback.
Single global queue vs. Priority queues
A single queue mixes P0 security alerts with P3 digest emails. During a spike, the queue fills with low-priority content, causing high-priority notifications to wait behind them. Priority queues (separate Kafka topics per tier) ensure P0 always drains first. The cost is operational complexity: 4 topics to monitor instead of 1. The benefit is predictable latency for critical notifications regardless of overall system load.
APNs vs. FCM direct vs. unified push layer
Companies with both iOS and Android users face a choice: call APNs and FCM directly (simpler, more control) or use a unified push service like Firebase Notifications, OneSignal, or a home-built abstraction. Direct calls are lower latency and avoid a third party in the critical path. A unified layer simplifies the adapter code but adds a vendor dependency and one more network hop. For a system at Meta’s scale, direct adapter calls with first-party credential management are preferred.
Key Takeaways
- Kafka is the backbone: it decouples event producers from notification workers, provides natural backpressure via consumer lag metrics, and makes autoscaling decisions mechanical rather than heuristic
- Dedup at two levels: ingest-time dedup (30s window) stops upstream retry storms before any preference lookup; dispatch-time dedup (60s window) handles worker failover races; each layer is cheap and complementary
- Preference lookups must be sub-5ms: the routing engine is on the critical path for every notification; Redis with a MySQL fallback achieves this; MySQL alone does not
- Priority queues are not optional: mixing P0 security alerts with P3 marketing digests in a single queue guarantees OTP delivery failures during traffic spikes
- Load shedding beats infinite queuing: dropping P3 content during a surge is better than a queue that never drains; explicitly record dropped notifications with a
dropped_surgestatus for analytics - APNs sent is not device delivered: track the full lifecycle from provider acknowledgment through client-side open event; pure send-side metrics overstate notification effectiveness by 20-40%
- Channel adapters must be isolated: a slow APNs response must not block FCM sends; use separate connection pools, separate concurrency limits, and separate rate limiters per channel
Frequently Asked Questions
How do you handle a user who uninstalls the app but still has an active APNs token?
APNs returns a 410 Gone status when a send attempt is made to a stale token. The channel adapter records this response and triggers a token invalidation event. A background job reads these events and sets is_active = 0 on the device token record. The routing engine checks is_active before dispatching to push channels. This is eventually consistent - the first notification to a stale token fails and triggers cleanup, subsequent ones skip the channel.
How do you prevent a user from receiving 50 notifications during a viral event (e.g., their post goes viral and 50 people like it in 1 minute)?
The dedup layer handles exact duplicates (same event type and actor), but 50 different people liking the same post generates 50 distinct events. For this case, the frequency cap is the primary control (push capped at 10/day for social notifications). Additionally, the routing engine can apply event aggregation: instead of sending a notification per like, it buffers like events for 30 seconds and sends a single aggregated notification (“UserA and 49 others liked your post”). This aggregation happens inside the routing engine before fan-out.
What happens to notifications for users who are offline for 30 days?
APNs and FCM both have device-side storage with a configurable TTL (default 28 days). If the device comes online before the TTL, it receives the stored notification. Beyond that, the notification is lost at the provider level. For the notification store in Cassandra, records are retained for 90 days for history display, but the delivery status will remain sent with no delivered_at timestamp if the provider expires it.
How do you handle the fan-out for a user with 3 billion followers (hypothetical broadcast)?
At true broadcast scale, per-user fan-out breaks down even with Kafka. The solution shifts from user-centric fan-out to topic subscriptions. Rather than writing to 3 billion individual notification queues, the event is written once to a broadcast topic. Each connected device’s push connection server subscribes to relevant broadcast topics and pushes the notification directly. This is how large-scale broadcast notifications (OS update available, emergency alert) work - they bypass the per-user routing pipeline entirely.
How do you ensure notification order within a channel for the same user?
Kafka partitions by user ID ensure all notifications for a given user route to the same partition and are processed by the same fan-out worker. Within a single worker, notifications are processed from the queue in offset order. Channel adapter calls are async but the delivery tracker records submission time, so the notification history UI can display in creation order regardless of delivery timing.
What is your strategy for GDPR notification deletion requests?
On a user deletion request, the system: (1) sets channel_flags = 0 in the preference store immediately (stops new notifications), (2) drains in-flight notifications by marking all pending delivery records for this user as suppressed, (3) schedules Cassandra records for deletion within 30 days. The dedup keys expire naturally (30-60 second TTL). Device tokens are deleted immediately.
Interview Questions
Design the deduplication layer for a notification system that receives events from 50 upstream microservices, each with at-least-once delivery guarantees.
Expected depth: Candidate should identify that dedup must happen at ingestion before any persistence or fan-out occurs. Should design the dedup key as a hash of (source service, event type, actor, recipient, time bucket) with a TTL. Should distinguish between idempotency (same exact event) and aggregation (similar events within a time window). Should discuss Redis SETNX atomicity and why a database unique constraint is too slow for this path. Strong candidates discuss the trade-off between window size (short = more duplicates survive, long = more legitimate rapid events suppressed) and explore per-notification-type window configurations.
How would you redesign the preference store to handle 1 billion users with sub-5ms lookup latency at 1 million reads per second?
Expected depth: Candidate should immediately identify Redis cluster with consistent hashing as the hot tier. Should size the Redis cluster: 1M reads/sec at 100K/node = 10 nodes minimum. Should discuss cache warming strategy (all users vs. active users only) and memory sizing (if each preference record is 200 bytes, 1 billion users = 200 GB - use active user cache only, evict LRU). Should discuss cache invalidation on preference change (write-through vs. write-around), and the fallback path to MySQL with connection pool sizing. Strong candidates propose a read-through caching pattern with a fallback “default preferences” value to prevent MySQL thundering herd on cache miss storms.
A viral event causes 10x notification traffic for 90 minutes. Walk through your system’s response from first event to full recovery.
Expected depth: Candidate should describe the lag-based autoscaling trigger, the priority queue shed order (P3 first, then P2), and the pre-warmed pod pool for sub-30-second scale-out. Should quantify: at what lag threshold does the autoscaler fire? How long does pod scale-out take vs. how fast is the queue growing? Should discuss provider rate limits as a secondary ceiling (can’t scale past APNs throughput limits), and the recovery path (gradual scale-in after lag normalizes, replaying shed P3 notifications if business requirements demand it).
How would you design the retry and dead letter queue strategy for a notification system where some channels (SMS) are expensive per message and others (push) are free?
Expected depth: Candidate should design per-channel retry policies: push gets 4 retries (cheap), SMS gets 1 retry (expensive), email gets 2 retries. DLQ is per-channel so SMS DLQ failures are immediately alerted (P0 risk, cost impact) while push DLQ failures are batched into a daily report. Should discuss idempotency at retry time (using the dispatch-level dedup key to ensure a retried SMS does not double-send if the original succeeded but the receipt was lost). Strong candidates discuss the difference between provider-confirmed failure (don’t retry, user unreachable) vs. timeout (retry is safe) and how to distinguish them from provider error codes.
Design the notification aggregation system that converts “50 people liked your photo in the last 30 seconds” from 50 individual events into one notification.
Expected depth: Candidate should design an aggregation buffer per (user, notification type, entity) keyed as agg:{user_id}:{type}:{entity_id}. The routing engine writes to this buffer with a 30-second TTL instead of directly dispatching. A separate aggregation worker reads the buffer at TTL expiry, generates a single aggregated payload (“Alice and 49 others liked your photo”), and dispatches it. Should discuss the timing precision trade-off: strict 30-second window means the first event waits up to 30 seconds; sliding window means the user always sees aggregated counts but with variable delay. Should discuss what happens if the user opens the app during the buffer window (should the in-app channel deliver immediately while push waits for aggregation?).
Premium Content
Unlock the full article along with everything else in the archive — all in one place.