Industry-standard async notification pipeline using Kafka for decoupled delivery. NotifyService validates requests, checks user preferences in Redis, renders templates, and publishes to Kafka with priority partitioning. Channel-specific workers deliver via FCM and SendGrid with retry and deduplication.
The async pipeline with priority queue architecture is the industry-standard approach for notification systems handling thousands of notifications per second. It solves the two fundamental problems that make the naive synchronous approach unworkable at scale: thread pool saturation from slow third-party calls and lack of retry coordination for transient failures. This is the design that interviewers at companies like Twilio, SendGrid, Slack, and every major mobile-first platform expect candidates to propose after identifying the synchronous bottleneck.
The core architectural shift is decoupling notification acceptance from delivery. Instead of calling FCM or SendGrid synchronously in the API request path (holding a thread for 200-500ms), the service validates the request, checks user preferences, renders templates, and publishes a notification event to Kafka. The API returns 202 Accepted in approximately 50ms — the caller knows the notification has been accepted for delivery, but delivery itself happens asynchronously in background workers. This separation of concerns — acceptance vs. delivery — is a foundational distributed systems pattern.
This decoupling provides three critical benefits. First, API throughput increases from ~500 RPS to 10,000+ RPS because threads are held for only ~50ms (preference check + template render + Kafka publish) instead of 300ms (third-party call). With 800 threads across 8 pods, the throughput ceiling rises dramatically. Second, retry coordination is centralized in the workers. PushWorker and EmailWorker implement exponential backoff on transient failures (FCM token refresh, SendGrid rate limit) and route permanent failures (invalid token, bounced email) to a dead-letter queue. The caller never needs to implement retry logic — the system guarantees delivery or explicit failure via DLQ. Third, the system is resilient to third-party outages — Kafka buffers notifications during outages (millions of messages with multi-day retention), and workers deliver them when the third party recovers. No notifications are lost.
The priority partitioning design is a critical differentiator for notification systems. Without priority separation, a 10-million-recipient marketing blast fills the Kafka topic for minutes. An auth code (time-critical, single recipient) would wait behind millions of marketing messages — a phenomenon known as head-of-line blocking. Priority partitioning solves this by reserving Kafka partitions 0-3 for transactional notifications (auth codes, payment alerts) with dedicated worker capacity that is never shared with marketing. Bulk marketing notifications go to partitions 4-15 with separate consumer groups. This ensures a 2FA code is delivered within seconds regardless of whether a massive marketing campaign is running concurrently.
The system handles 100 million notifications per day at peak throughput of 10,000 per second. Push delivery via FCM/APN takes approximately 200ms per message; email delivery via SendGrid/SES takes approximately 500ms per batch. User preferences (opt-out, quiet hours, channel selection) are cached in Redis with a 98% hit rate, adding only 2ms to the API path. Templates are rendered at API time, so Kafka messages carry the final rendered content and workers do not need database access. Idempotent delivery via a Redis dedup set (notification_id + recipient, 24-hour TTL) prevents duplicate notifications even if Kafka consumer offsets fail.
This architecture is used by virtually every production notification system at scale. The pattern generalizes beyond notifications to any system that needs to decouple fast API responses from slow downstream processing — email, webhooks, data pipeline ingestion, and event-driven microservices all use the same Kafka-backed async pipeline pattern. Interviewers expect candidates to arrive at this design after identifying the synchronous delivery bottleneck in the naive approach and proposing async processing as the core optimization.
The async pipeline system uses 8 components organized into three layers: an API layer that accepts and validates requests, a Kafka event stream for async decoupling, and channel-specific workers for delivery. The components are: Client, API Gateway, Load Balancer, NotifyService, UserPrefsCache (Redis), TemplateDB (PostgreSQL), NotifyStream (Kafka), PushWorker, and EmailWorker.
All traffic enters through the API Gateway (Amazon API Gateway, REST mode), which authenticates service-to-service JWT tokens (~3ms) and enforces per-service rate limits (15K RPS cap). The Load Balancer (AWS ALB) distributes requests across 8 NotifyService pods using round-robin. The LB supports 25K RPS — 2.5x headroom above the 10K peak, ensuring the load balancer is never the bottleneck.
NotifyService is the orchestration layer running on AWS ECS Fargate (4 vCPU, 8 GB per pod, 8 pods with 100 threads each = 800 concurrent threads). For notification sends, it executes a three-step pipeline: (1) check user preferences in UserPrefsCache (Redis) — 2ms cache hit, 5ms cache miss — to enforce opt-out, quiet hours, and channel selection; (2) fetch and render the notification template from TemplateDB (PostgreSQL) with the caller's variables, substituting {{variable}} placeholders with actual values; (3) publish the rendered notification event to NotifyStream (Kafka) with the appropriate priority partition based on the notification type (transactional vs marketing). The API returns 202 Accepted in approximately 50ms. For status queries, it reads the delivery_log table from PostgreSQL read replicas.
UserPrefsCache (Amazon ElastiCache for Redis, 2 nodes, 13 GB each) stores per-user notification preferences with a key pattern prefs:{user_id}. Each entry contains enabled channels, quiet hours window, and opt-out flag. The 98% cache hit rate keeps the average lookup under 2ms. Preferences change infrequently (days/weeks), so the 1-hour TTL rarely causes cache misses for active users. Cache invalidation happens on explicit preference updates via cache-aside write-through. The Zipfian access pattern (alpha 0.99) reflects that a small set of active users receive most notifications.
TemplateDB (Amazon RDS PostgreSQL, db.r7g.large, 2 read replicas) stores two tables. The templates table is small and read-heavy (~1K rows) with high buffer cache hit rate — fetched on every notification send for template rendering. The delivery_log table is append-only, growing at approximately 10K rows per second at peak traffic. PushWorker and EmailWorker write delivery status (sent, delivered, failed) after each delivery attempt. Two read replicas serve status queries without loading the primary, with ~5ms replication lag providing acceptable eventual consistency for status reads.
NotifyStream (Amazon MSK, kafka.m7g.large) is the critical decoupling layer between API acceptance and delivery. 16 partitions with priority separation: partitions 0-3 are reserved for transactional notifications (auth codes, payment alerts) with dedicated worker capacity that is never shared with marketing workloads. Partitions 4-15 handle bulk marketing notifications with separate consumer groups. Messages carry the fully rendered content (not template references), so workers do not need TemplateDB access. Partitioned by recipient_id for per-user delivery ordering. Message retention is set to 7 days, enabling replay if consumers crash.
PushWorker (10 ECS Fargate instances, 2 vCPU/4 GB each) consumes from Kafka and delivers push notifications via FCM for Android and APN for iOS (~200ms per external call). EmailWorker (4 instances, 1 vCPU/2 GB each) delivers email via SendGrid/SES (~500ms per batch, batching up to 10 emails per API call for efficiency). Both workers implement idempotent delivery using a notification_id + recipient dedup key in Redis with 24-hour TTL, retry with exponential backoff (initial 1s, max 5 minutes, 3 retries), and DLQ for permanent failures such as invalid FCM tokens or bounced emails.
This sequence diagram shows the two-phase flow: (1) the API path where NotifyService validates, checks preferences, renders templates, and publishes to Kafka in ~50ms; and (2) the async delivery path where workers consume from Kafka and deliver via third-party APIs. The critical improvement over the naive approach is that the API thread is freed in ~50ms instead of 300ms.
Step-by-Step Walkthrough
Pseudocode
// API path — returns 202 in ~50ms
async function sendNotification(template_id, recipients, channels, variables, priority):
// 1. Check user preferences
for each recipient in recipients:
prefs = await redis.get("prefs:" + recipient) // ~2ms
if prefs.opt_out: skip
if inQuietHours(prefs): skip
// 2. Render template
template = await db.query("SELECT * FROM templates WHERE template_id = $1", [template_id])
rendered = renderTemplate(template.body_template, variables)
// 3. Publish to Kafka with priority routing
partition = priority == "transactional" ? hashToRange(0, 3) : hashToRange(4, 15)
await kafka.publish("notification-events", {
notification_id: generateUUID(),
recipients, channels, rendered_content: rendered, priority
}, { partition }) // ~5ms
return { status: 202, notification_id }
// Worker path — async delivery with retry
async function deliverPush(event):
if await redis.exists("dedup:" + event.notification_id + ":" + event.recipient_id):
return // Already delivered (idempotent)
result = await fcm.send(event.recipient_id, event.rendered_content) // ~200ms
await redis.setex("dedup:" + event.notification_id + ":" + event.recipient_id, 86400, "1")
await db.execute("INSERT INTO delivery_log (...) VALUES (...)", [...])Choice
Publish notification events to Kafka; deliver asynchronously via background workers
Rationale
Synchronous delivery holds threads for 200-500ms per request, capping throughput at ~666 RPS. Kafka decouples acceptance from delivery: the API publishes in ~5ms and returns 202, while workers deliver in the background. This increases API throughput by 6x and makes the system resilient to third-party outages — Kafka buffers during downtime.
Choice
Kafka partitions 0-3 for transactional, 4-15 for marketing, with dedicated worker capacity
Rationale
A 10M-recipient marketing blast fills the Kafka topic for minutes. Without priority separation, a time-critical auth code waits behind millions of marketing messages. Priority partitions with dedicated worker capacity ensure transactional notifications are delivered within seconds regardless of marketing queue depth.
Choice
Cache user preferences (opt-out, channels, quiet hours) in Redis with 1-hour TTL
Rationale
Every notification requires a preference check at 10K/sec. Redis delivers ~2ms per lookup vs ~10ms from PostgreSQL. The 98% hit rate means only 2% of requests incur the 5ms cache miss penalty. Preferences change infrequently, so the 1-hour TTL rarely causes stale reads.
Choice
PushWorker and EmailWorker as independent consumer groups
Rationale
Push and email have different latency profiles (200ms vs 500ms), failure modes (token expiry vs rate limit), and retry strategies. Separate workers enable independent scaling and isolate channel-specific outages. Push needs more workers (higher volume, lower latency) while email workers can batch.
Choice
Render templates before publishing to Kafka; messages carry final content
Rationale
Rendering at API time means Kafka messages contain the final rendered content, not template references. Workers do not need TemplateDB access, simplifying worker code. Template changes do not affect in-flight notifications. The trade-off is additional CPU load on NotifyService at API time.
Choice
Deduplication key: notification_id + recipient in Redis with 24h TTL
Rationale
Kafka consumer offset commits can fail, causing message replay. Without idempotency, a user receives duplicate notifications. Keying by notification_id + recipient in a Redis dedup set prevents duplicate delivery. The 24h TTL ensures the dedup set does not grow unbounded.
Target RPS
10K+ notifications/sec (API acceptance)
Latency (p99)
<200ms API response (202 Accepted), <5s push, <30s email
Storage
~500 GB/year (PostgreSQL + Kafka retention)
Availability
99.9% (retry + DLQ + replicated components)
Stores notification templates with per-channel message bodies containing {{variable}} placeholders. Read-heavy (fetched on every notification send), write-sparse (~1K rows). High buffer cache hit rate.
Indexes: PRIMARY KEY (template_id)
Small table (~1K rows). Templates are rendered at API time so Kafka messages carry final content.
Append-only log tracking delivery status per notification-recipient-channel. Written by PushWorker and EmailWorker after each delivery attempt. Read for status queries via read replicas. Indexed by notification_id.
Indexes: idx_delivery_notification ON (notification_id), idx_delivery_status ON (status, created_at)
Grows at ~10K rows/sec at peak. 2 read replicas for status queries. Eventual consistency acceptable for status reads.
Carries rendered notification events from NotifyService to channel-specific delivery workers. Priority partitioning: partitions 0-3 for transactional (auth codes, payment alerts), partitions 4-15 for marketing. Partitioned by recipient_id for per-user ordering. ~10K events/sec at peak.
Key Schema
recipient_id: string (partition key for per-user ordering)
Value Schema
{ notification_id: string, recipient_id: string, channel: string, rendered_content: string, priority: string (transactional | marketing) }
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Synchronous Send) | T1 | 350-800ms API response | ~500 RPS (thread pool ceiling) | $300/month (single DB + 4 pods) | Low — no queue, no workers | 98% (no retry, single DB) |
| Async Pipeline (Priority Queue) | T2 | <200ms API (202 Accepted) | 10K+ RPS API, 10K delivery/sec | $2,000/month (Kafka + Redis + workers) | Medium — Kafka, Redis, workers | 99.9% (retry + DLQ + replication) |
| Multi-Channel Fanout | T3 | <200ms API, <200ms in-app | 15K+ RPS API, 50K delivery/sec | $5,000/month (2x Kafka + WebSocket + Redis) | High — 12 components, WebSocket | 99.9% (per-channel retry + DLQ) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
In the synchronous approach, each thread is held for ~300ms (average third-party call). With 200 threads, throughput is 200/0.3 = ~666 RPS. In the async approach, threads are held for ~50ms (preference check + template render + Kafka publish). With the same 200 threads, throughput is 200/0.05 = 4,000 RPS. With 8 pods and 100 threads each (800 threads), throughput reaches 16,000 RPS. The 6x improvement comes from reducing per-request hold time from 300ms to 50ms.
Kafka partitions are consumed independently by consumer groups. Transactional notifications (auth codes, payment alerts) go to partitions 0-3 with dedicated workers that are never assigned to marketing partitions. A 10M marketing blast fills partitions 4-15 but does not affect partitions 0-3. Transactional workers process their partitions at full speed regardless of marketing queue depth. This guarantees sub-5-second delivery for critical notifications.
If Kafka is unavailable, NotifyService fails to publish and returns a 503 to the caller. However, Kafka is designed for 99.99% availability with multi-AZ replication. In practice, Kafka partitions may become temporarily unavailable during leader elections (typically under 5 seconds). NotifyService can buffer events in memory for short outages or fall back to synchronous delivery for transactional notifications during extended outages.
SQS would work for basic async delivery but lacks two features critical for notification systems: (1) partition-based priority separation — SQS has FIFO queues but not partition-level consumer group assignment; (2) message replay — Kafka retains messages for replay if a consumer crashes mid-batch. SQS deletes messages on acknowledgment. For notification systems needing priority and replay, Kafka is the better fit. SQS is simpler for lower-scale systems without priority requirements.
Workers write delivery status (sent, delivered, failed) to the delivery_log table in PostgreSQL after each attempt. The caller can poll GET /api/v1/notifications/{id}/status to check delivery progress. Status queries read from PostgreSQL read replicas to avoid loading the primary. The status transitions from pending (accepted by API) to sent (delivered to third party) to delivered (third-party confirmation) or failed (DLQ after all retries exhausted).
Sign in to join the discussion.
Ready to design your own Notification System?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator