Design a high-throughput email delivery pipeline with per-stage queues, template rendering, suppression list enforcement, and delivery tracking.
Email delivery systems are a staple of system design interviews because they combine asynchronous processing, pipeline architecture, compliance requirements, and scale challenges into a single problem. Interviewers expect candidates to reason about the difference between transactional emails (password resets, order confirmations) and bulk campaigns (marketing newsletters), and how a single architecture can handle both workloads with very different traffic patterns and latency requirements.
At production scale, services like Amazon SES, Mailchimp, and SendGrid process over one billion emails per day. Transactional traffic runs at a steady 12,000 messages per second, while bulk campaign bursts can spike to 165,000 messages per second for 10-minute windows when a large campaign is triggered. The system must absorb these bursts gracefully without dropping messages or degrading transactional email latency, which users expect to arrive within seconds.
The core challenges include building a multi-stage processing pipeline where each stage scales and retries independently, implementing suppression list checks for CAN-SPAM and GDPR compliance, maintaining sender reputation with ISPs through proper bounce and complaint handling, rendering dynamic templates with per-recipient personalization at high throughput, and tracking the delivery lifecycle of every message from submission through delivery or bounce. A poorly designed email system risks blacklisting by major ISPs, which can take weeks or months to recover from.
The architecture implements a multi-stage pipeline with per-stage Kafka queues, enabling each processing step to scale, retry, and apply backpressure independently. The three stages are: submit (API acceptance and validation), render (template hydration with per-recipient variables), and send (suppression check followed by SMTP delivery). This kind of stage separation is a common pattern in large email platforms such as SES, SendGrid, and Mailchimp.
The request flow starts when an application client submits an email or campaign via the REST API. The API Gateway authenticates the request and enforces per-tenant rate limits. The SubmitService validates the payload, generates a unique message ID, returns 202 Accepted immediately (async acknowledgment), and publishes the message to the SubmitStream Kafka topic. For bulk campaigns, the SubmitService fans out the recipient list into individual messages. This decouples API response time from actual delivery, keeping submit latency under 50ms even during campaign bursts.
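The submit path described above can be sketched as follows. This is a minimal illustration, not the platform's actual API: the `submit` function, the payload field names, and the in-memory `submit_stream` deque (standing in for the SubmitStream Kafka topic) are all assumptions made for the example.

```python
import uuid
from collections import deque

# In-memory stand-in for the SubmitStream Kafka topic (assumption for illustration).
submit_stream = deque()

def submit(tenant_id, template_id, recipients):
    """Validate the payload, fan out one message per recipient, and acknowledge."""
    if not template_id or not recipients:
        return 400, {"error": "template_id and recipients are required"}
    # Validate every recipient first so a bad payload is never partially enqueued.
    for recipient in recipients:
        if "@" not in recipient.get("email", ""):
            return 400, {"error": "invalid address: %r" % recipient.get("email")}
    message_ids = []
    for recipient in recipients:
        message_id = str(uuid.uuid4())  # unique ID used later for delivery tracking
        submit_stream.append({
            "message_id": message_id,
            "tenant_id": tenant_id,
            "template_id": template_id,
            "to": recipient["email"],
            "variables": recipient.get("variables", {}),
        })
        message_ids.append(message_id)
    # 202 Accepted: the messages are enqueued, not yet delivered.
    return 202, {"message_ids": message_ids}
```

A single-recipient transactional email and a bulk campaign go through the same function; the only difference is the length of `recipients`, which is what lets one pipeline serve both workloads.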
RenderWorker consumes from SubmitStream, fetches the email template from TemplateCache (Redis with 98% hit rate), hydrates it with per-recipient variables, and publishes the rendered email to RenderStream. The final stage, SendWorker, consumes rendered emails, checks each recipient against the SuppressionCache (a Bloom filter of 10 billion suppressed addresses using 12GB of Redis memory), and delivers via SMTP if the address is not suppressed. Every state transition is recorded in EventDB (DynamoDB) for full delivery lifecycle tracking. The per-stage queue design means a slow ISP during delivery does not block template rendering, and a rendering bug does not prevent already-rendered emails from being delivered.
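A rough sketch of the render and send stages, under simplifying assumptions: a dict stands in for TemplateCache, a plain set for SuppressionCache, a deque for RenderStream, and a list for EventDB. The function names, the `"suppressed"` event label, and the `smtp_send` callback are hypothetical and exist only for this example.

```python
from collections import deque
from string import Template

# Stand-ins for TemplateCache, SuppressionCache, RenderStream, and EventDB.
template_cache = {"welcome": Template("Hello $name, welcome to $product!")}
suppressed = {"blocked@example.com"}
render_stream = deque()
events = []

def render_worker(message):
    """Hydrate the template with per-recipient variables and enqueue the result."""
    template = template_cache[message["template_id"]]
    # substitute() raises KeyError if a variable is missing, so a broken
    # message fails in this stage and never reaches the send stage.
    body = template.substitute(message["variables"])
    render_stream.append({**message, "body": body})
    events.append((message["message_id"], "rendered"))

def send_worker(smtp_send):
    """Take one rendered message, check suppression, then hand it to SMTP."""
    message = render_stream.popleft()
    if message["to"] in suppressed:
        events.append((message["message_id"], "suppressed"))
        return False
    smtp_send(message["to"], message["body"])
    events.append((message["message_id"], "sent"))
    return True
```

Because the two workers only communicate through `render_stream`, a failure in `smtp_send` leaves already-rendered messages queued rather than forcing them to be rendered again.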
Choice: Three separate processing stages with independent Kafka queues
Rationale: Each stage has fundamentally different failure modes and resource profiles. Template rendering is CPU-bound at 10ms per email, SMTP delivery is I/O-bound at 50ms per email and dependent on receiving ISP responsiveness, and suppression lookup is memory-bound. Separating them means a slow ISP cannot block rendering, and each stage retries independently without re-processing earlier stages.
Choice: Immediate acknowledgment with asynchronous delivery pipeline
Rationale: SMTP delivery takes 50-500ms depending on the receiving ISP. At 165K emails per second during a campaign burst, synchronous delivery would hold roughly 8K to over 80K connections open at once just waiting on SMTP (165K per second multiplied by 0.05 to 0.5 seconds in flight). The async pipeline decouples the API response from delivery, keeping submit latency under 50ms and allowing the submit tier to handle burst traffic without blocking.
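The concurrent-connection estimate follows from Little's law (requests in flight = arrival rate multiplied by time in the system). The helper below is an illustrative calculation using the rates and latencies quoted above; the function name is an assumption.

```python
def concurrent_connections(rate_per_s, latency_ms):
    """Little's law: in-flight requests = arrival rate x time spent in flight."""
    return rate_per_s * latency_ms / 1000

# 165K submissions per second, SMTP latency between 50ms and 500ms.
best_case = concurrent_connections(165_000, 50)    # 8250.0
worst_case = concurrent_connections(165_000, 500)  # 82500.0
```

The "over 80K" figure corresponds to the 500ms end of the latency range; at 50ms the same traffic would still tie up about 8K connections.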
Choice: Redis-backed Bloom filter for 10 billion suppressed addresses
Rationale: A database lookup per email at 165K per second would require 165K read IOPS, which is expensive and adds significant latency. The Bloom filter provides O(1) lookup in approximately 1ms using 12GB of memory. The 0.1% false positive rate is acceptable for compliance purposes because it errs on the side of not sending, which protects sender reputation.
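As a sanity check on the sizing, the textbook Bloom filter formulas relate the number of entries, the bit count, and the false positive rate. The sketch below assumes an optimally tuned number of hash functions and decimal gigabytes; the function names are illustrative.

```python
import math

def bloom_bits(n_items, fp_rate):
    """Bits needed for n_items at a target false positive rate (optimal hash count)."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))

def bloom_fp_rate(n_items, n_bits):
    """Approximate false positive rate for a filter of n_bits holding n_items."""
    return math.exp(-(n_bits / n_items) * math.log(2) ** 2)

n = 10_000_000_000                               # 10 billion suppressed addresses
gb_for_0_1_percent = bloom_bits(n, 0.001) / 8 / 1e9   # about 18 GB
fp_rate_at_12_gb = bloom_fp_rate(n, 12e9 * 8)         # about 0.0099, i.e. roughly 1%
```

Under these formulas, 10 billion entries at a 0.1% false positive rate need about 18GB, while 12GB corresponds to a rate closer to 1%. The 12GB and 0.1% figures are therefore best read as approximate, and a real deployment would pick the memory budget from the false positive rate it can tolerate.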
Choice: On-Demand DynamoDB partitioned by message ID
Rationale: Every email generates multiple lifecycle events (submitted, rendered, sent, delivered, bounced, opened, clicked), so 12K emails per second sustained produce well over 12K write events per second (at least 36K if each message records only submitted, rendered, and sent). DynamoDB's on-demand mode handles this write-heavy workload without capacity planning, and its partition-key access pattern matches the per-message status queries that clients use for delivery tracking.
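The access pattern can be illustrated with an in-memory stand-in for the event table. This is a sketch only: the dict replaces DynamoDB, and using the event timestamp as a sort key within each message ID partition is an assumption, not something the design above specifies.

```python
from collections import defaultdict

# Stand-in for EventDB: partition key (message ID) -> list of lifecycle events.
event_db = defaultdict(list)

def record_event(message_id, status, timestamp):
    """Append one lifecycle event to the message's partition."""
    event_db[message_id].append({"status": status, "timestamp": timestamp})

def get_status(message_id):
    """Return the current status and full history for a single message."""
    # Use .get so that querying an unknown ID does not create an empty partition.
    history = sorted(event_db.get(message_id, []), key=lambda e: e["timestamp"])
    if not history:
        return None
    return {
        "message_id": message_id,
        "current": history[-1]["status"],
        "history": [event["status"] for event in history],
    }
```

Every read and write touches exactly one message ID, which is why a key-value store partitioned on that ID fits both the write stream and the status query API.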
Target RPS: 165,000 peak (campaign burst); 12,000 sustained (transactional)
Latency (p99): < 50ms submit (API); < 5s end-to-end delivery
Storage: Delivery events in DynamoDB; 12GB Bloom filter for suppression
Availability: 99.99% API uptime; > 99% inbox placement (deliverability)
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Each stage of email processing has different scaling needs, failure modes, and retry semantics. Template rendering is CPU-bound and fails on template syntax errors, while SMTP delivery is I/O-bound and fails on ISP-level issues like rate limiting or temporary outages. Separating these stages means a rendering failure retries only rendering without re-submitting, and a slow ISP does not block the rendering of other emails. Large production email platforms such as SES and SendGrid are commonly cited as examples of this staged approach.
When a campaign is submitted, the SubmitService fans out the recipient list into individual messages published to the SubmitStream Kafka topic at up to 165K messages per second. Kafka absorbs this burst and buffers messages while downstream workers process at their own pace. A RenderWorker fleet of 40 instances hydrates templates at 160K per second, and a SendWorker fleet of 80 instances delivers via SMTP at 128K per second. Because the send stage is the slowest, it sets the pace: an entire 100 million email campaign completes in approximately 10-15 minutes (100 million messages at 128K per second is about 13 minutes).
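The fleet numbers above can be checked with a few lines of arithmetic. The helper names are illustrative, and the calculation assumes the stated fleet rates hold steady for the whole campaign.

```python
def per_instance_rate(fleet_rate_per_s, instances):
    """Throughput each worker instance must sustain."""
    return fleet_rate_per_s / instances

def drain_minutes(total_messages, rate_per_s):
    """Minutes needed to process total_messages at a steady rate."""
    return total_messages / rate_per_s / 60

render_per_instance = per_instance_rate(160_000, 40)   # 4000.0 emails/s per RenderWorker
send_per_instance = per_instance_rate(128_000, 80)     # 1600.0 emails/s per SendWorker
campaign_minutes = drain_minutes(100_000_000, 128_000) # about 13 minutes
```

The send stage is the slowest of the three rates (165K in, 160K rendered, 128K sent), so it determines how long the campaign takes; Kafka holds the difference as backlog until the SendWorker fleet catches up.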
A suppression list contains email addresses that should never receive messages, including hard bounces (invalid addresses), unsubscribes, and recipients who have filed spam complaints. Sending to suppressed addresses damages sender reputation with ISPs and can lead to IP blacklisting, which affects deliverability for all emails. Honoring the suppression list is also a legal requirement under CAN-SPAM and GDPR. The system checks every recipient against a Bloom filter of 10 billion suppressed addresses before SMTP delivery, taking approximately 1ms per check.
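To show how the check works mechanically, here is a small in-memory Bloom filter. It is a sketch of the technique rather than the Redis-backed structure described above: the class, the SHA-256 double-hashing scheme, and the `normalize` helper are assumptions made for illustration.

```python
import hashlib

class BloomFilter:
    """Fixed-size bit array; answers 'definitely absent' or 'possibly present'."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False means the item was never added; True may be a false positive.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

def normalize(address):
    """Canonicalize an address so lookups match however it was typed."""
    return address.strip().lower()

suppression = BloomFilter(num_bits=10_000, num_hashes=7)
suppression.add(normalize("Bounced@Example.com"))
```

A send worker would call `suppression.might_contain(normalize(recipient))` and skip delivery on `True`. The filter never reports a suppressed address as absent, which is the property that matters for compliance; the cost is that a small fraction of valid addresses are skipped.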
Hard bounces indicate a permanently invalid address, such as a nonexistent mailbox or domain. The address is immediately added to the suppression list and no further delivery attempts are made. Soft bounces indicate a temporary condition like a full mailbox or a receiving server outage. The system retries soft bounces with exponential backoff, making up to three attempts over 72 hours. If all three attempts fail, the address is promoted to hard bounce status and permanently suppressed.
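The bounce policy can be sketched as a small decision function. The schedule here is an assumption chosen to fit "three attempts over 72 hours": attempts at 0, 24, and 72 hours, with the delay doubling from a 24-hour base. The names `handle_bounce`, `BASE_DELAY`, and `suppression_list` are hypothetical.

```python
from datetime import timedelta

MAX_ATTEMPTS = 3                   # total delivery attempts before giving up
BASE_DELAY = timedelta(hours=24)   # assumed base; delays of 24h then 48h span 72h
suppression_list = set()

def handle_bounce(address, bounce_type, attempt):
    """Decide what to do after delivery attempt number `attempt` (1-based) bounced.

    Returns ("suppress", None) or ("retry", delay_before_next_attempt).
    """
    if bounce_type == "hard" or attempt >= MAX_ATTEMPTS:
        # Hard bounce, or soft bounces exhausted: suppress permanently.
        suppression_list.add(address)
        return "suppress", None
    # Exponential backoff: the delay doubles after each failed attempt.
    return "retry", BASE_DELAY * 2 ** (attempt - 1)
```

For example, `handle_bounce("a@example.com", "soft", 1)` schedules a retry in 24 hours, while a third soft bounce or any hard bounce adds the address to the suppression list.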
SMTP delivery to the receiving ISP takes 50-500ms per email, and the receiving server may impose rate limits or temporary deferrals. Waiting for delivery confirmation synchronously would tie up API threads for hundreds of milliseconds, making it impossible to handle 165K submissions per second during a campaign burst. The 202 Accepted response acknowledges that the email has been reliably enqueued in Kafka, and clients can track the delivery lifecycle asynchronously via the status query API.