Medium | 10 components | Interview: High

Email System — Transactional & Bulk Delivery

Design a high-throughput email delivery pipeline with per-stage queues, template rendering, suppression list enforcement, and delivery tracking.

Message Queue, Pipeline, Deliverability, Event-Driven

Problem Statement

Email delivery systems are a staple of system design interviews because they combine asynchronous processing, pipeline architecture, compliance requirements, and scale challenges into a single problem. Interviewers expect candidates to reason about the difference between transactional emails (password resets, order confirmations) and bulk campaigns (marketing newsletters), and how a single architecture can handle both workloads with very different traffic patterns and latency requirements.

At production scale, services like Amazon SES, Mailchimp, and SendGrid process over one billion emails per day. Transactional traffic runs at a steady 12,000 messages per second (one billion per day averages about 11,600 per second), while bulk campaign bursts can spike to 165,000 messages per second for 10-minute windows when a large campaign is triggered. The system must absorb these bursts without dropping messages or delaying transactional emails, which users expect to arrive within seconds.
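
The traffic figures above can be sanity-checked with a few lines of arithmetic. This is only an illustrative sketch; the function names are made up for the example and are not part of the design.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_rate(messages_per_day: int) -> float:
    """Average messages per second for a given daily volume."""
    return messages_per_day / SECONDS_PER_DAY


def burst_volume(rate_per_second: int, window_seconds: int) -> int:
    """Total messages produced by a burst held at a constant rate."""
    return rate_per_second * window_seconds


# One billion emails per day averages out to roughly 11.6K per second.
daily_average = average_rate(1_000_000_000)

# A 10-minute burst at 165K per second is about 99 million messages.
burst_total = burst_volume(165_000, 10 * 60)
```

A single burst therefore carries roughly a tenth of a full day's volume, which is why the queues in front of each stage matter more than raw average throughput.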

The core challenges include building a multi-stage processing pipeline where each stage scales and retries independently, implementing suppression list checks for CAN-SPAM and GDPR compliance, maintaining sender reputation with ISPs through proper bounce and complaint handling, rendering dynamic templates with per-recipient personalization at high throughput, and tracking the delivery lifecycle of every message from submission through delivery or bounce. A poorly designed email system risks blacklisting by major ISPs, which can take weeks or months to recover from.

Architecture Overview

The architecture implements a multi-stage pipeline with per-stage Kafka queues, enabling each processing step to scale, retry, and apply backpressure independently. The three stages are: submit (API acceptance and validation), render (template hydration with per-recipient variables), and send (suppression check followed by SMTP delivery). This kind of stage separation is a common pattern in large email platforms such as SES, SendGrid, and Mailchimp.

The request flow starts when an application client submits an email or campaign via the REST API. The API Gateway authenticates the request and enforces per-tenant rate limits. The SubmitService validates the payload, generates a unique message ID, publishes the message to the SubmitStream Kafka topic, and returns 202 Accepted as soon as the message is enqueued (async acknowledgment). For bulk campaigns, the SubmitService fans out the recipient list into individual messages. This decouples API response time from actual delivery, keeping submit latency under 50ms even during campaign bursts.
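
The submit path can be sketched as follows. This is a minimal illustration, not the actual SubmitService: an in-memory deque stands in for the SubmitStream Kafka topic, and `SubmitRequest`, `handle_submit`, and the message fields are hypothetical names chosen for the example.

```python
import uuid
from collections import deque
from dataclasses import dataclass


@dataclass
class SubmitRequest:
    tenant_id: str
    template_id: str
    recipients: list  # each item: {"email": str, "variables": dict}


# Stand-in for the SubmitStream Kafka topic (assumption: in-memory queue).
submit_stream = deque()


def handle_submit(request: SubmitRequest):
    """Validate, fan out one message per recipient, enqueue, then acknowledge."""
    if not request.recipients:
        return 400, {"error": "at least one recipient is required"}
    # Validate everything first so a bad payload enqueues nothing.
    for recipient in request.recipients:
        if "@" not in recipient.get("email", ""):
            return 400, {"error": f"invalid address: {recipient.get('email')!r}"}

    message_ids = []
    for recipient in request.recipients:
        message_id = str(uuid.uuid4())
        submit_stream.append({
            "message_id": message_id,
            "tenant_id": request.tenant_id,
            "template_id": request.template_id,
            "to": recipient["email"],
            "variables": recipient.get("variables", {}),
        })
        message_ids.append(message_id)

    # 202 means "accepted for processing", not "delivered".
    return 202, {"message_ids": message_ids}
```

The handler never touches templates or SMTP, so its latency depends only on validation and the enqueue step.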

RenderWorker consumes from SubmitStream, fetches the email template from TemplateCache (Redis with 98% hit rate), hydrates it with per-recipient variables, and publishes the rendered email to RenderStream. The final stage, SendWorker, consumes rendered emails, checks each recipient against the SuppressionCache (a Bloom filter of 10 billion suppressed addresses using 12GB of Redis memory), and delivers via SMTP if the address is not suppressed. Every state transition is recorded in EventDB (DynamoDB) for full delivery lifecycle tracking. The per-stage queue design means a slow ISP during delivery does not block template rendering, and a rendering bug does not prevent already-rendered emails from being delivered.
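
A rough sketch of the render and send stages follows. All of it is assumed for illustration: a dict stands in for TemplateCache, a plain set for the SuppressionCache Bloom filter, a list for EventDB, and `smtp_send` is a caller-supplied function rather than a real SMTP client.

```python
from string import Template

# Hypothetical stand-ins for TemplateCache, SuppressionCache, and EventDB.
template_cache = {"welcome": Template("Hello $first_name, welcome to $product!")}
suppressed = {"bounced@example.com"}
event_log = []


def render(message):
    """Render stage: hydrate the template with per-recipient variables."""
    template = template_cache[message["template_id"]]
    # substitute() raises KeyError when a variable is missing, so a bad
    # payload fails in this stage instead of producing a half-filled email.
    body = template.substitute(message["variables"])
    event_log.append((message["message_id"], "rendered"))
    return {**message, "body": body}


def send(rendered, smtp_send):
    """Send stage: suppression check first, then hand off to SMTP."""
    if rendered["to"].lower() in suppressed:
        event_log.append((rendered["message_id"], "suppressed"))
        return False
    smtp_send(rendered["to"], rendered["body"])
    event_log.append((rendered["message_id"], "sent"))
    return True
```

Because `render` and `send` share no state beyond the message itself, each can sit behind its own queue and be retried without re-running the other.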

Key Design Decisions

Multi-Stage Pipeline

Choice

Three separate processing stages with independent Kafka queues

Rationale

Each stage has fundamentally different failure modes and resource profiles. Template rendering is CPU-bound at 10ms per email, SMTP delivery is I/O-bound at 50ms per email and dependent on receiving ISP responsiveness, and suppression lookup is memory-bound. Separating them means a slow ISP cannot block rendering, and each stage retries independently without re-processing earlier stages.

Async Submit with 202 Accepted

Choice

Immediate acknowledgment with asynchronous delivery pipeline

Rationale

SMTP delivery takes 50-500ms depending on the receiving ISP. At 165K emails per second during a campaign burst, synchronous delivery would hold roughly 8K to 82K connections open at any moment (arrival rate multiplied by delivery time), purely waiting on SMTP. The async pipeline decouples the API response from delivery, keeping submit latency under 50ms and allowing the submit tier to handle burst traffic without blocking.
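
The connection estimate follows from Little's law: the average number of requests in flight equals the arrival rate multiplied by the time each request takes. A small sketch, with a hypothetical helper name:

```python
def in_flight(arrival_rate_per_s: int, latency_ms: int) -> float:
    """Little's law: concurrent requests = arrival rate x time in system."""
    return arrival_rate_per_s * latency_ms / 1000


# Synchronous delivery at the campaign burst rate of 165K emails per second.
fast_isp = in_flight(165_000, 50)    # 8,250 connections at 50ms per delivery
slow_isp = in_flight(165_000, 500)   # 82,500 connections at 500ms per delivery
```

The upper end of that range is what the API tier would have to hold open if it waited for SMTP, which is the load the async design moves into the send stage.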

Bloom Filter Suppression

Choice

Redis-backed Bloom filter for 10 billion suppressed addresses

Rationale

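A database lookup per email at 165K per second would require 165K read IOPS, which is expensive and adds significant latency. The Bloom filter provides O(1) lookup in approximately 1ms using 12GB of memory. At that size, with 10 billion entries, the false positive rate is roughly 1%; reaching 0.1% would take about 18GB. False positives are acceptable for compliance purposes because they err on the side of not sending, which protects sender reputation, and a Bloom filter has no false negatives, so a suppressed address is never mailed.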
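
The memory figure can be checked against the standard Bloom filter sizing formula, m = -n ln(p) / (ln 2)^2 bits for n items at false positive rate p. For 10 billion addresses this gives roughly 12GB at 1% and roughly 18GB at 0.1%. The sketch below includes that calculation and a small in-process filter; it is an illustration using a local bytearray, not the Redis-backed structure itself, and the class and function names are assumptions.

```python
import hashlib
import math


def bloom_bits(n_items: int, fp_rate: float) -> int:
    """Bits needed for n_items at the target false positive rate."""
    return math.ceil(-n_items * math.log(fp_rate) / (math.log(2) ** 2))


def expected_fp_rate(m_bits: float, n_items: int) -> float:
    """False positive rate when the optimal number of hashes is used."""
    return math.exp(-(m_bits / n_items) * math.log(2) ** 2)


class BloomFilter:
    def __init__(self, expected_items: int, fp_rate: float):
        self.m = bloom_bits(expected_items, fp_rate)
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, address: str):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(address.lower().encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # odd, so never zero
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, address: str) -> None:
        for pos in self._positions(address):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, address: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(address))


gb_at_1_percent = bloom_bits(10_000_000_000, 0.01) / 8 / 1e9
gb_at_01_percent = bloom_bits(10_000_000_000, 0.001) / 8 / 1e9
```

A membership test that returns False is definitive, so only the "possibly suppressed" answers would ever need a second look in an authoritative store.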

DynamoDB for Event Tracking

Choice

On-Demand DynamoDB partitioned by message ID

Rationale

Every email generates multiple lifecycle events (submitted, rendered, sent, delivered, bounced, opened, clicked). At 12K messages per second sustained, that is at least 12K writes per second, and several times more once each message records multiple events. DynamoDB's on-demand mode handles this write-heavy workload without capacity planning, and its partition-key access pattern matches the per-message status queries that clients use for delivery tracking.
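
The access pattern can be illustrated with an in-memory model in which the message ID plays the role of the partition key and the event timestamp orders events within it. This is a sketch of the data layout only; `EventStore` and its methods are hypothetical and do not use the DynamoDB API.

```python
import time
from collections import defaultdict

LIFECYCLE = ("submitted", "rendered", "sent", "delivered",
             "bounced", "opened", "clicked")


class EventStore:
    """Events grouped by message ID (partition key), ordered by timestamp."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def record(self, message_id, event, timestamp=None):
        if event not in LIFECYCLE:
            raise ValueError(f"unknown lifecycle event: {event}")
        ts = time.time() if timestamp is None else timestamp
        self._partitions[message_id].append((ts, event))

    def history(self, message_id):
        # A status query touches exactly one partition.
        return sorted(self._partitions.get(message_id, []))

    def latest_status(self, message_id):
        events = self.history(message_id)
        return events[-1][1] if events else None
```

Writes for different messages land in different partitions, which is the property that lets a write-heavy workload spread evenly.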

Scale & Performance

Target RPS

165,000 peak (campaign burst); 12,000 sustained (transactional)

Latency (p99)

< 50ms submit (API); < 5s end-to-end delivery for transactional email

Storage

Delivery events in DynamoDB; 12GB Bloom filter for suppression

Availability

99.99% API uptime; > 99% inbox placement (deliverability)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why use a multi-stage pipeline instead of a single email processing service?

Each stage of email processing has different scaling needs, failure modes, and retry semantics. Template rendering is CPU-bound and fails on template syntax errors, while SMTP delivery is I/O-bound and fails on ISP-level issues like rate limiting or temporary outages. Separating these stages means a rendering failure retries only rendering without re-submitting, and a slow ISP does not block the rendering of other emails. Staged designs of this kind are common in production email platforms such as SES and SendGrid.

How does the system handle a bulk campaign of 100 million emails?

When a campaign is submitted, the SubmitService fans out the recipient list into individual messages published to the SubmitStream Kafka topic at up to 165K messages per second. Kafka absorbs this burst and buffers messages while downstream workers process at their own pace. A RenderWorker fleet of 40 instances hydrates templates at 160K per second, and a SendWorker fleet of 80 instances delivers via SMTP at 128K per second. Because the send stage is the slowest, it sets the overall pace: 100 million emails at 128K per second take about 13 minutes, so the entire campaign completes in approximately 10-15 minutes.
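
The completion time and the amount of buffering Kafka has to provide can be estimated from the stage rates quoted above. The helpers below are an illustrative sketch with made-up names; they assume each stage runs at a constant rate for the whole campaign.

```python
def drain_time_seconds(total_messages: int, stage_rates: dict) -> float:
    """The slowest stage sets the end-to-end completion time."""
    return total_messages / min(stage_rates.values())


def peak_backlog(total_messages: int, upstream_rate: int, downstream_rate: int) -> float:
    """Messages still queued between two stages when the upstream one finishes."""
    upstream_seconds = total_messages / upstream_rate
    return max(0, total_messages - downstream_rate * upstream_seconds)


rates = {"submit": 165_000, "render": 160_000, "send": 128_000}
campaign = 100_000_000

total_seconds = drain_time_seconds(campaign, rates)              # 781.25 s, about 13 min
render_to_send_backlog = peak_backlog(campaign, 160_000, 128_000)  # 20 million messages
```

Under these assumptions the queue in front of SendWorker peaks at about 20 million rendered emails, which gives a sense of the retention and disk the RenderStream topic needs.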

What is a suppression list and why is it critical for email delivery?

A suppression list contains email addresses that should never receive messages, including hard bounces (invalid addresses), unsubscribes, and recipients who have filed spam complaints. Sending to suppressed addresses damages sender reputation with ISPs and can lead to IP blacklisting, which affects deliverability for all emails. Honoring unsubscribes is also a legal requirement under CAN-SPAM and GDPR. The system checks every recipient against a Bloom filter of 10 billion suppressed addresses before SMTP delivery, taking approximately 1ms per check.

How are hard bounces and soft bounces handled differently?

Hard bounces indicate a permanently invalid address, such as a nonexistent mailbox or domain. The address is immediately added to the suppression list and no further delivery attempts are made. Soft bounces indicate a temporary condition like a full mailbox or a receiving server outage. The system retries soft bounces with exponential backoff, making up to three attempts over 72 hours. If all three attempts fail, the address is promoted to hard bounce status and permanently suppressed.
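
The bounce policy above can be sketched as a small state machine. Several details here are assumptions made for the example: SMTP 4xx replies are treated as soft bounces and 5xx as hard bounces (real-world classification is messier), the three attempts are taken to include the original delivery, and a doubling schedule with a 24-hour first gap is one choice that happens to span 72 hours. `BounceHandler` and the helper names are hypothetical.

```python
def classify(smtp_code: int) -> str:
    """Map an SMTP reply code to an outcome (simplified assumption)."""
    if 200 <= smtp_code < 300:
        return "delivered"
    if 400 <= smtp_code < 500:
        return "soft_bounce"   # transient failure
    if 500 <= smtp_code < 600:
        return "hard_bounce"   # permanent failure
    raise ValueError(f"unexpected SMTP code: {smtp_code}")


def attempt_times_hours(attempts=3, first_gap_hours=24, factor=2):
    """Hours after submission at which each attempt is made."""
    times = [0]
    gap = first_gap_hours
    for _ in range(attempts - 1):
        times.append(times[-1] + gap)
        gap *= factor  # exponential backoff: each gap doubles
    return times


class BounceHandler:
    def __init__(self, max_soft_attempts=3):
        self.max_soft_attempts = max_soft_attempts
        self.soft_failures = {}
        self.suppressed = set()

    def handle(self, address: str, smtp_code: int) -> str:
        outcome = classify(smtp_code)
        if outcome == "delivered":
            self.soft_failures.pop(address, None)
            return "delivered"
        if outcome == "hard_bounce":
            self.suppressed.add(address)
            return "suppressed"
        failures = self.soft_failures.get(address, 0) + 1
        self.soft_failures[address] = failures
        if failures >= self.max_soft_attempts:
            # Repeated soft bounces are promoted to a permanent suppression.
            self.suppressed.add(address)
            return "suppressed"
        return "retry"
```

With the default arguments, `attempt_times_hours()` places attempts at 0, 24, and 72 hours after submission.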

Why does the system return 202 Accepted instead of waiting for delivery confirmation?

SMTP delivery to the receiving ISP takes 50-500ms per email, and the receiving server may impose rate limits or temporary deferrals. Waiting for delivery confirmation synchronously would tie up API threads for hundreds of milliseconds, making it impossible to handle 165K submissions per second during a campaign burst. The 202 Accepted response acknowledges that the email has been reliably enqueued in Kafka, and clients can track the delivery lifecycle asynchronously via the status query API.
