Hard | 4 components | Interview: High

Email Service — Naive (Synchronous SMTP)

Q: Why is synchronous SMTP the bottleneck instead of the database?

At 100 emails/sec, the database handles only ~300 queries/sec (INSERT + UPDATE + occasional SELECT) — trivial for PostgreSQL. But each SMTP delivery takes 50-500ms, during which the thread is blocked waiting for the remote mail server. With 250 threads and 200ms average SMTP latency, the ceiling is 1,250 emails/sec theoretical. In practice, SMTP latency variance (p99 = 500ms+) and ISP throttling reduce this to 100-500/sec. The bottleneck is I/O wait, not compute or database throughput.

Q: Why does the lack of a suppression list destroy deliverability?

ISPs (Gmail, Yahoo, Outlook) track per-sending-IP bounce rates. When you send to an address that has hard-bounced (mailbox does not exist), it counts against your IP's reputation. Exceeding approximately 5% bounce rate triggers throttling or blacklisting. Without a suppression list, the service repeatedly sends to invalid addresses, accumulating bounces. Within days of sustained traffic at even modest volume, the sending IP gets blacklisted by major ISPs.

Q: What is the first optimization an interviewer expects?

Decouple the API response from SMTP delivery using a message queue (Kafka, SQS, or RabbitMQ). Instead of blocking the caller for 200ms+ of SMTP I/O, return 202 Accepted immediately and let a background worker handle SMTP delivery. This drops API latency from 200ms to 15ms and eliminates the thread pool bottleneck. The Queue-based variant implements this with Kafka and per-stage workers.
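The decoupling can be sketched with Python's in-process queue.Queue standing in for Kafka, SQS, or RabbitMQ. The names accept_email, worker, and fake_smtp_send are hypothetical, and the SMTP call is a placeholder. A real broker would add the durability and retry behavior that this sketch lacks.

```python
import queue
import threading
import uuid

outbox = queue.Queue()  # stand-in for a durable message queue (assumption)
delivered = []          # records which message_ids the worker "sent"


def fake_smtp_send(email):
    """Hypothetical placeholder for the slow, blocking SMTP call."""
    delivered.append(email["message_id"])


def accept_email(sender, recipient, subject, body_html):
    """API handler: enqueue the email and return 202 without touching SMTP."""
    email = {
        "message_id": str(uuid.uuid4()),
        "sender": sender,
        "recipient": recipient,
        "subject": subject,
        "body_html": body_html,
    }
    outbox.put(email)
    return 202, {"message_id": email["message_id"], "status": "pending"}


def worker():
    """Background worker: the only place where SMTP latency is paid."""
    while True:
        email = outbox.get()
        try:
            if email is None:  # sentinel used to stop the worker
                return
            fake_smtp_send(email)
        finally:
            outbox.task_done()
```

The handler's latency no longer depends on the remote mail server. The trade-off is that the caller learns the delivery outcome later, through a status query.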

Q: How does this compare to what Amazon SES actually does?

Amazon SES does not block the caller on SMTP delivery: the send API accepts the message, returns a message ID, and delivers asynchronously. A service of this kind is typically built as a multi-stage pipeline with separate queues for each processing stage (validation, rendering, suppression check, DKIM signing, SMTP delivery). SES also provides sending quotas, reputation monitoring, IP warming for dedicated IPs, and suppression lists. The naive approach captures none of this. It is the opposite of how production email services work, which is exactly why it is useful as a baseline.

Q: Why is there no template engine in the naive approach?

Adding server-side template rendering requires a template storage layer, a variable substitution engine, and per-recipient personalization logic. The naive approach pushes this responsibility to the caller — the API expects a fully rendered HTML body in every request. For transactional emails this is workable (the caller's application renders the template), but for bulk campaigns with millions of recipients it means millions of nearly-identical API calls with megabytes of redundant HTML.

Q: What happens when an ISP temporarily rejects an email?

ISPs return 4xx reply codes (such as 421, 450, 451, or 452) for temporary conditions like a busy server, a full mailbox, or rate limiting. In the naive approach, any SMTP error is treated as a permanent failure: the email is marked as failed in the database and an error is returned to the caller. Production email services would retry with exponential backoff (first retry in 5 minutes, then 15, then 60, up to 72 hours). Without retries, emails that could have been delivered on a second attempt are lost unless the caller resubmits them.

The simplest email service: a monolithic EmailService receives API requests and sends synchronously via SMTP, blocking the caller for 50-500ms per email. No queue, no suppression list, no DKIM signing. Demonstrates why synchronous SMTP is the bottleneck at scale.

Tags: Email, Beginner, Bottleneck Analysis, SMTP, Synchronous

Problem Statement

Designing an email delivery service is one of the most practical system design interview questions because it forces candidates to reason about the tension between synchronous simplicity and asynchronous scalability. The naive synchronous SMTP approach is where every candidate should start — it establishes the baseline that makes the improvements in queue-based and pipeline architectures measurable and concrete.

The core challenge is accepting an email via REST API and delivering it to the recipient's mail server via SMTP. In the naive approach, this is done synchronously: the EmailService opens an SMTP connection to the recipient's MX server, performs the EHLO/MAIL FROM/RCPT TO/DATA handshake, waits for the remote server to accept the message, and returns the HTTP response only after the SMTP transaction completes. This means every API call blocks for the full SMTP round-trip — typically 50-500ms depending on the receiving ISP, DNS MX lookup time, TLS negotiation, and the remote server's processing speed.
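The blocking send path described above might look like the following sketch, which uses Python's standard smtplib. The function names build_message and send_sync are hypothetical. The MX lookup is assumed to have been done already (the standard library has no MX resolver), so mx_host is passed in by the caller.

```python
import smtplib
from email.message import EmailMessage


def build_message(sender, recipient, subject, body_html):
    """Build a multipart message with a plain-text fallback and an HTML body."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content("This message requires an HTML-capable mail client.")
    msg.add_alternative(body_html, subtype="html")
    return msg


def send_sync(msg, mx_host, port=25, timeout=30.0):
    """Deliver one message and block until the SMTP transaction finishes.

    Returns (ok, smtp_code). smtp_code is None for connection-level
    failures such as DNS errors, refused connections, or timeouts.
    """
    try:
        with smtplib.SMTP(mx_host, port, timeout=timeout) as smtp:
            smtp.ehlo()
            if smtp.has_extn("starttls"):
                smtp.starttls()
                smtp.ehlo()  # re-identify after the TLS upgrade
            smtp.send_message(msg)  # issues MAIL FROM, RCPT TO, DATA
        return True, 250
    except smtplib.SMTPResponseException as exc:
        return False, exc.smtp_code
    except (smtplib.SMTPException, OSError):
        return False, None
```

Every line inside the with block is a network round trip, and the calling thread can do nothing else until the last one returns.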

The synchronous approach has a hard throughput ceiling equal to the number of concurrent threads divided by the average SMTP latency. With 5 pods running 50 threads each (250 concurrent connections) and an average SMTP latency of 200ms, the theoretical maximum is 250 / 0.2s = 1,250 emails per second. In practice, SMTP latency variance (p99 exceeds 500ms when remote ISPs are slow or throttling) reduces effective throughput to approximately 100-500 emails per second before thread pool exhaustion causes cascading failures. The service simply runs out of threads waiting for SMTP responses.
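The ceiling can be checked with a few lines of arithmetic. This is a sketch using the pod and thread counts from the text. The function name throughput_ceiling is hypothetical, and the model ignores queueing effects and latency variance.

```python
def throughput_ceiling(threads, avg_latency_s):
    """Maximum sends per second when every thread is always busy on SMTP."""
    return threads / avg_latency_s


pods, threads_per_pod = 5, 50
threads = pods * threads_per_pod  # 250 concurrent SMTP connections

# Average latencies taken from the 50-500ms range quoted in the text.
for latency_ms in (50, 200, 500):
    ceiling = throughput_ceiling(threads, latency_ms / 1000)
    print(f"{latency_ms}ms avg SMTP latency: {ceiling:.0f} emails/sec")
```

At a 500ms average the ceiling is already down to 500 emails/sec, which matches the upper end of the sustained range quoted above.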

Beyond the throughput ceiling, the naive approach has critical deliverability problems. There is no suppression list, so the service sends to addresses that have previously hard-bounced (invalid mailbox, domain does not exist). ISPs track per-sender-IP bounce rates and blacklist senders exceeding approximately 5%. Within days of sustained traffic, the sending IP will be blacklisted by Gmail, Yahoo, and Outlook, effectively killing deliverability. There is no DKIM signing and no SPF or DMARC setup, so most ISPs treat the emails as unauthenticated, classify them as potential phishing, and route them to spam. Without this authentication, deliverability drops from 99% (properly authenticated) to under 50%.

There is no retry logic — if the SMTP connection fails due to a temporary ISP issue (soft bounce, 421 response), the email is permanently lost. Production email services retry soft bounces with exponential backoff over 72 hours. The naive approach returns an error to the caller and moves on. There is no template engine, so the caller must provide fully rendered HTML in every API request, making bulk campaigns impractical.

This template makes the SMTP bottleneck visible and quantifiable. Run the simulation at increasing RPS and watch the EmailService thread pool saturate while the database sits idle. The comparison with the Queue-based and Pipeline variants provides the concrete numbers to support the discussion of async decoupling, suppression, and authentication that interviewers expect.

Email service design appears in interviews at Amazon (SES), Google (Gmail infrastructure), Microsoft (Outlook), Mailchimp, SendGrid, Twilio, and Postmark. Interviewers expect candidates to start with synchronous SMTP, identify the blocking I/O bottleneck, and propose async queue-based delivery as the first optimization — then discuss suppression lists, DKIM signing, and IP reputation management for the advanced variant.

Architecture Overview

The naive email service is a four-component linear architecture: Client, Load Balancer, EmailService, and PostgreSQL database. There is no cache, no event stream, no worker pool, and no separation between the API acceptance and SMTP delivery phases.

All traffic enters through the Load Balancer, which distributes requests across EmailService pods using round-robin. The Load Balancer adds approximately 1.5ms of routing latency and supports up to 10,000 concurrent connections — well above the system's actual ceiling, which is determined by the EmailService thread pool and SMTP latency. Both email sends and status queries flow through the same LB and service.

The EmailService is a monolithic REST API running on 5 pods with 50 threads each (250 total concurrent connections). It handles two operations: (1) send email — validate the request (from, to, subject, body_html), write a pending record to EmailDB, open a synchronous SMTP connection to the recipient's MX server, deliver the email, update the DB record with the result, and return the HTTP response; (2) status query — look up the email record by message_id and return the delivery status (pending, sent, failed). The critical bottleneck is the synchronous SMTP call in the send path. Every thread that handles a send request is blocked for 50-500ms waiting for the remote mail server to respond. This means the effective throughput is bounded by (thread_count / avg_smtp_latency_seconds).

PostgreSQL stores a single table: emails. Each email gets a row written on submission with status=pending (INSERT, ~20ms) and updated after SMTP delivery with status=sent or status=failed (UPDATE, ~15ms). The database is not the bottleneck: at 100 emails/sec, it handles only about 200 write queries/sec (one INSERT + one UPDATE per email), or roughly 300 queries/sec once status lookups are included, well within a single PostgreSQL instance's capacity. The bottleneck is entirely in the SMTP I/O path.

The system has no redundancy at any layer. A single PostgreSQL primary handles all reads and writes. There is no read replica, no cache, and no failover. If the database goes down, both sends and status queries fail. If SMTP connectivity is lost (DNS failure, network partition), all send threads block until timeout (30 seconds), exhausting the thread pool and causing the service to stop accepting new requests entirely.
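A rough calculation shows how quickly an SMTP outage takes the service down. This sketch assumes every send hangs for the full 30-second timeout. The function names and the 100 sends/sec arrival rate are illustrative assumptions.

```python
def seconds_until_exhausted(threads, send_rps):
    """Time for incoming sends to occupy every thread when none complete."""
    return threads / send_rps


def ceiling_during_outage(threads, timeout_s):
    """Sends per second the pool can start while each call waits for timeout."""
    return threads / timeout_s


threads = 250     # 5 pods x 50 threads, from the text
timeout_s = 30    # SMTP timeout, from the text
send_rps = 100    # assumed arrival rate for illustration

print(f"pool exhausted after {seconds_until_exhausted(threads, send_rps):.1f}s")
print(f"capacity during outage: {ceiling_during_outage(threads, timeout_s):.1f} sends/sec")
```

Under these assumptions the pool fills in 2.5 seconds and then frees only about 8 threads per second, so nearly all new requests are rejected until SMTP connectivity returns.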

The concrete scaling ceiling is approximately 100-500 emails per second sustained, depending on the distribution of recipient ISPs. ISPs with fast SMTP responses (Gmail averages 80ms) allow higher throughput than ISPs with slow responses (some corporate mail servers take 500ms+). At peak load, thread pool exhaustion manifests as HTTP 503 responses to new callers while all threads are blocked on SMTP I/O.

Key Design Decisions

Synchronous SMTP Delivery

Choice

Block the HTTP response until SMTP delivery completes

Rationale

Synchronous SMTP is the simplest implementation: no Kafka, no workers, no consumer lag. The caller knows immediately whether the recipient's mail server accepted the email (200 = accepted for delivery, 500 = SMTP failure). The cost is that every request blocks for 50-500ms of SMTP I/O, creating a hard throughput ceiling. The Queue-based variant decouples API response from SMTP delivery, returning 202 Accepted in 15ms regardless of SMTP speed.

No Suppression List

Choice

Send to any recipient address without checking bounce history

Rationale

A suppression list (hard bounces, unsubscribes, spam complaints) requires maintaining a database or Bloom filter of suppressed addresses and checking every recipient before sending. The naive approach skips this for simplicity. The consequence is that the sending IP's bounce rate climbs above 5% within days, triggering ISP blacklisting. The Pipeline variant maintains a Bloom filter of 10B suppressed addresses in Redis.
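For comparison, the check that the naive approach skips can be sketched with an in-memory set. This is an illustrative assumption, not the Pipeline variant's Redis Bloom filter. The class and function names are hypothetical, and lowercasing addresses is a common simplification rather than a rule.

```python
class SuppressionList:
    """In-memory suppression list keyed by normalized address."""

    def __init__(self):
        self._reasons = {}  # address -> reason it was suppressed

    @staticmethod
    def _normalize(address):
        return address.strip().lower()

    def add(self, address, reason):
        self._reasons[self._normalize(address)] = reason

    def is_suppressed(self, address):
        return self._normalize(address) in self._reasons


def record_smtp_result(suppression, recipient, smtp_code):
    """Suppress the recipient after a hard bounce (a 5xx reply code)."""
    if smtp_code is not None and 500 <= smtp_code < 600:
        suppression.add(recipient, f"hard bounce ({smtp_code})")


def should_send(suppression, recipient):
    """Pre-send check: skip addresses that are already suppressed."""
    return not suppression.is_suppressed(recipient)
```

With this check in place, each invalid address costs one bounce instead of one bounce per send attempt.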

No DKIM/SPF/DMARC Authentication

Choice

Send emails without cryptographic authentication headers

Rationale

DKIM signing requires generating RSA/Ed25519 signatures for every email, SPF requires DNS TXT records, and DMARC requires alignment. Without these, receiving ISPs classify emails as unauthenticated — deliverability drops to under 50%. The Pipeline variant adds a dedicated SigningService that adds DKIM signatures to every outbound email.

Single PostgreSQL Database

Choice

One database for email records with no read replicas

Rationale

A single PostgreSQL instance handles the email record log (INSERT on send, UPDATE on delivery, SELECT on status query). At 100 emails/sec, the database handles ~300 queries/sec — trivial for PostgreSQL. The database is not the bottleneck; SMTP I/O is. Adding read replicas would not improve throughput because the constraint is thread pool saturation from SMTP blocking.

No Retry Logic

Choice

If SMTP fails, mark the email as failed and return an error

Rationale

Production email services retry soft bounces (421 — temporary failure) with exponential backoff over 72 hours, because temporary ISP issues resolve themselves. The naive approach treats all SMTP failures as permanent — a single network glitch means the email is lost. The Queue-based variant retries at the Kafka consumer level with per-stage backoff.
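The retry policy that the naive approach lacks can be sketched as a reply-code classifier plus a backoff schedule. The 5, 15, and 60 minute steps and the 72-hour limit come from the text. Holding the delay at 60 minutes after the third attempt and retrying connection-level failures are assumptions, and the function names are hypothetical.

```python
from datetime import timedelta

RETRY_DELAYS_MIN = (5, 15, 60)  # schedule from the text
MAX_AGE = timedelta(hours=72)   # give up after 72 hours


def classify(smtp_code):
    """Map an SMTP reply code to 'sent', 'retry', or 'failed'."""
    if smtp_code is None:
        return "retry"   # assumption: connection errors are transient
    if 200 <= smtp_code < 300:
        return "sent"
    if 400 <= smtp_code < 500:
        return "retry"   # soft bounce, e.g. 421
    return "failed"      # hard bounce, e.g. 550


def next_retry_delay(failed_attempts, age):
    """Delay before the next attempt, or None when the email should be dropped.

    failed_attempts is the number of attempts that have already failed (>= 1);
    age is the time elapsed since the email was first submitted.
    """
    index = min(failed_attempts - 1, len(RETRY_DELAYS_MIN) - 1)
    delay = timedelta(minutes=RETRY_DELAYS_MIN[index])
    if age + delay > MAX_AGE:
        return None
    return delay
```

The naive approach is equivalent to a classifier that returns 'failed' for every code other than 2xx.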

Scale & Performance

Target RPS

~100-500 sustained (SMTP ceiling)

Latency (p99)

200-500ms per email send (synchronous SMTP)

Storage

~250 MB/month at naive scale

Availability

~99% (single DB, no redundancy)

Time & Space Complexity

Operation	Time	Space	Notes
Send email (POST /api/v1/emails)	O(1) DB write + O(1) SMTP send (~200ms blocking I/O)	O(1) per email record (~2KB with HTML body)	SMTP latency dominates. DB INSERT is 20ms; SMTP is 50-500ms. Thread blocked for full duration.
Get status (GET /api/v1/emails/{id}/status)	O(1) primary key lookup (~5ms)	O(1) single row read	Fast indexed lookup. Not the bottleneck — status queries are 20% of traffic.
Thread pool throughput ceiling	O(threads / avg_smtp_latency)	O(threads) concurrent connections	250 threads / 0.2s = 1,250 theoretical max. Variance reduces effective to ~100-500/sec.

Database Schema (HLD)

emails

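Email record log storing every email submitted through the API. Written on submission with status=pending, updated after SMTP delivery with status=sent (success) or status=failed (SMTP error). Single table, no partitioning. The ~250MB/month growth figure (with indexes) corresponds to roughly 125,000 emails/month at ~2KB per row. At a sustained ~100 emails/sec (about 259 million rows/month), the same row size would grow the table by roughly 500GB/month.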

message_id UUID PK (generated on submission)
sender VARCHAR NOT NULL (sender email address)
recipient VARCHAR NOT NULL (recipient email address)
subject VARCHAR NOT NULL (email subject line)
body_html TEXT NOT NULL (rendered HTML body)
status VARCHAR NOT NULL (pending/sent/failed)
smtp_response_code INTEGER (250, 421, 550, etc.)
created_at TIMESTAMPTZ NOT NULL (submission time)
sent_at TIMESTAMPTZ (delivery completion time)

Indexes: idx_emails_sender ON (sender, created_at DESC), idx_emails_recipient ON (recipient, created_at DESC), idx_emails_status ON (status)
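The record lifecycle (INSERT as pending, UPDATE after SMTP, SELECT for status) can be exercised with the sketch below. It uses Python's built-in sqlite3 as a stand-in for PostgreSQL, so UUID and TIMESTAMPTZ columns become TEXT. The helper names are hypothetical, and setting sent_at only on success is an assumption.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE emails (
    message_id          TEXT PRIMARY KEY,
    sender              TEXT NOT NULL,
    recipient           TEXT NOT NULL,
    subject             TEXT NOT NULL,
    body_html           TEXT NOT NULL,
    status              TEXT NOT NULL CHECK (status IN ('pending', 'sent', 'failed')),
    smtp_response_code  INTEGER,
    created_at          TEXT NOT NULL,
    sent_at             TEXT
);
CREATE INDEX idx_emails_sender ON emails (sender, created_at DESC);
CREATE INDEX idx_emails_recipient ON emails (recipient, created_at DESC);
CREATE INDEX idx_emails_status ON emails (status);
"""


def _now():
    return datetime.now(timezone.utc).isoformat()


def insert_pending(conn, sender, recipient, subject, body_html):
    """Step 1 of the send path: write the record before touching SMTP."""
    message_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO emails (message_id, sender, recipient, subject, body_html,"
        " status, created_at) VALUES (?, ?, ?, ?, ?, 'pending', ?)",
        (message_id, sender, recipient, subject, body_html, _now()),
    )
    return message_id


def record_result(conn, message_id, smtp_code):
    """Step 2: store the SMTP outcome; any non-2xx code is marked failed."""
    ok = smtp_code is not None and 200 <= smtp_code < 300
    conn.execute(
        "UPDATE emails SET status = ?, smtp_response_code = ?, sent_at = ?"
        " WHERE message_id = ?",
        ("sent" if ok else "failed", smtp_code, _now() if ok else None, message_id),
    )


def get_status(conn, message_id):
    """Status query: primary key lookup, or None for an unknown message_id."""
    row = conn.execute(
        "SELECT status FROM emails WHERE message_id = ?", (message_id,)
    ).fetchone()
    return row[0] if row else None
```

In the naive send path the blocking SMTP call sits between insert_pending and record_result, which is why a row can stay pending for up to the SMTP timeout.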

Whatever the table's growth rate, the database is not the bottleneck; SMTP I/O is.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Synchronous SMTP)	T1	200-500ms per send (SMTP blocking)	~100-500 emails/sec	$200/month (single DB + 5 pods)	Low — 4 components, linear flow	99% (no retry, no redundancy)
Queue-Based Pipeline (Kafka + Workers)	T2	<15ms API, <5s delivery (async)	12K/sec sustained, 165K/sec burst	$2,500/month (Kafka + workers + caches)	Medium — 10 components, per-stage queues	99.9% (Kafka replay, per-stage retry)
Multi-Stage Pipeline (IP Reputation + Webhooks)	T3	<15ms API, <5s transactional delivery	12K/sec trans + 165K/sec bulk	$5,000/month (dual streams, signing, webhooks)	High — 12+ components, dual delivery paths	99.9% (IP failover, auto-quarantine)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Related Templates

Email Service: Queue-Based Pipeline (Kafka + Workers)
Email Service: Multi-Stage Pipeline (IP Reputation + Webhooks)
