Hard | 12 components | Interview: High

Email Service — Multi-Stage Pipeline (IP Reputation + Webhooks)

Production-grade email platform with separate transactional and bulk delivery paths, per-IP reputation scoring, DKIM/SPF/DMARC signing, bounce/complaint feedback loops, suppression list, and webhook delivery. The architecture behind Amazon SES, SendGrid, and Postmark.

Tags: Email, Production-Grade, IP Reputation, DKIM, Webhooks, Dual-Path
Problem Statement

The multi-stage pipeline with IP reputation management is the production-grade architecture used by email delivery platforms like Amazon SES, SendGrid, Postmark, and Mailchimp. It solves the two critical limitations of the queue-based approach: (1) the lack of transactional/bulk path separation that allows a bad bulk campaign to destroy transactional deliverability, and (2) the absence of per-IP reputation scoring that enables automatic detection and quarantine of degraded sending IPs.

The key insight is that transactional emails and bulk marketing emails have different requirements. Transactional emails (password resets, 2FA codes, order confirmations) are time-sensitive: a password reset that arrives 30 minutes late is useless. They must be sent from high-reputation IPs that are never contaminated by bulk campaign bounce rates. Bulk marketing emails (newsletters, promotions) are latency-tolerant but volume-heavy: sending 100 million emails in a single campaign requires careful per-ISP rate shaping to avoid triggering spam filters.

Sharing a single delivery path means a bulk campaign with a 3% bounce rate (a bad but not unusual recipient list) degrades the shared IP pool's reputation score, which then throttles or blocks transactional emails that share those IPs. A user waiting for a password reset email that never arrives because a marketing campaign poisoned the IP is the worst-case failure mode for an email platform. Separating transactional and bulk into dedicated Kafka streams with dedicated sender pools and dedicated IP pools eliminates this cross-contamination entirely.

Per-IP reputation scoring is the second critical addition. The system tracks bounce rate, complaint rate, and spam trap hits for every sending IP address. When an IP's bounce rate exceeds 5%, it is automatically quarantined — removed from the active sending pool until its metrics recover. This prevents a single bad campaign from permanently destroying an IP's reputation. IP warming schedules control how quickly new IPs ramp up volume (starting at 1K/day, doubling weekly until reaching full capacity). Without warming, a new IP sending 100K emails on day one would be immediately flagged by ISPs as suspicious.

Email authentication via a dedicated SigningService adds a DKIM signature to every outbound email; SPF and DMARC complement it as policies the sending domain publishes in DNS, not as per-message signatures. DKIM (DomainKeys Identified Mail) signs selected headers and a hash of the body with the domain's private key (Ed25519 in this design; not every receiver verifies Ed25519 yet, so an RSA signature is often added alongside), allowing receiving ISPs to verify the email was not tampered with in transit. SPF (Sender Policy Framework) declares which IPs are authorized to send on behalf of the domain. DMARC (Domain-based Message Authentication, Reporting, and Conformance) ties DKIM and SPF together with a policy for handling authentication failures. Without these, deliverability drops sharply (from 99%+ to under 50% in this template's estimate), because most ISPs classify unauthenticated emails as potential spam.
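To make the DNS side of this concrete, here is a minimal sketch that parses the `tag=value` form used by DKIM and DMARC TXT records. The domain, selector, key placeholder, and report address are invented for illustration (example.com and 192.0.2.0/24 are reserved for documentation), and `parse_tag_list` is a hypothetical helper, not a function from any library.

```python
from typing import Dict

def parse_tag_list(record: str) -> Dict[str, str]:
    """Parse a 'tag=value; tag=value' TXT record into a dict."""
    tags: Dict[str, str] = {}
    for part in record.split(";"):
        part = part.strip()
        if not part:
            continue  # tolerate a trailing semicolon
        name, _, value = part.partition("=")  # split on the first '=' only
        tags[name.strip()] = value.strip()
    return tags

# Illustrative records only; none of these values come from the template.
# SPF, published at example.com (space-separated terms, not a tag list):
SPF_RECORD = "v=spf1 ip4:192.0.2.0/24 -all"
# DKIM public key, published at selector1._domainkey.example.com:
DKIM_RECORD = "v=DKIM1; k=ed25519; p=BASE64_PUBLIC_KEY_PLACEHOLDER"
# DMARC policy, published at _dmarc.example.com:
DMARC_RECORD = "v=DMARC1; p=quarantine; rua=mailto:dmarc-reports@example.com"

dkim = parse_tag_list(DKIM_RECORD)
dmarc = parse_tag_list(DMARC_RECORD)
print("DKIM key type:", dkim["k"])
print("DMARC policy:", dmarc["p"])
```

A receiver looks up the DKIM record to verify the signature, checks the connecting IP against the SPF record, and then applies the DMARC `p` policy if neither check aligns with the sender's domain.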

Bounce and complaint feedback loops close the reputation management loop. ISPs publish feedback loop (FBL) reports when recipients mark emails as spam. The BounceProcessor consumes these reports along with SMTP bounce notifications, updates the sending IP's reputation metrics in ReputationDB, and adds permanently-failed addresses to the SuppressionCache. This creates a self-correcting system: bad addresses are automatically suppressed, degraded IPs are automatically quarantined, and healthy IPs continue serving traffic.

Webhook delivery provides real-time event notifications to senders. Instead of requiring senders to poll an events API at high frequency, the WebhookWorker pushes delivery events (sent, bounced, complained, opened, clicked) to the sender's configured webhook URL as they occur. This reduces API polling load by 100x and enables senders to react to bounces and complaints in real time — updating their own databases, pausing campaigns with high bounce rates, or triggering re-engagement flows.

This architecture appears in senior system design interviews at Amazon (SES), Google (Gmail infrastructure), Microsoft (Outlook), Twilio (SendGrid), and any company operating email at scale. Interviewers expect candidates to articulate why transactional/bulk separation is critical, explain the IP reputation lifecycle (warming, monitoring, quarantine), and reason about the trade-offs of adding operational complexity (12+ components) for deliverability guarantees.

Architecture Overview

The multi-stage pipeline uses 12+ components organized into three layers: an API layer that accepts and classifies emails, dual delivery pipelines for transactional and bulk traffic, and a feedback layer for bounce processing and webhook delivery.

The API layer consists of the Client, API Gateway, Load Balancer, and EmailService. The API Gateway authenticates API keys (~3ms), enforces per-tenant rate limits (200K RPS cap), and routes to the Load Balancer. The EmailService (20 pods, 100 threads each = 2,000 concurrent) handles all inbound requests. For transactional emails (POST /api/v1/emails), the service validates the request, optionally renders a template from TemplateDB, and publishes to TransactionalStream (Kafka). For bulk campaigns (POST /api/v1/campaigns), it fans out the recipient list and publishes individual messages to BulkStream (Kafka). Both return 202 Accepted immediately.
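The classification and fan-out step can be sketched as follows. This is a simplified in-memory model, not the EmailService implementation: `OutboundMessage`, `submit_email`, and `submit_campaign` are hypothetical names, and the actual Kafka publish call is left out so the example stays self-contained.

```python
import uuid
from dataclasses import dataclass
from typing import Iterable, Iterator

TRANSACTIONAL_STREAM = "TransactionalStream"
BULK_STREAM = "BulkStream"

@dataclass(frozen=True)
class OutboundMessage:
    stream: str        # which Kafka stream the message is destined for
    message_id: str
    recipient: str
    template_id: str

def submit_email(recipient: str, template_id: str) -> OutboundMessage:
    """POST /api/v1/emails: one request becomes one transactional message."""
    return OutboundMessage(TRANSACTIONAL_STREAM, str(uuid.uuid4()), recipient, template_id)

def submit_campaign(recipients: Iterable[str], template_id: str) -> Iterator[OutboundMessage]:
    """POST /api/v1/campaigns: fan the recipient list out into per-recipient bulk messages."""
    for recipient in recipients:
        yield OutboundMessage(BULK_STREAM, str(uuid.uuid4()), recipient, template_id)

reset = submit_email("user@example.com", "password-reset")
campaign = list(submit_campaign(["a@example.com", "b@example.org"], "spring-sale"))
print(reset.stream, len(campaign))
```

In a real service each message would be handed to a Kafka producer and the handler would return 202 Accepted without waiting for delivery.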

The dual delivery pipelines are the core architectural innovation. TransactionalStream (Kafka, 32 partitions, partitioned by message_id) feeds TransactionalSender (20 workers). BulkStream (Kafka, 64 partitions, partitioned by recipient_domain) feeds BulkSender (100 workers). Because a Kafka partition is consumed by at most one member of a consumer group, the 100 BulkSender workers presumably sit behind at most 64 consumers (for example, as delivery threads) unless the partition count is raised. Both sender types follow the same processing flow: check SuppressionCache (Bloom filter, ~1ms), call SigningService to add the DKIM signature (~5ms), deliver via SMTP (~50ms), and write delivery events to ReputationDB. The critical difference is that BulkSender also reads per-IP reputation scores from ReputationDB to select sending IPs and applies per-ISP rate limits, throttling sends to ISPs where the IP's bounce rate is elevated.
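One common way to implement per-ISP rate shaping is a token bucket per recipient domain. The sketch below assumes that technique; the template does not say which algorithm BulkSender uses, and the class names, the rates, and the one-second burst capacity are all illustrative.

```python
import time
from typing import Dict, Optional

class TokenBucket:
    """Refills at rate_per_sec up to capacity; each send consumes one token."""

    def __init__(self, rate_per_sec: float, capacity: float, now: float):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def try_acquire(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

class PerIspLimiter:
    """Keeps one bucket per recipient domain (assumed stand-in for 'per ISP')."""

    def __init__(self, rates: Dict[str, float], default_rate: float = 10.0):
        self.rates = rates
        self.default_rate = default_rate
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, recipient: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        domain = recipient.rsplit("@", 1)[-1].lower()
        bucket = self.buckets.get(domain)
        if bucket is None:
            rate = self.rates.get(domain, self.default_rate)
            bucket = TokenBucket(rate, capacity=rate, now=now)  # one second of burst
            self.buckets[domain] = bucket
        return bucket.try_acquire(now)

limiter = PerIspLimiter({"gmail.com": 2.0})
print(limiter.allow("a@gmail.com"))
```

A sender that is refused a token would typically requeue the message and try again later; lowering a domain's rate when the IP's bounce rate rises gives the throttling behaviour described above.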

TransactionalSender uses a dedicated pool of high-reputation IPs that are reserved exclusively for transactional traffic. These IPs have consistently low bounce rates (<0.5%) because transactional emails go to addresses that the user has actively used (they just logged in, placed an order, requested a password reset). BulkSender uses a separate pool of IPs, some of which may be in warming stages. If a bulk campaign degrades one of these IPs, it has zero impact on transactional deliverability.

The SigningService (10 pods, 200 threads each) centralizes DKIM key management. Ed25519 private keys are stored in AWS Secrets Manager and rotated quarterly. Both TransactionalSender and BulkSender call SigningService before every SMTP delivery. Signing adds ~5ms per email but is non-negotiable for deliverability — unsigned emails are classified as spam by most ISPs.

The feedback layer consists of BounceProcessor and WebhookWorker. BounceProcessor (10 workers) consumes SMTP bounce notifications and ISP FBL complaint reports. Hard bounces (550 — invalid address) immediately add the address to SuppressionCache and increment the sending IP's bounce counter in ReputationDB. Soft bounces (421 — temporary failure) increment a per-address counter; 3 soft bounces promote to hard bounce. IPs exceeding 5% bounce rate are auto-quarantined in ReputationDB. WebhookWorker (20 workers) reads delivery events from ReputationDB and POSTs them to tenant-configured webhook URLs with HMAC signatures for verification, retrying with exponential backoff on failure.
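The bounce-handling rules above can be sketched with in-memory stand-ins for SuppressionCache and ReputationDB. `BounceTracker` is a hypothetical name, and two simplifications are assumptions rather than statements from the template: every 5xx code is treated as a hard bounce, and only hard bounces (including promoted soft bounces) increment the IP counter.

```python
from collections import defaultdict
from typing import DefaultDict, Set

SOFT_BOUNCE_LIMIT = 3  # from the text: 3 soft bounces promote to a hard bounce

class BounceTracker:
    def __init__(self) -> None:
        self.suppressed: Set[str] = set()                         # stand-in for SuppressionCache
        self.soft_counts: DefaultDict[str, int] = defaultdict(int)
        self.ip_bounces: DefaultDict[str, int] = defaultdict(int)  # stand-in for ReputationDB

    def record(self, recipient: str, sending_ip: str, smtp_code: int) -> str:
        """Classify one SMTP response and update suppression and IP counters."""
        address = recipient.strip().lower()
        if 500 <= smtp_code <= 599:
            kind = "hard"
        elif 400 <= smtp_code <= 499:
            self.soft_counts[address] += 1
            promoted = self.soft_counts[address] >= SOFT_BOUNCE_LIMIT
            kind = "hard" if promoted else "soft"
        else:
            return "ignored"  # 2xx/3xx responses are not bounces
        if kind == "hard":
            self.suppressed.add(address)
            self.ip_bounces[sending_ip] += 1
        return kind

tracker = BounceTracker()
print(tracker.record("gone@example.com", "192.0.2.10", 550))
```

Dividing an IP's bounce counter by its send count over the same window yields the bounce rate that the 5% quarantine rule is applied to.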

SuppressionCache (Redis, 6 nodes) holds the Bloom filter of 10B suppressed addresses (~12GB). ReputationDB (PostgreSQL, 32 partitions) stores per-IP reputation metrics, delivery event logs, webhook configurations, and IP warming schedules. TemplateDB (PostgreSQL, 16 partitions) stores email templates with per-tenant namespacing.

Key Design Decisions
Separate Transactional and Bulk Delivery Paths

Choice

Dedicated Kafka streams and sender pools for transactional vs bulk email

Rationale

Transactional emails (password resets, 2FA) must be delivered in under 5 seconds from high-reputation IPs. Bulk campaigns take hours and may degrade IP reputation. Sharing a delivery path means a bad bulk campaign blocks and contaminates transactional traffic. Separate paths ensure transactional emails skip the bulk queue and use dedicated high-reputation IPs that are never exposed to bulk bounce rates.

Per-IP Reputation Scoring with Auto-Quarantine

Choice

Track bounce rate, complaint rate, and spam trap hits per sending IP in ReputationDB

Rationale

ISPs blacklist IPs exceeding ~5% bounce rate. With multiple sending IPs, reputation scoring enables automatic quarantine: when an IP's metrics degrade, it is removed from the active pool. Traffic shifts to healthy IPs while the quarantined IP recovers. This self-correcting system prevents a single bad campaign from permanently destroying sending capacity.

Dedicated DKIM/SPF/DMARC Signing Service

Choice

Centralized signing service called by both sender types before every SMTP delivery

Rationale

DKIM signing requires private key access and cryptographic computation. Centralizing this in a dedicated service provides single-point key management (one place to rotate keys), consistent signing across both paths, and isolated scaling of the cryptographic workload. Signing adds ~5ms per email but improves deliverability from ~50% (unsigned) to 99%+ (properly authenticated).

Bounce/Complaint Feedback Loop Processing

Choice

Dedicated BounceProcessor consuming ISP FBL reports and SMTP bounces

Rationale

ISPs publish feedback loop reports when recipients mark emails as spam. Processing these in near-real-time (within minutes) enables the system to suppress future sends to complaining addresses and quarantine IPs with rising complaint rates. Delayed processing (hours or days) means continued sending to addresses that have complained, further degrading IP reputation.

Webhook Event Delivery

Choice

Push delivery events to sender-configured webhook URLs via dedicated workers

Rationale

Senders need real-time bounce and complaint data to maintain their own recipient lists. Polling an events API requires millions of API calls per day at scale. Webhooks push events as they occur, reducing API load by 100x. WebhookWorker runs independently with its own retry queue — webhook delivery failures do not affect email delivery or bounce processing.
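Webhook payloads are signed with HMAC so tenants can verify their origin. A minimal sketch of signing and verification with Python's standard library follows; the choice of SHA-256, the timestamp prefix, and the function names are assumptions, since the template only says that HMAC signatures are used.

```python
import hashlib
import hmac

def sign_payload(secret: bytes, body: bytes, timestamp: str) -> str:
    """Return a hex HMAC-SHA256 over 'timestamp.body' using the tenant's shared secret."""
    message = timestamp.encode("utf-8") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()

def verify_payload(secret: bytes, body: bytes, timestamp: str, signature: str) -> bool:
    """Recompute the signature on the receiving side and compare in constant time."""
    expected = sign_payload(secret, body, timestamp)
    return hmac.compare_digest(expected, signature)

secret = b"tenant-webhook-secret"  # illustrative value
body = b'{"event_type": "bounced", "recipient": "gone@example.com"}'
signature = sign_payload(secret, body, "1700000000")
print(verify_payload(secret, body, "1700000000", signature))
```

Including the timestamp in the signed message lets the receiver reject old deliveries that are replayed later, provided it also checks that the timestamp is recent.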

IP Warming Schedules

Choice

New IPs start at 1K/day and ramp up over 4-6 weeks per ReputationDB warming schedule

Rationale

A new IP with no sending history triggers ISP spam filters if it sends high volume on day one. Gradual warming (1K/day, doubling weekly) builds reputation incrementally. ReputationDB stores per-IP warming stages and daily limits. BulkSender respects these limits, routing overflow to already-warmed IPs. This enables the system to grow sending capacity by adding new IPs without triggering deliverability issues.
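The doubling schedule is easy to tabulate. The sketch below uses the stated starting point (1K/day) and weekly doubling; the function name and the optional capacity cap are illustrative.

```python
from typing import List, Optional, Tuple

def warming_schedule(
    start_per_day: int = 1_000,
    weeks: int = 6,
    full_capacity: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return (week, daily_limit) pairs, doubling weekly and capping at full_capacity."""
    schedule = []
    limit = start_per_day
    for week in range(1, weeks + 1):
        if full_capacity is not None:
            limit = min(limit, full_capacity)
        schedule.append((week, limit))
        limit *= 2
    return schedule

for week, daily_limit in warming_schedule(weeks=6):
    print(f"week {week}: {daily_limit:,} emails/day")
```

Starting from 1K/day, six weeks of doubling ends at 32K/day, so under this schedule an IP's full capacity is in the tens of thousands of emails per day; a higher per-IP target would need either a longer ramp or a steeper one.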

Scale & Performance

Target RPS: 12K/sec transactional + 165K/sec bulk burst

Latency (p99): <15ms API, <5s transactional delivery

Storage: ~5 TB/year (PostgreSQL + Redis + Kafka)

Availability: 99.9% (IP failover, auto-quarantine, Kafka replay)

Time & Space Complexity
Submit transactional email (POST /api/v1/emails)
Time: O(1) validate + O(T) template render + O(1) Kafka publish (~15ms total)
Space: O(1) per message in TransactionalStream
Notes: T = template variable count. Decoupled from SMTP. 202 Accepted in ~15ms.

Transactional delivery (TransactionalSender)
Time: O(1) suppression check + O(1) DKIM sign + O(1) SMTP send (~56ms total)
Space: O(1) per delivery event in ReputationDB
Notes: 1ms suppression + 5ms signing + 50ms SMTP. Uses dedicated high-reputation IPs.

Bulk delivery (BulkSender)
Time: O(1) suppression + O(1) sign + O(1) reputation lookup + O(1) SMTP (~64ms total)
Space: O(1) per delivery event in ReputationDB
Notes: The transactional stages (56ms) plus an extra 8ms for IP reputation lookup and rate limit check. Per-ISP rate shaping.

Bounce processing (BounceProcessor)
Time: O(1) suppression write + O(1) reputation update (~17ms total)
Space: O(1) per suppression entry, O(1) per reputation update
Notes: 2ms cache write + 15ms DB update. Processes ~1-2% of send volume.
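These per-message latencies, combined with the target rates, imply how many deliveries must be in flight at once. The back-of-the-envelope sketch below applies Little's law (in-flight work = arrival rate x time in system) to the stage latencies given in the notes (1 + 5 + 50 ms for transactional, plus 8 ms for bulk) and the worker counts from the architecture overview. It assumes workers handle many messages concurrently, which the template does not state.

```python
import math

def required_in_flight(rate_per_sec: float, latency_ms: float) -> float:
    """Little's law: average number of messages in flight at a given rate and latency."""
    return rate_per_sec * latency_ms / 1000.0

TRANSACTIONAL_LATENCY_MS = 1 + 5 + 50   # suppression + signing + SMTP
BULK_LATENCY_MS = 1 + 5 + 8 + 50        # adds reputation lookup and rate limit check

transactional = required_in_flight(12_000, TRANSACTIONAL_LATENCY_MS)
bulk = required_in_flight(165_000, BULK_LATENCY_MS)

# Concurrent deliveries each worker must sustain, rounded up.
per_transactional_worker = math.ceil(transactional / 20)
per_bulk_worker = math.ceil(bulk / 100)

print(f"transactional: {transactional:.0f} in flight, {per_transactional_worker} per worker")
print(f"bulk: {bulk:.0f} in flight, {per_bulk_worker} per worker")
```

At the stated targets this comes to 672 concurrent transactional deliveries and 10,560 concurrent bulk deliveries, so each worker has to multiplex dozens to roughly a hundred SMTP sessions; a worker that sends one message at a time would fall far short of these rates.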
Database Schema (HLD)
email_templates (PostgreSQL)

Email templates with subject, HTML body, and plain text body supporting variable substitution ({{recipient_name}}, {{order_id}}). Tenant-scoped. Read-heavy at email send time. ~100K templates at ~10KB each.

template_id UUID PK
tenant_id UUID FK (owning tenant)
name VARCHAR (human-readable template name)
subject_template TEXT (subject with {{variables}})
body_html TEXT (HTML body with {{variables}})
body_text TEXT (plain text fallback)
created_at TIMESTAMPTZ
updated_at TIMESTAMPTZ

Indexes: idx_templates_tenant ON (tenant_id, name)

~100K templates, ~1GB total. Read-heavy with ~98% cache hit rate when fronted by Redis.
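Variable substitution for the `{{variable}}` placeholders can be sketched with a regular expression. The `render` function and its `escape_html` flag are illustrative; the template does not specify a rendering engine or an escaping policy.

```python
import html
import re
from typing import Mapping

_PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, variables: Mapping[str, object], escape_html: bool = False) -> str:
    """Replace each {{name}} with its value; raise KeyError if a variable is missing."""
    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        value = str(variables[name])
        return html.escape(value) if escape_html else value
    return _PLACEHOLDER.sub(substitute, template)

subject = render(
    "Order {{order_id}} confirmed, {{recipient_name}}",
    {"order_id": 42, "recipient_name": "Ada"},
)
print(subject)
```

Escaping values when rendering `body_html` keeps user-supplied data from injecting markup, while `subject_template` and `body_text` can be rendered without escaping.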

ip_reputation (PostgreSQL)

Per-sending-IP reputation metrics tracking bounce rate, complaint rate, send volume, and warming stage. Updated by BounceProcessor on every bounce and complaint. Read by BulkSender for IP selection. IPs exceeding 5% bounce rate are auto-quarantined.

ip_address INET PK (sending IP)
bounce_rate NUMERIC (30-day rolling, 0.0-1.0)
complaint_rate NUMERIC (30-day rolling, 0.0-1.0)
emails_sent_30d INTEGER (rolling 30-day count)
warming_stage VARCHAR (cold/warming/warm/quarantined)
daily_limit INTEGER (current daily send cap)
updated_at TIMESTAMPTZ

Indexes: idx_rep_warming ON (warming_stage, bounce_rate)

Small table (~100-500 IPs). High write frequency from BounceProcessor. Critical for routing decisions.

delivery_events (PostgreSQL)

Delivery event log for every email: submitted, signed, sent, delivered, bounced, complained, opened, clicked. Written by TransactionalSender and BulkSender. Read by WebhookWorker for event delivery and by the status API for message timeline queries.

event_id UUID PK
message_id UUID (indexed, FK to original message)
event_type VARCHAR (submitted/signed/sent/bounced/complained/delivered/opened/clicked)
recipient VARCHAR (recipient email address)
sending_ip INET (IP used for SMTP delivery)
smtp_response VARCHAR (SMTP code + message)
timestamp TIMESTAMPTZ

Indexes: idx_events_message ON (message_id, timestamp), idx_events_type_time ON (event_type, timestamp)

High write volume (12K+ events/sec). Partitioned by message_id hash across 32 partitions.

suppress:{email_hash} (Redis Bloom Filter)

Bloom filter of 10B suppressed email addresses checked by both TransactionalSender and BulkSender before every SMTP delivery. Addresses added on hard bounces, unsubscribes, and spam complaints by BounceProcessor.

email_hash BLOOM_ENTRY (SHA-256 of lowercase email)

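~12GB corresponds to roughly a 1% false positive rate at 10B entries; a 0.1% rate needs about 18GB (around 14.4 bits per entry). 6-node Redis cluster. No TTL: suppression is permanent unless explicitly removed. A standard Bloom filter cannot delete entries, so removal implies rebuilding the filter or using a variant that supports deletion. Because a false positive would suppress a valid address, a positive hit is best confirmed against an authoritative suppression list.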
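The memory figure can be checked with the standard Bloom filter sizing formulas, m = -n ln(p) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions. The sketch below evaluates them for 10B entries; the function name is illustrative.

```python
import math
from typing import Tuple

def bloom_size(n_items: int, fp_rate: float) -> Tuple[int, int]:
    """Return (bits, hash_count) for an optimally sized Bloom filter."""
    bits = -n_items * math.log(fp_rate) / (math.log(2) ** 2)
    hash_count = max(1, round(bits / n_items * math.log(2)))
    return math.ceil(bits), hash_count

N = 10_000_000_000  # 10B suppressed addresses

for fp_rate in (0.01, 0.001):
    bits, hash_count = bloom_size(N, fp_rate)
    gigabytes = bits / 8 / 1e9
    print(f"p={fp_rate}: {gigabytes:.1f} GB, {hash_count} hash functions")
```

For 10B entries this gives about 12 GB with 7 hash functions at a 1% false positive rate, and about 18 GB with 10 hash functions at 0.1%.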

Solution Comparison
Naive (Synchronous SMTP)
Tier: T1
Latency: 200-500ms per send (SMTP blocking)
Throughput: ~100-500 emails/sec
Cost: $200/month (single DB + 5 pods)
Complexity: Low (4 components, linear flow)
Reliability: 99% (no retry, no redundancy)

Queue-Based Pipeline (Kafka + Workers)
Tier: T2
Latency: <15ms API, <5s delivery (async)
Throughput: 12K/sec sustained, 165K/sec burst
Cost: $2,500/month (Kafka + workers + caches)
Complexity: Medium (10 components, per-stage queues)
Reliability: 99.9% (Kafka replay, per-stage retry)

Multi-Stage Pipeline (IP Reputation + Webhooks)
Tier: T3
Latency: <15ms API, <5s transactional delivery
Throughput: 12K/sec transactional + 165K/sec bulk
Cost: $5,000/month (dual streams, signing, webhooks)
Complexity: High (12+ components, dual delivery paths)
Reliability: 99.9% (IP failover, auto-quarantine)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why is transactional/bulk separation the most critical architectural decision?

A user waiting for a password reset email that never arrives because a marketing campaign poisoned the shared IP is the worst-case failure for an email platform. Transactional emails carry a 99.9% deliverability requirement and go to addresses the user just used (login, purchase, password reset), so their bounce rates stay low. Bulk campaigns have recipient lists of variable quality that may include stale addresses. Sharing IPs means a 3% bulk bounce rate contaminates the reputation of the IPs that transactional email depends on. Separate paths with dedicated IP pools eliminate this risk.

How does per-IP reputation scoring work?

ReputationDB tracks four metrics per sending IP: bounce rate (30-day rolling), complaint rate (30-day rolling), emails sent (30-day count), and warming stage (cold/warming/warm/quarantined). BounceProcessor updates these on every bounce and complaint. BulkSender reads scores before each send to select the healthiest available IP. IPs exceeding 5% bounce rate are automatically set to quarantined status, removing them from the active pool. Once metrics recover (bounce rate drops below 2% over a 7-day window), the IP is restored.
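The quarantine and recovery rules form a small state machine with hysteresis: an IP enters quarantine above 5% and leaves only once the 7-day rate is below 2%. The sketch below models that with in-memory records. The names are hypothetical, and two details are assumptions: a recovered IP returns to the "warm" stage, and "healthiest" means lowest bounce rate.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

QUARANTINE_THRESHOLD = 0.05  # from the text: quarantine above 5% bounce rate
RECOVERY_THRESHOLD = 0.02    # from the text: restore below 2% over a 7-day window

@dataclass
class IpReputation:
    ip_address: str
    bounce_rate: float   # 30-day rolling, 0.0-1.0
    warming_stage: str   # cold / warming / warm / quarantined

def update_stage(ip: IpReputation, bounce_rate_7d: float) -> str:
    """Apply the quarantine and recovery rules, then return the resulting stage."""
    if ip.warming_stage != "quarantined":
        if ip.bounce_rate > QUARANTINE_THRESHOLD:
            ip.warming_stage = "quarantined"
    elif bounce_rate_7d < RECOVERY_THRESHOLD:
        ip.warming_stage = "warm"  # assumption: recovered IPs return as warm
    return ip.warming_stage

def pick_ip(pool: Iterable[IpReputation]) -> Optional[IpReputation]:
    """Choose the active IP with the lowest bounce rate, or None if none are active."""
    active = [ip for ip in pool if ip.warming_stage in ("warming", "warm")]
    return min(active, key=lambda ip: ip.bounce_rate, default=None)

pool = [
    IpReputation("192.0.2.1", 0.060, "warm"),
    IpReputation("192.0.2.2", 0.010, "warm"),
    IpReputation("192.0.2.3", 0.004, "warming"),
]
for ip in pool:
    update_stage(ip, bounce_rate_7d=ip.bounce_rate)
chosen = pick_ip(pool)
print(chosen.ip_address if chosen else "no healthy IP")
```

The gap between the two thresholds keeps an IP hovering near 5% from flapping in and out of the active pool.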

What is DKIM signing and why is it non-negotiable?

DKIM (DomainKeys Identified Mail) adds a cryptographic signature, covering selected headers and a hash of the body, using the sender domain's private key. Receiving ISPs verify this signature against the domain's public key published in DNS. Without DKIM, ISPs cannot verify that the email was actually sent by the claimed domain, so it could be a phishing attempt. Major mailbox providers such as Gmail, Yahoo, and Outlook expect authenticated mail, particularly from high-volume senders, and unsigned emails are likely to be routed to spam or rejected outright. The 5ms signing cost per email is negligible compared to the deliverability improvement.

Why does IP warming take 4-6 weeks?

ISPs assign reputation to sending IPs based on observed behavior over time. A new IP has no history, so ISPs treat it cautiously — rate limiting or spam-filtering its traffic. Gradual volume increase (1K/day week 1, 2K/day week 2, 4K/day week 3, etc.) lets the IP build a positive track record with each ISP. Rushing this process (sending 100K/day from a fresh IP) triggers spam filters because the sudden volume spike matches the pattern of compromised servers used for spam. There is no shortcut — reputation is earned through consistent, low-bounce sending over weeks.

How do webhook retries work without affecting email delivery?

WebhookWorker is architecturally isolated from the delivery pipeline. It reads events from ReputationDB (which TransactionalSender and BulkSender write to) and delivers them to tenant webhook URLs. If a webhook URL is down, events queue in the worker's retry buffer with exponential backoff (5 min, 15 min, 1 hour, max 3 attempts). Failed webhooks are logged but never block email delivery or bounce processing. The worst case is delayed notification — the sender finds out about bounces later, but email delivery is unaffected.
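The retry policy can be sketched as a fixed delay table. The interpretation here is an assumption: one initial attempt followed by up to three retries, waiting 5 minutes, 15 minutes, and 1 hour respectively. `send` and `sleep` are injected so the example runs without a network or real waiting.

```python
import time
from typing import Callable, Optional

RETRY_DELAYS_SEC = (5 * 60, 15 * 60, 60 * 60)  # 5 min, 15 min, 1 hour

def retry_delay(retries_so_far: int) -> Optional[int]:
    """Delay before the next retry, or None once the retry budget is used up."""
    if retries_so_far < 0:
        raise ValueError("retries_so_far must be non-negative")
    if retries_so_far >= len(RETRY_DELAYS_SEC):
        return None
    return RETRY_DELAYS_SEC[retries_so_far]

def deliver_with_retries(
    send: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call send() until it succeeds or retries are exhausted; never raises on failure."""
    retries = 0
    while True:
        if send():
            return True
        delay = retry_delay(retries)
        if delay is None:
            return False  # give up: log the failed webhook, email delivery is unaffected
        sleep(delay)
        retries += 1

waits = []
delivered = deliver_with_retries(send=lambda: False, sleep=waits.append)
print(delivered, waits)
```

A production worker would more likely store the next retry time with the event and let a scheduler pick it up, since blocking a worker for an hour wastes capacity; the delay table stays the same either way.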

How does this compare to Amazon SES's actual architecture?

Amazon does not publish the internals of SES, so this comparison rests on its documented features: dedicated IP addresses and IP pools, DKIM signing, bounce and complaint notifications, and reputation metrics. The key differences from this template are: SES offers per-tenant dedicated IPs (this template uses shared pools), SES runs in multiple AWS Regions (this template is single-region), and SES authenticates API calls through IAM instead of API keys. The core pattern is the same: separate delivery paths, reputation-based IP routing, and feedback loop processing.
