Hard12 componentsInterview: Very High

Ad Click Aggregator — Lambda Architecture (Speed + Batch Layers)

Q: Why not just use stream processing with exactly-once semantics?

Kafka Streams and Flink both offer exactly-once processing guarantees, but these guarantees have caveats: they depend on idempotent producers, transactional consumers, and checkpoint-based recovery. Under extreme conditions (network partitions, consumer group rebalances, clock skew), exactly-once can degrade to at-least-once. For billing, 'usually exactly-once' is insufficient — the batch layer provides a mathematical guarantee by reprocessing the complete immutable event log.

Q: How does the serving layer handle the reconciliation switchover?

The Query Service checks a metadata table that tracks which hourly windows have been reconciled. For reconciled windows, it reads from the Billing Database (exact counts). For un-reconciled windows, it reads from the Real-time Database (approximate counts) and labels the values as 'preliminary.' When a batch job completes, it updates the metadata table, and subsequent queries automatically switch to the reconciled data. The switchover is transparent to dashboard users except for the label change.

Q: What is the typical divergence between speed and batch layer counts?

In normal operation, the speed layer's approximate counts diverge from the batch layer's exact counts by 0.1-0.5%. The main sources of divergence are: (1) late-arriving events that miss the speed layer's grace period, (2) window boundary effects where a click falls in different windows, and (3) duplicate events from producer retries that the speed layer counts twice but the batch layer deduplicates. At 50K clicks/hour per campaign, this translates to 50-250 click difference.

Q: When is lambda architecture overkill?

Lambda architecture is overkill when: (1) billing accuracy is not critical (internal analytics, non-financial metrics), (2) traffic volume is under 10K RPS (stream processing alone handles this), (3) the team cannot staff the operational complexity (12 components, dual pipelines), or (4) the cost of infrastructure exceeds the cost of billing errors. For most ad platforms under $100K/day in ad spend, the stream processing variant with periodic batch audits is sufficient.

Q: How does the batch layer handle schema evolution?

Raw events in Object Storage are stored in a schema-versioned Parquet format. The batch processor includes a schema registry that maps event versions to processing logic. When the event schema changes (e.g., adding a new field), new events use the new schema while old events retain their original schema. The batch processor handles both versions transparently, applying default values for fields added in newer schemas. This forward-compatible approach avoids reprocessing historical data when the schema evolves.

FAANG-scale ad click aggregation using dual processing paths: a real-time speed layer for dashboard freshness and a batch reconciliation layer for billing-grade accuracy. Handles 200K+ RPS peak.

Lambda ArchitectureBatch ProcessingBillingFAANG-Scale

Try in Simulator

Problem Statement

The lambda architecture for ad click aggregation exists because of an irreconcilable tension in distributed systems: you cannot simultaneously optimize for real-time freshness and billing-grade accuracy. The stream processing variant provides near-real-time dashboards but accepts approximate counts — events near window boundaries may be miscounted, late arrivals may be dropped, and consumer rebalances create brief gaps. For campaign monitoring dashboards, these approximations are acceptable. For billing — where every click represents money — they are not.

The lambda architecture resolves this by running two parallel processing paths over the same data. The speed layer (identical to the stream processing variant) produces fast approximate counts for dashboards. The batch layer periodically reprocesses the complete raw event log to produce exact counts for billing. The serving layer merges results from both paths, using the speed layer for recent data and the batch layer for reconciled historical data.

This is the architecture used by the largest ad platforms in the world. Google's ad billing system uses a similar dual-path approach with Millwheel (speed) and Flume (batch). Meta's ad analytics pipeline uses streaming aggregation for real-time dashboards and periodic MapReduce jobs for billing reconciliation. The pattern appears at FAANG scale because the cost of billing errors at millions of dollars per day in ad spend makes the operational complexity of dual processing paths worthwhile.

The key design challenge is managing the divergence between the speed and batch layers. During the window between batch reconciliation runs (typically hourly), the speed layer's approximate counts and the batch layer's exact counts may differ by 0.1-2%. The serving layer must handle this gracefully — typically by showing the speed layer's count with a freshness indicator and updating to the batch layer's count when reconciliation completes. This reconciliation dance is a rich source of interview discussion: How do you handle the switchover? What if the batch layer finds more clicks than the speed layer? How do you communicate billing corrections to advertisers?

This template is the most complex variant (12 components) and targets the FAANG-level interview discussion. Candidates are expected to articulate the dual-path rationale, explain the reconciliation strategy, reason about the cost of operating two full processing pipelines, and identify when the added complexity is justified versus when the simpler stream processing approach is sufficient.

Architecture Overview

The lambda architecture ad click aggregator uses 12 components organized into three layers: an ingestion layer, parallel speed and batch processing layers, and a unified serving layer.

The ingestion layer is shared between both processing paths. Clicks arrive at an API Gateway (handling authentication, rate limiting at 200K RPS, and request validation), pass through a Load Balancer to the Click Ingestion Service, and are written to Kafka. Critically, the ingestion service also writes raw events to Object Storage (S3-compatible) in partitioned Parquet format — this is the batch layer's input. The dual write (Kafka + Object Storage) happens asynchronously: the Kafka produce is the primary path (ack in 3-5ms), and the Object Storage write is fire-and-forget with a local buffer for batching.

The speed layer consumes events from Kafka via a real-time stream processor (identical to the Stream Processing variant). It performs 1-minute tumbling window aggregation and writes results to a Real-time Database optimized for recent queries (last 1-6 hours). The speed layer prioritizes freshness over accuracy — it counts all events it sees within the window, accepts the small error from window boundary effects, and does not handle late arrivals beyond a 30-second grace period. Dashboard queries for recent data hit this layer.

The batch layer runs periodically (hourly by default). A Batch Processor reads the raw events from Object Storage, performs a full MapReduce-style aggregation over the complete hour's data, deduplicates using exact click_id sets (no Bloom filter approximation), and writes reconciled totals to a Billing Database (durable, ACID-compliant store). The batch layer has perfect accuracy because it processes every event exactly once from the immutable event log. The trade-off is latency — results are only available after the batch job completes (10-30 minutes after the hour ends).

The Query Service acts as the serving layer, merging results from both paths. For recent data (last 1-2 hours), it reads from the Real-time Database (speed layer output). For historical data (older than 2 hours), it reads from the Billing Database (batch layer output). For the overlap period (last 1-2 hours where both layers have data), it applies a reconciliation strategy: use the speed layer's count as the displayed value with a 'preliminary' flag, and switch to the batch layer's count when the reconciliation job completes. Redis provides a cache layer for the most frequently accessed campaign dashboards.

The 12-component architecture provides maximum resilience. If the speed layer fails, dashboards degrade to hourly-refresh (batch data only). If the batch layer fails, dashboards continue with approximate real-time data. If Kafka fails, the Object Storage writer continues (events buffered locally), ensuring the batch layer still receives all data for reconciliation. The system handles 200K+ RPS at peak with horizontal scaling at every layer: Kafka partitions, consumer instances, Object Storage throughput, and batch processor parallelism.

The primary trade-off is operational complexity. Twelve components means twelve things to monitor, scale, and debug. Two processing pipelines double the compute cost. The reconciliation logic adds application complexity. This cost is justified when billing accuracy is worth more than the infrastructure overhead — typically at $1M+ per day in ad spend.

Architecture Preview

Loading architecture preview...

Open in Simulator

Dual-Path Processing Flow

The lambda architecture processes every click event through two independent pipelines simultaneously. The speed layer (Kafka → Stream Processor → Real-time DB) prioritizes freshness — dashboards update within 30-90 seconds. The batch layer (Object Storage → Batch Processor → Billing DB) prioritizes accuracy — exact deduplication and reconciled totals for billing.

The serving layer merges results from both paths: recent data comes from the speed layer (labeled 'preliminary'), while historical data comes from the batch layer (labeled 'reconciled'). The switchover happens transparently as batch jobs complete — the dashboard user sees the label change from 'preliminary' to 'reconciled' without any manual intervention.

Loading diagram...

Step-by-Step Walkthrough

1Browser sends POST /click — the API Gateway validates the request, applies rate limiting (200K RPS cap), and authenticates the API key (~3ms overhead for security at scale)
2Ingestion Service performs an async dual write: (a) produce to Kafka (~3ms, primary path) and (b) buffer + flush to Object Storage (Parquet format, async fire-and-forget with local buffer)
3Ingestion Service returns HTTP 202 Accepted immediately after the Kafka ack — the browser's round-trip latency is ~8ms regardless of downstream processing load
4SPEED LAYER: Stream Processor consumes from Kafka, accumulates events in 1-minute tumbling windows, and flushes aggregates to the Real-time DB — dashboards reflect clicks within 30-90 seconds
5BATCH LAYER: Every hour, the Batch Processor reads the complete hour's raw events from Object Storage (Parquet), performs exact deduplication using a full click_id set (no approximation), and writes reconciled totals to the Billing DB
6The batch job updates the reconciliation_meta table when complete, signaling that exact counts are now available for that hour
7SERVING LAYER: The Query Service checks reconciliation_meta to decide which database to read. For un-reconciled hours (last 1-2 hours), it reads from the Real-time DB and labels values as 'preliminary'. For reconciled hours, it reads from the Billing DB and labels values as 'reconciled'
8Dashboard users see values update in near-real-time with a subtle 'preliminary' badge that disappears when the batch reconciliation catches up — the switchover is automatic and transparent

Pseudocode

// INGESTION — Async dual write (Kafka + Object Storage)
async function handleClick(campaign_id, ad_id, user_id):
    event = { campaign_id, ad_id, user_id, click_id: uuid(), event_time: now() }

    // Primary path: Kafka produce (fast, durable)
    await kafka.produce("ad-clicks", key=campaign_id, value=event)  // ~3ms

    // Secondary path: buffer for Object Storage (async, fire-and-forget)
    parquet_buffer.append(event)      // Local buffer, no network call
    if parquet_buffer.size >= 10_000 or parquet_buffer.age >= 30s:
        flush_to_s3(parquet_buffer)   // Background thread, non-blocking

    return 202  // ~8ms total (gateway + Kafka ack)

// SPEED LAYER — Real-time stream processing (identical to V1)
function onWindowClose(window_start, window_end, accumulated_events):
    for campaign_id, events in group_by(accumulated_events, "campaign_id"):
        await realtime_db.upsert("realtime_aggregates", {
            campaign_id, window_start,
            click_count: events.length,
            unique_users: distinct(events, "user_id").size,
            status: "preliminary"
        })

// BATCH LAYER — Hourly reconciliation (exact dedup)
function reconcileHour(hour_bucket):
    // Read all raw events for this hour from Object Storage
    events = s3.readParquet("clicks/year=.../month=.../day=.../hour=...")

    // Exact dedup — no Bloom filter, no approximation
    seen_click_ids = new Set()
    deduped_events = []
    for event in events:
        if event.click_id not in seen_click_ids:
            seen_click_ids.add(event.click_id)
            deduped_events.append(event)

    // Write reconciled totals to Billing DB
    for campaign_id, events in group_by(deduped_events, "campaign_id"):
        await billing_db.upsert("billing_aggregates", {
            campaign_id, hour_bucket,
            exact_click_count: events.length,
            deduped_count: original_count - events.length,
            billable_amount: events.length * campaign.cpc_rate,
            status: "reconciled"
        })

    // Signal that this hour is now reconciled
    await meta_db.update("reconciliation_meta",
        { hour_bucket, batch_reconciled: true, reconciled_at: now() })

// SERVING LAYER — Merge results from both paths
async function getDashboard(campaign_id, time_range):
    results = []
    for hour in time_range:
        meta = await meta_db.get("reconciliation_meta", hour)
        if meta.batch_reconciled:
            row = await billing_db.get(campaign_id, hour)
            results.append({ ...row, freshness: "reconciled" })
        else:
            row = await realtime_db.get(campaign_id, hour)
            results.append({ ...row, freshness: "preliminary" })
    return results

Storage Architecture

The lambda architecture uses four storage tiers, each optimized for a different purpose. Object Storage is the immutable archive — the raw truth that can always be reprocessed. The Real-time DB serves dashboards with fast, approximate data. The Billing DB serves financial queries with exact, reconciled data. The reconciliation_meta table is the coordination point that tells the serving layer which source to trust for each time window.

This multi-tier design means that losing any single database is recoverable. If the Real-time DB crashes, dashboards degrade to hourly-refresh from the Billing DB. If the Billing DB crashes, it can be rebuilt from Object Storage by replaying the batch job. Object Storage itself is replicated across availability zones — it is the system's ultimate safety net.

Loading diagram...

Step-by-Step Walkthrough

1Object Storage (S3/Parquet) is the immutable archive — every raw click event is stored in columnar Parquet format, partitioned by year/month/day/hour. Parquet enables efficient columnar reads (only scan the columns needed for aggregation), reducing I/O by 5-10x compared to JSON
2The Ingestion Service writes to Object Storage asynchronously via a local buffer that flushes Parquet files every 30 seconds — this ensures the batch layer receives all events even if Kafka has an outage
3The Real-time DB stores the speed layer's approximate aggregates. Each row represents one campaign's click count for one 1-minute window, labeled with status='preliminary'. Dashboard queries for recent data (last 1-6 hours) read from here
4The Billing DB stores the batch layer's exact, deduped aggregates. Each row represents one campaign's reconciled click count for one hour. The billable_amount field is computed from the exact count × the advertiser's CPC rate — this is what invoices are generated from
5reconciliation_meta is the coordination table that bridges both paths. When a batch job completes, it sets batch_reconciled=true for that hour. The Query Service checks this table to decide whether to serve 'preliminary' (speed layer) or 'reconciled' (batch layer) data
6The data flows from left to right: Object Storage → both processing paths → both databases. The serving layer reads from right to left: check reconciliation_meta → read from Billing DB if reconciled, else Real-time DB

Pseudocode

// TIER 1: Object Storage (immutable archive)
// Write pattern: append-only Parquet files, partitioned by time
s3.putObject("clicks/year=2025/month=01/day=15/hour=12/batch_001.parquet",
    parquet_encode([
        { campaign_id, click_id, ad_id, event_time, schema_version: 2 },
        ...  // ~10K events per file, flushed every 30 seconds
    ])
)
// Storage cost: ~$0.023/GB/month. At 50K clicks/sec: ~2 TB/year raw
// Parquet compression: 5-10x smaller than JSON

// TIER 2: Real-time DB (speed layer output, approximate)
// Write pattern: batch upsert every 60 seconds (1 row per campaign per window)
INSERT INTO realtime_aggregates (campaign_id, window_start, click_count, status)
VALUES ('camp_123', '2025-01-15 12:00', 847, 'preliminary')
ON CONFLICT (campaign_id, window_start)
DO UPDATE SET click_count = excluded.click_count;

// TIER 3: Billing DB (batch layer output, exact)
// Write pattern: batch upsert every hour (1 row per campaign per hour)
INSERT INTO billing_aggregates
    (campaign_id, hour_bucket, exact_click_count, deduped_count, billable_amount, status)
VALUES ('camp_123', '2025-01-15 12:00', 50847, 312, 50847 * 0.03, 'reconciled');
// 312 duplicates removed out of 51159 raw events (0.6% dupe rate)

// TIER 4: Reconciliation meta (coordination)
UPDATE reconciliation_meta
SET batch_reconciled = true, reconciled_at = now()
WHERE hour_bucket = '2025-01-15 12:00';
// After this UPDATE, the Query Service reads from Billing DB for this hour

Key Design Decisions

Dual Processing Paths

Choice

Parallel speed layer (stream) + batch layer (MapReduce)

Rationale

The speed layer provides real-time dashboard freshness (30-90 second lag) but with approximate counts. The batch layer provides exact billing-grade counts but with hourly latency. Neither alone satisfies both requirements — real-time freshness for advertiser dashboards AND exact accuracy for billing. The dual-path approach is the standard FAANG solution, trading infrastructure cost for the ability to serve both use cases from a single data stream.

Object Storage for Raw Events

Choice

S3-compatible storage in partitioned Parquet format

Rationale

Object storage provides durable, immutable, and infinitely scalable archival of raw events. Parquet format enables efficient columnar reads for batch processing. Unlike Kafka (which has finite retention), Object Storage retains events indefinitely for audit, replay, and regulatory compliance. The batch processor reads directly from Object Storage without putting load on Kafka.

Hourly Batch Reconciliation

Choice

Full reprocessing every hour with exact deduplication

Rationale

Hourly batches balance accuracy latency with compute cost. More frequent batches (every 5 minutes) would reduce the reconciliation window but increase compute cost linearly. Less frequent batches (daily) would reduce cost but leave billing data stale for too long. The hour boundary aligns with typical advertiser reporting granularity and billing cycle expectations.

Separate Databases for Speed and Batch

Choice

Time-series DB for speed layer, ACID DB for batch layer

Rationale

The speed layer writes frequently (every minute) with high throughput but tolerates approximate data — a time-series database optimized for recent writes and time-range queries is ideal. The batch layer writes infrequently (hourly) but requires ACID guarantees for billing accuracy — a traditional relational or strongly-consistent database is appropriate. Merging both into one database would force trade-offs in either direction.

API Gateway at the Edge

Choice

Dedicated API Gateway with rate limiting and auth

Rationale

At 200K+ RPS, the ingestion endpoint is an attractive target for DDoS and click fraud. An API Gateway provides rate limiting per API key, request validation, and authentication without burdening the ingestion service. The 3ms overhead is negligible compared to the protection it provides. The Gateway also enables A/B testing of ingestion logic by routing traffic percentages to different service versions.

Scale & Performance

Target RPS

200K+ peak (FAANG-scale)

Latency (p99)

30-90s dashboard, hourly billing reconciliation

Storage

~2 TB/year (raw events + aggregates)

Availability

99.99% (dual-path redundancy)

Time & Space Complexity

Operation	Time	Space	Notes
Click ingestion (Kafka + S3 dual write)	O(1) — one Kafka produce (~3ms) + async S3 buffer append	O(1) per event — one Kafka message + one Parquet row (buffered locally)	The S3 write is amortized: events are buffered locally and flushed in 10K-event batches every 30 seconds. Per-event cost is negligible.
Speed layer aggregation (tumbling window)	O(1) amortized per event — hash map increment; O(C) per window flush	O(C × U) — campaigns × unique users in the current window	Identical to V1 stream processing. Window flush produces one row per campaign per minute.
Batch reconciliation (hourly MapReduce)	O(N log N) — sort + group by campaign_id over N events in the hour; dedup via hash set	O(N) — must hold all unique click_ids in memory for exact deduplication	At 200K RPS, one hour = ~720M events. Full dedup set requires ~23 GB memory. Parallelized across campaign_id partitions to reduce per-worker memory.
Dashboard query (serving layer merge)	O(1) for cache hit; O(log N) for DB read + O(1) reconciliation_meta check	O(W) — where W is the number of time windows in the query range	The serving layer checks reconciliation_meta to decide which database to query — adds ~2ms overhead but ensures correct freshness labels.

Database Schema (HLD)

object_storage (S3 / Parquet)

Immutable archive of every raw click event, stored as columnar Parquet files partitioned by year/month/day/hour. This is the batch layer's input — the source of truth for billing reconciliation. Parquet format enables efficient columnar reads (only scan columns needed for aggregation). Unlike Kafka (finite retention), Object Storage retains events indefinitely for audit, replay, and regulatory compliance. Write path: the ingestion service buffers events locally and flushes Parquet files every 30 seconds.

campaign_id TEXTclick_id TEXTad_id TEXTuser_id TEXTpublisher_id TEXTevent_time TIMESTAMPschema_version INT

Partition: year/month/day/hour (Hive-style partitioning)

Storage cost: ~$0.023/GB/month (S3 Standard). At 50K clicks/sec, ~2 TB/year of raw events. Parquet compression reduces this by 5-10x vs JSON.

realtime_db (Speed Layer)

Time-series optimized store for the speed layer's approximate aggregates. Written every minute by the stream processor with tumbling window results. Dashboard queries for recent data (last 1-6 hours) read from this table. Values are labeled 'preliminary' until the batch layer reconciles them. Optimized for fast time-range scans with columnar indexes.

campaign_id TEXT PKwindow_start TIMESTAMPTZ PKclick_count BIGINTunique_users BIGINTstatus TEXT ('preliminary')

Indexes: PK on (campaign_id, window_start), idx_recent ON (window_start DESC)

Speed layer counts may diverge from batch by 0.1-0.5% due to window boundaries, late arrivals, and duplicate events not yet deduped.

billing_db (Batch Layer)

ACID-compliant store for exact billing-grade aggregates. Written hourly by the batch processor after full deduplication using exact click_id sets (no Bloom filter approximation). Dashboard queries for historical data (older than 2 hours) read from this table. Each row includes the exact deduplicated count and the billable amount computed from the advertiser's CPC rate.

campaign_id TEXT PKhour_bucket TIMESTAMPTZ PKexact_click_count BIGINTdeduped_count BIGINTbillable_amount DECIMALstatus TEXT ('reconciled')reconciled_at TIMESTAMPTZ

Indexes: PK on (campaign_id, hour_bucket), idx_billing_status ON (status, reconciled_at)

Batch job processes ~180M events/hour at peak. Full dedup via exact click_id set requires ~5.7 GB memory for 180M unique IDs.

reconciliation_meta

Tracks which hourly windows have been reconciled by the batch layer. The Query Service checks this table to decide whether to serve data from the speed layer (preliminary) or batch layer (reconciled). When a batch job completes, it updates this table, and subsequent queries automatically switch to reconciled data. This is the key to the lambda architecture's serving layer merge logic.

hour_bucket TIMESTAMPTZ PKspeed_available BOOLEANbatch_reconciled BOOLEANreconciled_at TIMESTAMPTZ

Switchover is transparent to dashboard users except for a label change from 'preliminary' to 'reconciled'.

Event Contracts

ClickEventad-clicks

Raw ad click events produced by the Ingestion Service. Dual-written to both Kafka (speed layer input) and Object Storage (batch layer input). Partitioned by campaign_id.

Key Schema

campaign_id (STRING)

Value Schema

{ click_id: STRING, campaign_id: STRING, ad_id: STRING, user_id: STRING, publisher_id: STRING, event_time: TIMESTAMP, schema_version: INT }

SpeedLayerAggregatead-click-speed-aggregates

Preliminary aggregated results from the speed layer's 1-minute tumbling windows. Used for real-time dashboard updates. Labeled as preliminary until the batch layer reconciles.

Key Schema

campaign_id (STRING)

Value Schema

{ campaign_id: STRING, window_start: TIMESTAMP, window_end: TIMESTAMP, click_count: BIGINT, unique_users: BIGINT, status: STRING }

BatchReconciliationCompletead-click-reconciliation

Emitted when the hourly batch processor completes reconciliation for a time window. Signals the serving layer to switch from speed layer data to exact batch data for the reconciled hour.

Key Schema

hour_bucket (STRING)

Value Schema

{ hour_bucket: TIMESTAMP, campaigns_reconciled: INT, total_clicks: BIGINT, total_deduped: BIGINT, reconciled_at: TIMESTAMP }

What-If Scenarios

Batch reconciliation job fails mid-way through hourly processing

Impact

The reconciliation_meta table is not updated for that hour. The serving layer continues to serve speed layer data (preliminary) for the affected hour. Billing data for that hour is delayed. If the failure is not detected, the next batch job may skip the failed hour, leaving a permanent gap in billing-grade data.

Mitigation

Implement checkpoint-based recovery: the batch job writes progress checkpoints to DynamoDB. On retry, it resumes from the last checkpoint rather than restarting. Alert on batch job failure with PagerDuty integration. Idempotent writes to the Billing DB ensure retries do not double-count.

Speed and batch layers diverge by >2% for a high-spend campaign

Impact

The advertiser's real-time dashboard shows a click count that differs significantly from their billing invoice. This creates confusion, advertiser complaints, and potential contract disputes. At $0.50 CPC with 1M daily clicks, a 2% divergence means $10K billing discrepancy per day.

Mitigation

Automated divergence monitoring: compare speed and batch layer counts per campaign after each reconciliation. Alert if divergence >1%. Investigate root cause (late arrivals, consumer rebalance, duplicate events). The serving layer displays a disclaimer during the preliminary window.

Object Storage (S3) becomes temporarily unavailable during peak traffic

Impact

The Ingestion Service's local Parquet buffer fills up. If S3 is down for >5 minutes at 200K RPS, the local buffer (~60M events, ~20 GB) may exhaust disk. The speed layer is unaffected (Kafka is still operating). The batch layer misses events for the duration of the S3 outage, causing the hourly reconciliation to be incomplete.

Mitigation

Configure the Ingestion Service with a large local buffer (100 GB SSD). Implement S3 write retries with exponential backoff. If the buffer fills, fall back to writing to a secondary S3 bucket or EBS volume. Post-recovery, replay missed events from Kafka (7-day retention) to backfill S3.

API Gateway rate limiting triggers during a legitimate traffic spike (Super Bowl ad campaign)

Impact

Legitimate click events above the 200K RPS rate limit are rejected with 429 errors. Rejected clicks are permanently lost — they were never written to Kafka or S3. The advertiser sees fewer clicks than expected, leading to under-billing and revenue loss for the ad platform.

Mitigation

Implement adaptive rate limiting that auto-scales the limit based on authenticated API key tier. Premium advertisers get higher limits. Add a spillover queue (SQS) for rate-limited requests that can be processed after the spike subsides. Pre-provision capacity for known large events.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
Kafka cluster	Full cluster outage (all brokers down)	Speed layer ingestion stops completely. Click events queue in the Ingestion Service's memory buffer (limited). The S3 writer may still operate via local buffering, preserving data for the batch layer. Dashboards go stale as no new speed layer data arrives.	Multi-AZ MSK deployment with min.insync.replicas=2. If full outage occurs, the Ingestion Service switches to S3-only mode (batch layer continues, speed layer recovers on Kafka restart). Kafka events are replayed from S3 post-recovery.
Batch Processor	OOM during large-hour deduplication	The batch job crashes when the click_id dedup set exceeds available memory. The affected hour is not reconciled — the serving layer continues showing preliminary data. Billing reports for that hour are delayed until the job is fixed and retried.	Partition the dedup job by campaign_id so each worker handles a subset. Use external sort on S3 for campaigns with >10M clicks/hour. Size batch processor instances at 2x expected peak memory (r7g.2xlarge with 64 GB for 720M events/hour).
Real-time Database (speed layer)	Primary failover during window flush	The stream processor's batch INSERT fails. In-memory window state is retained and retried on the next flush cycle. Dashboard queries return stale data (from the last successful flush) during failover. No data loss since events remain in Kafka.	RDS Multi-AZ with 30-second automatic failover. Stream processor retries flush with exponential backoff. Query Service serves cached data from Redis during DB failover.
API Gateway	Misconfigured rate limit blocks all traffic	100% of click events rejected with 429 errors. Both speed and batch layers receive zero events. All clicks during the misconfiguration window are permanently lost. Dashboards go stale; billing data has a gap.	Canary deployment for rate limit configuration changes — apply to 5% of traffic for 10 minutes before full rollout. Automated rollback if error rate exceeds 5%. Store rate limit configs in versioned S3 with one-click revert.

Scaling Strategy

Ingestion: auto-scale 8 → 32 pods on CPU >60%. Kafka: 5 → 10 brokers with partition count increase (32 → 128). Stream Processor: match pod count to Kafka partition count for maximum parallelism. Batch Processor: scale horizontally via EMR auto-scaling (add worker nodes for larger hours). S3: infinitely scalable with no provisioning. API Gateway: auto-scales to any throughput. The ceiling is approximately 500K RPS with 128 Kafka partitions and proportional consumer scaling. Beyond this, consider multi-region deployment with regional Kafka clusters and cross-region batch reconciliation.

Monitoring & Alerting

Key metrics: (1) Speed-to-batch divergence per campaign — alert if >1% after reconciliation. (2) Batch job duration — alert if hourly job takes >30 minutes (normal is 10-20 minutes). (3) Batch job completion — alert if any hour is un-reconciled for >2 hours. (4) Kafka consumer lag (speed layer) — alert if >10K messages for 5 minutes. (5) S3 write buffer size — alert if local buffer exceeds 50 GB. (6) API Gateway 429 rate — alert if >0.1% of requests are rate-limited. (7) Reconciliation_meta freshness — alert if last reconciled hour is >90 minutes behind current time. Dashboard: Grafana with speed vs batch comparison charts per campaign, batch job progress timeline, Kafka lag heatmap, S3 buffer gauge, and rate limit rejection rate. SLIs: click ingestion p99 <10ms, batch reconciliation within 30 minutes of hour end, speed-batch divergence <0.5%.

Cost Analysis

At ~200K RPS target: Amazon MSK 5-broker kafka.m7g.xlarge ($1,350/month), S3 Object Storage ~2 TB/year ($50/month), Batch Processor EC2 r7g.2xlarge spot instances ($320/month), Real-time DB RDS db.r7g.xlarge ($370/month), Billing DB RDS db.r7g.large ($185/month), ElastiCache Redis ($150/month), ECS Fargate Ingestion 8 pods ($520/month), Stream Processor 32 pods ($1,040/month), Query Service 6 pods ($390/month), API Gateway ($200/month), ALB ($25/month). Total: ~$4,600/month. Cost per million clicks: ~$0.009. The dual-pipeline overhead (batch processor + billing DB) adds ~$505/month over V1, but provides billing-grade accuracy worth millions in prevented billing disputes.

Security Considerations

API Gateway: OAuth 2.0 API key authentication with per-advertiser rate limiting. Request validation rejects malformed payloads before they enter the pipeline. Kafka: SASL/SCRAM + TLS; topic-level ACLs restrict which services can produce/consume. S3: server-side encryption (SSE-S3), bucket policies restrict access to the batch processor IAM role only. Click fraud: the batch layer's exact dedup catches all duplicate click_ids, preventing billing fraud from replay attacks. IP hashing: SHA-256 applied at the Ingestion Service before events enter Kafka or S3. GDPR: S3 lifecycle policy deletes raw events after 1 year; billing data retained for 7 years per tax compliance. SOC 2 audit trail: the immutable S3 archive provides tamper-proof evidence.

Deployment Strategy

Canary deployment for the Ingestion Service and API Gateway — route 5% of traffic to the new version for 30 minutes, monitor error rate and latency, then shift to 100%. Stream Processor uses rolling Kafka consumer group deployment. Batch Processor: deploy new version and run the next hourly job with both old and new versions in parallel — compare outputs before switching. Database migrations use online DDL (pg_osc or pt-online-schema-change) to avoid downtime. Rollback: ALB target group switch for Ingestion, consumer group rollback for Stream Processor, batch job version revert in Step Functions.

Real-World Examples

•Google Ads uses Millwheel (speed layer) + Flume/MapReduce (batch layer) for the same dual-path pattern — real-time dashboards with hourly billing reconciliation
•Meta Ads Manager runs Kafka-based streaming aggregation for real-time metrics and periodic Spark jobs for billing reconciliation across billions of daily ad events
•Amazon Advertising uses Kinesis (speed) + EMR/Spark (batch) for ad click processing with S3 as the immutable archive — directly analogous to this lambda architecture
•LinkedIn Marketing Solutions uses a Kafka + Samza speed layer with Hadoop batch reconciliation for campaign analytics and billing accuracy

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Single Service + SQL)	T1	~25ms click ingestion	~500 RPS	$385/month	Low (4 components)	99% (single DB)
V1: Stream Processing (Kafka + Windowed)	T2	<5ms click ack	~50K RPS	$2,200/month	Medium (8 components)	99.9% (multi-AZ)
V2: Lambda Architecture (Speed + Batch)	T3	<8ms click ack	~200K RPS	$5,800/month	High (12 components)	99.99% (dual-path)
V3: CQRS + Event Sourcing	T4	5-8ms write	~100K RPS	$4,500/month	Very High (10 components)	99.99% (independent paths)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why not just use stream processing with exactly-once semantics?

Kafka Streams and Flink both offer exactly-once processing guarantees, but these guarantees have caveats: they depend on idempotent producers, transactional consumers, and checkpoint-based recovery. Under extreme conditions (network partitions, consumer group rebalances, clock skew), exactly-once can degrade to at-least-once. For billing, 'usually exactly-once' is insufficient — the batch layer provides a mathematical guarantee by reprocessing the complete immutable event log.

How does the serving layer handle the reconciliation switchover?

The Query Service checks a metadata table that tracks which hourly windows have been reconciled. For reconciled windows, it reads from the Billing Database (exact counts). For un-reconciled windows, it reads from the Real-time Database (approximate counts) and labels the values as 'preliminary.' When a batch job completes, it updates the metadata table, and subsequent queries automatically switch to the reconciled data. The switchover is transparent to dashboard users except for the label change.

What is the typical divergence between speed and batch layer counts?

In normal operation, the speed layer's approximate counts diverge from the batch layer's exact counts by 0.1-0.5%. The main sources of divergence are: (1) late-arriving events that miss the speed layer's grace period, (2) window boundary effects where a click falls in different windows, and (3) duplicate events from producer retries that the speed layer counts twice but the batch layer deduplicates. At 50K clicks/hour per campaign, this translates to 50-250 click difference.

When is lambda architecture overkill?

Lambda architecture is overkill when: (1) billing accuracy is not critical (internal analytics, non-financial metrics), (2) traffic volume is under 10K RPS (stream processing alone handles this), (3) the team cannot staff the operational complexity (12 components, dual pipelines), or (4) the cost of infrastructure exceeds the cost of billing errors. For most ad platforms under $100K/day in ad spend, the stream processing variant with periodic batch audits is sufficient.

How does the batch layer handle schema evolution?

Raw events in Object Storage are stored in a schema-versioned Parquet format. The batch processor includes a schema registry that maps event versions to processing logic. When the event schema changes (e.g., adding a new field), new events use the new schema while old events retain their original schema. The batch processor handles both versions transparently, applying default values for fields added in newer schemas. This forward-compatible approach avoids reprocessing historical data when the schema evolves.

Related Templates

Ad Click Aggregator — Stream Processing Ad Click Aggregator — CQRS + Event Sourcing Ad Click Aggregator — Naive

Discussion

Ready to design your own Ad Click Aggregator?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator