Hard10 componentsInterview: High

Ad Click Aggregator — CQRS + Event Sourcing

Q: What is the difference between CQRS and Event Sourcing?

CQRS (Command Query Responsibility Segregation) is a pattern that separates the write model from the read model. You can do CQRS without event sourcing — e.g., a write database and a read replica. Event Sourcing is a pattern that stores state changes as immutable events rather than current state. You can do event sourcing without CQRS — e.g., one event log with one projection. In practice, they are often combined because event sourcing naturally produces an append-only write model and derived read models (projections), which is exactly the CQRS separation.

Q: How long does a projection rebuild take?

Projection rebuild time depends on event volume and projection complexity. For a simple counter projection (sum clicks by campaign), the bottleneck is event log read throughput — approximately 100K events/second per consumer, so 100M events takes ~17 minutes. For a complex projection with deduplication (maintaining a set of seen click_ids), memory becomes the bottleneck — at 100M unique click_ids, the set requires ~3.2 GB of memory. Snapshots (periodic checkpoints of projection state) can reduce rebuild time to minutes by starting from the latest snapshot rather than the beginning.

Q: How does event sourcing handle schema changes?

Events in the log are versioned. When the ClickReceived event schema changes (e.g., adding a device_type field), new events use v2 while old events remain v1. Projections include a schema registry that maps event versions to processing logic. The v2 projection handler extracts device_type directly; the v1 handler applies a default or heuristic. This forward-compatible approach means old events never need to be migrated — they are processed as-is by version-aware handlers.

Q: What are the consistency guarantees between command and query sides?

The command side is strongly consistent — once a write is acknowledged, the event is durably persisted and will never be lost. The query side is eventually consistent — projections process events with a lag of 1-30 seconds depending on the projection's processing speed. This means a user might click an ad and not see the count increment for a few seconds on the dashboard. For billing, this lag is irrelevant because billing reports are generated hourly/daily. For dashboards, 1-5 second lag is acceptable.

Q: When is CQRS + Event Sourcing overkill for ad click aggregation?

CQRS + Event Sourcing is overkill when: (1) you do not need full audit trails or billing dispute resolution, (2) your team is not experienced with event-driven architectures (the learning curve is steep), (3) traffic is under 10K RPS (stream processing is simpler and sufficient), or (4) you do not anticipate needing new aggregation dimensions or projection changes over time. The Lambda Architecture variant provides similar billing accuracy with a more familiar batch processing model.

Event-sourced ad click aggregation where every click is an immutable event in a durable log. Separate command and query models enable independent scaling, full auditability, and projection replay.

CQRSEvent SourcingAuditCompliance

Try in Simulator

Problem Statement

CQRS (Command Query Responsibility Segregation) with Event Sourcing represents a fundamentally different approach to ad click aggregation — one that prioritizes auditability, replayability, and schema flexibility over raw throughput. In this architecture, the source of truth is not a database table of current state, but an immutable, append-only log of every event that ever occurred. Current state (campaign click counts, aggregated metrics) is derived from this event log through projections — materialized views that can be rebuilt from scratch at any time.

This approach is particularly compelling for ad click aggregation because of the billing accuracy requirement. When an advertiser disputes a charge ('You billed me for 1 million clicks but I only received 800K'), the system must be able to reconstruct exactly what happened: which clicks were received, when they were processed, how they were aggregated, and whether any were duplicated or missed. With a traditional database, answering these questions requires forensic log analysis. With event sourcing, the answer is definitive: replay the event log through the aggregation logic and produce the exact count.

The CQRS pattern separates the system into a command side (accepting and validating click events) and a query side (serving aggregated metrics). The command side is optimized for write throughput — appending events to a durable log is fast and simple. The query side is optimized for read patterns — projections build exactly the data structures that dashboard queries need. This separation enables independent scaling: add more command processors during traffic spikes, add more query replicas during reporting periods.

Event sourcing also provides a powerful schema evolution story. When you need a new aggregation dimension (e.g., clicks by device type, which was not originally tracked), you add the field to new events and create a new projection that processes both old and new event schemas. The projection replays the entire event log, extracting device information from IP-based heuristics for old events and from the explicit field for new events. No database migration, no ETL pipeline, no data loss — the event log is the single source of truth.

The trade-offs are real: event sourcing introduces eventual consistency between the command and query sides (seconds of lag), projection rebuilds can take hours for large event logs, and the programming model requires understanding event replay semantics (idempotent event handlers, event versioning, snapshot optimization). This template is designed for candidates who want to demonstrate advanced distributed systems knowledge and for teams building billing-critical systems where auditability is a hard requirement.

Architecture Overview

The CQRS + Event Sourcing architecture for ad click aggregation uses 10 components organized into three tiers: the command side (write path), the event store, and the query side (read path with projections).

The command side handles click ingestion. Clicks arrive at a Load Balancer and are routed to the Command Service, which performs validation (required fields, timestamp bounds, API key authentication) and writes a ClickReceived event to the Event Log. The Event Log is an append-only, durable store — conceptually similar to Kafka's compacted topic or AWS EventBridge's event bus, but implemented as a dedicated event store with sequence numbering, timestamp indexing, and partition-by-aggregate support. Each event receives a monotonically increasing sequence number and is immutable once written. The Command Service returns HTTP 202 Accepted as soon as the event is durably persisted (ack in 5-8ms).

The Event Log is the single source of truth for the entire system. Every click event ever received is stored here, indefinitely. Events are partitioned by campaign_id for processing locality and indexed by timestamp for time-range replays. The log supports two read patterns: tail reading (projections consuming new events in real time) and replay reading (rebuilding projections from the beginning or from a snapshot).

The query side consists of two independent projections, each building a different materialized view from the same event stream. The Dashboard Projector consumes events in real time and maintains a time-bucketed click count per campaign in the Dashboard Database (optimized for recent queries, time-series indexes). The Billing Projector consumes the same events but applies strict deduplication (maintaining a set of seen click_ids) and writes reconciled totals to the Billing Database (ACID-compliant, optimized for exact aggregation queries). Both projectors can be rebuilt from scratch by replaying the Event Log from the beginning — this is the key event sourcing guarantee.

A Redis cache sits in front of the Query Service, caching the most frequently accessed campaign dashboards for sub-2ms reads. The Query Service reads from the appropriate database based on the query type: dashboard queries hit the Dashboard Database (or Redis cache), billing queries hit the Billing Database. Both databases are eventually consistent with the Event Log — the Dashboard Projector has a lag of 1-5 seconds, and the Billing Projector has a lag of 5-30 seconds (it does more processing per event for deduplication).

The architecture enables several powerful operational capabilities. Projection rebuild: if you discover a bug in the Dashboard Projector's aggregation logic, you fix the code and replay the Event Log from the beginning — the Dashboard Database is rebuilt with the correct logic, no data loss. Schema evolution: when you add a new aggregation dimension, you create a new projection and replay the log — no ETL, no migration. A/B testing: run two versions of a projection side by side against the same event stream and compare outputs. Audit: any billing dispute can be resolved by replaying the Event Log through the Billing Projector and producing a deterministic, reproducible count.

Scaling is independent per tier. The Command Service scales horizontally via the Load Balancer (more pods for higher write throughput). The Event Log scales via partitioning (more partitions for higher write throughput and parallelism). Each projector scales independently — the Dashboard Projector can have more instances than the Billing Projector because dashboard freshness is more latency-sensitive. The Query Service scales based on dashboard query volume, independent of write traffic.

Architecture Preview

Loading architecture preview...

Open in Simulator

Event Sourcing Flow — Write + Read Paths

Event sourcing inverts the traditional database model: instead of storing current state and losing history, it stores every state change as an immutable event and derives current state through projections. The write path is trivially simple — validate the click and append it to the log. All 'intelligence' lives on the read path, where projections build exactly the data structures that queries need.

The most powerful capability is projection rebuild: if you discover a bug in your aggregation logic, you fix the code and replay the entire event log through the corrected projection. The Dashboard DB is rebuilt with the correct logic — no data migration, no ETL pipeline, no data loss. This is the core event sourcing guarantee that makes it compelling for billing-critical systems.

Loading diagram...

Step-by-Step Walkthrough

1COMMAND SIDE: Browser sends POST /click — the Command Service validates required fields (campaign_id, ad_id, timestamp bounds) but does NOT compute aggregates or check for duplicates. Validation only.
2The Command Service appends a ClickReceived event to the immutable Event Log. Each event receives a monotonically increasing sequence number — this ordering is crucial for deterministic replay
3The Event Log acknowledges the write in ~5ms. The Command Service returns HTTP 202 Accepted — the write path is fast because it does no computation beyond validation
4PROJECTION PROCESSING: Two independent projectors tail-read new events from the Event Log in parallel. They share no state and can be scaled independently
5The Dashboard Projector processes events with minimal overhead — it increments counters in the Dashboard DB with 1-5 second lag. It also pushes hot campaign data to Redis for sub-2ms reads
6The Billing Projector processes events with strict deduplication — it maintains a set of every click_id it has ever seen (the dedup_set table). Only the first occurrence of each click_id is counted. Processing lag is 5-30 seconds due to the dedup lookup overhead
7QUERY SIDE: Dashboard queries read from the Dashboard DB (or Redis cache). Billing queries read from the Billing DB. Both are eventually consistent with the Event Log
8PROJECTION REBUILD: If the Dashboard Projector's logic has a bug, the operator fixes the code and replays the Event Log from sequence #0 (or from the latest snapshot). The Dashboard DB is rebuilt with corrected logic — deterministic, reproducible, lossless

Pseudocode

// COMMAND SIDE — Validate + Append (simple by design)
async function handleClickCommand(campaign_id, ad_id, user_id):
    // Validate only — no aggregation, no dedup, no side effects
    if not campaign_id: throw ValidationError("campaign_id required")
    if not ad_id: throw ValidationError("ad_id required")

    event = ClickReceived {
        aggregate_id:   campaign_id,
        schema_version: 2,
        payload: { campaign_id, ad_id, user_id, click_id: uuid() },
        event_time:     now()
    }

    // Append to immutable log — gets monotonic sequence number
    seq = await event_log.append(event)   // ~5ms, durable
    return 202  // Accepted — not "processed"

// DASHBOARD PROJECTOR — Fast, approximate (1-5s lag)
class DashboardProjector:
    last_seq = 0  // Tracks position in event log

    async function processEvent(event):
        if event.type != "ClickReceived": return

        campaign_id = event.payload.campaign_id
        window = truncate_to_minute(event.event_time)

        // Upsert time-bucketed counter
        await dashboard_db.execute(
            "INSERT INTO dashboard_projection (campaign_id, window_start, click_count)
             VALUES ($1, $2, 1)
             ON CONFLICT (campaign_id, window_start)
             DO UPDATE SET click_count = click_count + 1",
            [campaign_id, window]
        )
        last_seq = event.sequence_number

        // Push to Redis for hot campaigns
        await redis.setex("campaign:" + campaign_id, 60, latest_aggregate)

// BILLING PROJECTOR — Exact dedup, ACID (5-30s lag)
class BillingProjector:
    last_seq = 0

    async function processEvent(event):
        click_id = event.payload.click_id

        // Exact dedup via click_id set (no approximation)
        exists = await dedup_db.exists("dedup_set", click_id)
        if exists: return  // Duplicate — skip silently

        await dedup_db.insert("dedup_set", { click_id, first_seen: now() })

        // Count only first occurrence
        campaign_id = event.payload.campaign_id
        hour = truncate_to_hour(event.event_time)
        await billing_db.execute(
            "INSERT INTO billing_projection (campaign_id, hour_bucket, exact_clicks)
             VALUES ($1, $2, 1)
             ON CONFLICT (campaign_id, hour_bucket)
             DO UPDATE SET exact_clicks = exact_clicks + 1",
            [campaign_id, hour]
        )
        last_seq = event.sequence_number

// PROJECTION REBUILD — The core event sourcing guarantee
async function rebuildProjection(projector, from_seq = 0):
    // Optionally start from latest snapshot
    snapshot = await snapshots_db.getLatest(projector.name)
    if snapshot:
        projector.loadState(snapshot.state_data)
        from_seq = snapshot.snapshot_seq

    // Replay every event from that point forward
    events = event_log.readFrom(from_seq)  // Iterator, not full load
    for event in events:
        projector.processEvent(event)

    // Full projection is now rebuilt with current logic
    log("Rebuilt from seq " + from_seq + " to " + projector.last_seq)

Event Store + Projections Schema

This schema embodies the event sourcing principle: the Event Log is the only table that stores original facts. Everything else — dashboard_projection, billing_projection, dedup_set, projection_snapshots — is derived and can be deleted and rebuilt from the Event Log at any time.

The last_seq_processed field in each projection table is the bookmark that tracks how far the projection has consumed from the Event Log. If a projector restarts, it resumes from its last_seq_processed rather than replaying from the beginning. Snapshots further optimize this by periodically checkpointing the entire projection state, allowing rebuilds to start from a recent snapshot rather than sequence #0.

Loading diagram...

Step-by-Step Walkthrough

1EVENT LOG: The single source of truth. Every ClickReceived event is appended with a monotonically increasing sequence_number and never modified. The aggregate_id (campaign_id) enables partition-level reads. schema_version enables forward-compatible event evolution
2DASHBOARD PROJECTION: A materialized view built by reading the Event Log. Stores time-bucketed click counts optimized for recent time-range queries. The last_seq_processed field is the projector's bookmark — on restart, it resumes from here instead of replaying everything
3BILLING PROJECTION: A materialized view with strict deduplication. The exact_clicks field counts only unique click_ids. The billable_amount field is computed from exact_clicks × CPC rate — this is what generates advertiser invoices
4DEDUP SET: The Billing Projector's deduplication state. Every click_id it has ever seen is stored here. On each new event, the projector checks this set before counting the click. Periodically trimmed: click_ids older than 7 days are removed since duplicates beyond that window are astronomically unlikely
5PROJECTION SNAPSHOTS: Periodic checkpoints of projection state at a given sequence number. Instead of replaying from seq#0 (hours for large logs), a rebuild loads the latest snapshot and replays only events since that point. Created every 10M events by default; last 5 snapshots retained for rollback
6The relationships flow top-down: Event Log → both projections (via tail read). Billing projection → dedup_set (via lookup). Event Log → snapshots (via periodic checkpoint). All arrows lead back to the Event Log as the ultimate authority

Pseudocode

// EVENT LOG — Immutable, append-only (the only "real" table)
// Write: O(1) append, ~5ms per event
// Read: sequential scan (tail) or range scan (replay)
event_log.append({
    sequence_number: next_seq(),           // Monotonic, gap-free
    event_type:      "ClickReceived",
    aggregate_id:    "camp_123",           // Partition key for locality
    schema_version:  2,                    // Forward-compatible evolution
    payload:         { campaign_id, ad_id, user_id, click_id },
    event_time:      "2025-01-15T12:00:03Z",
    stored_at:       now()
})
// NEVER: UPDATE event_log SET ...
// NEVER: DELETE FROM event_log WHERE ...

// DASHBOARD PROJECTION — Derived, rebuildable
// Projector reads from event_log WHERE seq > last_seq_processed
dashboard_projection = {
    campaign_id:       "camp_123",
    window_start:      "2025-01-15 12:00",
    click_count:       847,
    unique_users:      612,
    ctr:               0.034,
    last_seq_processed: 15_847_293         // Bookmark for resume
}

// BILLING PROJECTION — Derived, with exact dedup
// Projector checks dedup_set before counting
if not dedup_set.contains(click_id):
    dedup_set.add(click_id)
    billing_projection[campaign_id][hour].exact_clicks += 1
    billing_projection[campaign_id][hour].billable_amount =
        exact_clicks * cpc_rate

// SNAPSHOT — Periodic checkpoint for fast rebuild
projection_snapshots.save({
    projection_name: "dashboard_v2",
    snapshot_seq:    15_000_000,           // Resume from here, not seq#0
    state_data:      serialize(dashboard_projection),  // ~500 MB
    created_at:      now()
})
// Rebuild time: ~17min from seq#0 vs ~3min from latest snapshot

Key Design Decisions

Event Sourcing (Append-Only Log)

Choice

Immutable event log as single source of truth

Rationale

Traditional CRUD databases store current state — once a row is updated, the previous value is lost. Event sourcing stores every state change as an immutable event, enabling full reconstruction of any historical state. For ad billing, this means any dispute can be resolved by replaying events. The trade-off is storage cost (all events stored forever) and query complexity (current state must be computed from events).

Separate Dashboard and Billing Projections

Choice

Two independent projections from the same event stream

Rationale

Dashboard and billing queries have fundamentally different requirements. Dashboards need freshness (1-5 second lag acceptable) and time-series access patterns. Billing needs accuracy (exact dedup, ACID guarantees) and aggregation access patterns. Building both from the same event stream ensures consistency while allowing each to optimize for its access pattern. If a third access pattern emerges (e.g., fraud detection), you add a third projection without changing the write path.

Command Side Validation Only

Choice

Command Service validates but does not aggregate

Rationale

The command side's only job is to validate click events and append them to the event log. It does not compute aggregates, check deduplication, or query current state. This keeps the write path fast (5-8ms) and simple. All 'intelligence' lives in the projections on the query side, where it can be changed, versioned, and replayed independently of the write path.

Projection Replay Capability

Choice

All projections can be rebuilt from the event log

Rationale

This is the core event sourcing guarantee. If a projection has a bug, you fix the code and rebuild from scratch. If you need a new aggregation dimension, you create a new projection and replay. If a database is corrupted, you rebuild from the event log. The cost is rebuild time — at 100M events, a full replay takes 1-4 hours depending on projection complexity. Snapshots (periodic checkpoints) reduce this to minutes for common cases.

Redis Cache for Hot Queries

Choice

Cache layer between Query Service and databases

Rationale

Even with optimized projections, database queries have 5-20ms latency. Redis reduces this to sub-2ms for the most popular campaigns. The cache is populated by the Dashboard Projector (push-based) rather than by the Query Service (pull-based), ensuring the cache is always warm for active campaigns. Cache invalidation is event-driven — the projector updates the cache as it processes new events.

Scale & Performance

Target RPS

100K+ (write path), unlimited read scaling

Latency (p99)

5-8ms write, 1-5s dashboard lag, sub-2ms cached reads

Storage

~3 TB/year (immutable event log + projections)

Availability

99.99% (independent write/read paths)

Time & Space Complexity

Operation	Time	Space	Notes
Click command (append to Event Log)	O(1) — single append to immutable log	O(1) — one event record per click	5-8ms per write. The command side does zero aggregation — validation only. This is the fastest write path among all variants for its simplicity.
Dashboard projection (real-time tail read)	O(1) per event — hash map increment in the projector's in-memory state	O(C × W) — campaigns × active time windows in the projection	1-5 second lag from Event Log to Dashboard DB. The projector processes events sequentially per partition — throughput limited by single-thread processing speed (~100K events/sec per consumer).
Billing projection (with deduplication)	O(1) amortized per event — hash set lookup for click_id dedup + counter increment	O(D) — where D is the number of unique click_ids in the dedup set (~3.2 GB for 100M IDs)	5-30 second lag due to dedup overhead. The dedup set is the memory bottleneck — periodically trimmed to IDs from the last 7 days.
Projection rebuild (full replay)	O(E) — where E is the total number of events in the log (sequential scan)	O(P) — full projection state must fit in memory during rebuild	At 100M events, full rebuild takes ~17 minutes at 100K events/sec. Snapshots reduce this to ~3 minutes by starting from a recent checkpoint.

Database Schema (HLD)

event_log

The single source of truth for the entire system — an immutable, append-only log of every ClickReceived event. Each event gets a monotonically increasing sequence number and is never modified after write. Partitioned by campaign_id (aggregate_id) for processing locality. Supports two read patterns: tail reading (projections consuming new events in real time) and replay reading (rebuilding projections from seq#0 or from a snapshot). Events are versioned for schema evolution — old events retain their original schema version while new events use the latest.

sequence_number BIGINT PKevent_type TEXT ('ClickReceived')aggregate_id TEXT (campaign_id)schema_version INTpayload JSONB (click data)event_time TIMESTAMPTZstored_at TIMESTAMPTZ

Partition: aggregate_id (campaign_id)

Indexes: PK on sequence_number, idx_aggregate ON (aggregate_id, sequence_number), idx_event_time ON (event_time)

Immutable after write — no UPDATEs or DELETEs ever. At 100K events/sec write throughput, replay speed ~100K events/sec per consumer. Storage grows ~3 TB/year.

dashboard_projection

Materialized view built by the Dashboard Projector from the event log. Stores time-bucketed click counts per campaign, optimized for recent time-range queries (last 1-6 hours). Updated in real time as the projector processes new events (1-5 second lag). Can be rebuilt from scratch by replaying the event log from sequence #0 — this is the core event sourcing guarantee. If the projection logic has a bug, fix the code and rebuild.

campaign_id TEXT PKwindow_start TIMESTAMPTZ PKclick_count BIGINTunique_users BIGINTctr FLOATlast_seq_processed BIGINT

Indexes: PK on (campaign_id, window_start), idx_recent ON (window_start DESC)

Rebuilding from 100M events takes ~17 minutes at 100K events/sec. Snapshots reduce this to minutes for incremental rebuilds.

billing_projection

Materialized view built by the Billing Projector with strict deduplication. Maintains exact click counts per campaign per hour for billing accuracy. The projector tracks every click_id it has ever seen (via the dedup_set table) to prevent double-counting. Processing lag is 5-30 seconds (more work per event than the dashboard projector). Can also be fully rebuilt from the event log.

campaign_id TEXT PKhour_bucket TIMESTAMPTZ PKexact_clicks BIGINTdeduped_clicks BIGINTbillable_amount DECIMALlast_seq_processed BIGINT

Indexes: PK on (campaign_id, hour_bucket)

Billing disputes are resolved by replaying the event log through this projector — deterministic, reproducible count every time.

dedup_set

Tracks every click_id the Billing Projector has processed. Used for exact deduplication — if a click_id appears twice in the event log (from producer retries or network issues), only the first occurrence is counted. Unlike Bloom filters (which have false positives), this set provides exact dedup at the cost of memory. At 100M unique click_ids, the set requires ~3.2 GB of memory.

click_id TEXT PKcampaign_id TEXTfirst_seen TIMESTAMPTZ

Indexes: PK on click_id

Periodically trimmed: click_ids older than 7 days are removed (matching Kafka retention) since duplicates beyond that window are astronomically unlikely.

projection_snapshots

Periodic checkpoints of projection state used to speed up rebuilds. Instead of replaying from sequence #0 (hours for large event logs), a projection can resume from the latest snapshot. Each snapshot captures the full projection state at a specific sequence number. Created automatically every N events or on a schedule. Multiple snapshots are retained for rollback capability.

projection_name TEXT PKsnapshot_seq BIGINT PKstate_data BLOB (serialized projection state)created_at TIMESTAMPTZ

Snapshot size: ~500 MB for the dashboard projection at 100M events. Created every 10M events by default. Retention: last 5 snapshots.

Event Contracts

ClickReceivedclick-events

Immutable event appended to the Event Log when a click passes command-side validation. This is the single source of truth — all projections derive their state from this stream.

Key Schema

aggregate_id / campaign_id (STRING)

Value Schema

{ sequence_number: BIGINT, event_type: STRING, aggregate_id: STRING, schema_version: INT, payload: { click_id: STRING, campaign_id: STRING, ad_id: STRING, user_id: STRING }, event_time: TIMESTAMP }

ProjectionUpdatedprojection-updates

Emitted when a projector updates its materialized view. Used to trigger cache warming in Redis and to notify downstream systems that fresh data is available.

Key Schema

projection_name + campaign_id (STRING)

Value Schema

{ projection_name: STRING, campaign_id: STRING, window_start: TIMESTAMP, click_count: BIGINT, unique_users: BIGINT, last_seq_processed: BIGINT }

ReconciliationCompletebilling-reconciliation

Emitted when the Billing Projector finishes processing all events up to a given sequence number. Signals that billing data is current and accurate through that point.

Key Schema

hour_bucket (STRING)

Value Schema

{ hour_bucket: TIMESTAMP, last_seq_processed: BIGINT, campaigns_processed: INT, total_exact_clicks: BIGINT, total_deduped: BIGINT, completed_at: TIMESTAMP }

What-If Scenarios

Dashboard Projector has a bug that under-counts clicks for campaigns starting with 'camp_A'

Impact

All campaigns matching the pattern show incorrect click counts on dashboards. Advertisers notice discrepancies and file support tickets. The billing projection is unaffected (different code path), so invoices remain correct, but the dashboard data is wrong.

Mitigation

This is where event sourcing shines: fix the bug in the Dashboard Projector code, then replay the entire Event Log through the corrected projector. The Dashboard DB is rebuilt with accurate counts — no data loss, no manual correction, deterministic and reproducible. The replay takes ~17 minutes for 100M events or ~3 minutes from the latest snapshot.

Event Log storage fills up (disk exhaustion on the event store)

Impact

The Command Service cannot append new events — all click ingestion fails with 503 errors. Both projections continue processing their backlog but receive no new events. The entire system effectively halts for new data while existing data remains accessible.

Mitigation

Monitor Event Log disk usage with alerts at 70% capacity. Implement log compaction or archival to S3 for events older than 30 days. Scale storage horizontally by adding partitions across additional disks or nodes. The Event Log should be on provisioned IOPS storage (io2) with auto-expanding volume policy.

Both projectors fall behind the Event Log by >1 hour during a traffic spike

Impact

Dashboard data is over 1 hour stale — advertisers cannot see recent campaign performance. Billing data is similarly delayed, potentially affecting real-time budget enforcement (campaigns may overspend their daily budget before the billing projection catches up).

Mitigation

Auto-scale projector instances based on consumer lag (add instances when lag >5 minutes). For the Dashboard Projector, implement a fast-forward mode that skips detailed per-event processing and reads pre-aggregated batches from the Event Log. For the Billing Projector, increase instance memory to handle larger dedup sets without GC pauses.

Schema version 3 events are published but the Billing Projector only handles v1 and v2

Impact

The Billing Projector encounters unknown schema version events and either crashes (if no error handling) or skips them (if defensive). Skipped events mean under-counted clicks in billing — advertisers are under-billed. If it crashes, the projection falls behind and billing data goes stale.

Mitigation

Schema Registry with forward compatibility enforcement — v3 schema must be registered and compatible before producers can publish v3 events. All projectors include a default handler for unknown schema versions that logs the event and applies a best-effort extraction of known fields. Alert on unknown schema version encounters.

Failure Modes & Resilience

Component	Failure	Impact	Mitigation
Event Log (event store)	Write unavailability	All click ingestion fails — the Command Service cannot append events. Both projections continue processing existing events but receive nothing new. Complete write-path outage until the event store recovers.	Deploy the Event Log on a distributed store (e.g., EventStoreDB cluster, Kafka with replication.factor=3, or Aurora PostgreSQL Multi-AZ). Cross-AZ replication ensures a single node failure does not cause write unavailability.
Dashboard Projector	Crash during event processing	Dashboard data goes stale — the last_seq_processed bookmark stops advancing. The Dashboard DB still serves the last written data but does not update with new clicks. Billing projection is unaffected (independent process).	The projector tracks last_seq_processed in the Dashboard DB. On restart, it resumes from the bookmark — no data loss, just a brief increase in dashboard lag. Auto-restart via Kubernetes liveness probe within 30 seconds.
Billing Projector	OOM from dedup set growth	The Billing Projector crashes when the dedup set exceeds available memory. Billing data goes stale at the point of crash. Dashboard projection continues normally. Billing reports for the affected period are delayed.	Size instances at 2x expected dedup set (r7g.2xlarge with 64 GB for 100M+ unique click_ids). Implement periodic dedup set trimming (remove click_ids older than 7 days). Use RocksDB-backed state store instead of pure in-memory set for larger-than-RAM dedup sets.
Redis Cache	Node failure / split brain	Dashboard queries bypass cache and hit the Dashboard DB directly. Latency increases from ~2ms to 5-20ms. If many campaigns are queried simultaneously, the Dashboard DB connection pool may saturate.	ElastiCache Multi-AZ with automatic failover. The Dashboard Projector repopulates the cache as it processes events — cache warm-up is automatic. The Query Service implements a local in-memory fallback cache (30-second TTL) for the top 100 campaigns.

Scaling Strategy

Command Service: CPU-based auto-scaling from 6 to 24 pods (target 60% utilization). Event Log: scale via partitioning — more partitions enable more parallel projector consumers. Dashboard Projector: one instance per Event Log partition (match partition count). Billing Projector: scale independently based on dedup set size and processing lag. Query Service: request-rate auto-scaling from 4 to 16 pods. Redis: vertical scaling or cluster mode for larger working sets. The write path ceiling is limited by Event Log throughput (~100K RPS per partition with 16 partitions). The read path scales independently and has no practical ceiling since projections can be replicated.

Monitoring & Alerting

Key metrics: (1) Event Log write latency — alert if p99 >20ms (normal is 5-8ms). (2) Dashboard Projector lag (last_seq_processed vs Event Log head) — alert if >10K events behind for 3 minutes. (3) Billing Projector lag — alert if >50K events behind for 5 minutes. (4) Dedup set size — alert if >80% of allocated memory. (5) Projection rebuild duration — track and alert if >30 minutes (snapshot may be stale). (6) Redis cache hit rate — alert if <60%. (7) Event schema version distribution — alert on unknown versions. Dashboard: Grafana with panels for Event Log write rate, projector lag gauges (dashboard vs billing), dedup set memory usage, cache hit rate, and projection rebuild history. SLIs: event append p99 <15ms, dashboard freshness <10 seconds, billing projection lag <60 seconds, projection rebuild <20 minutes.

Cost Analysis

At ~100K RPS target: Event Log store (EventStoreDB or MSK 3-broker kafka.m7g.xlarge, $810/month), Dashboard DB RDS db.r7g.large ($185/month), Billing DB RDS db.r7g.large ($185/month), ElastiCache Redis cache.r7g.large ($150/month), ECS Fargate Command Service 6 pods ($390/month), Dashboard Projector 16 pods ($520/month), Billing Projector 8 pods ($260/month), Query Service 4 pods ($260/month), ALB ($25/month). Total: ~$2,785/month. Cost per million clicks: ~$0.011. The event store and projector pods are the largest cost drivers. Snapshot storage adds ~$15/month (S3). The total is lower than V2 Lambda ($5,800) because there is no batch processor infrastructure, but higher than V1 ($2,200) due to the dual-projection overhead and event store requirements.

Security Considerations

Command-side authentication: API key validated before event append — unauthenticated events never enter the Event Log. Event Log immutability: no UPDATE or DELETE operations permitted at the application or database level (enforce via IAM policies and DB roles). Click fraud: the Billing Projector's exact dedup set prevents all duplicate-based fraud. Audit trail: the immutable Event Log serves as a complete, tamper-proof audit trail for regulatory compliance (SOC 2, ad industry TAG certification). IP hashing: SHA-256 applied in the Command Service before the event is appended — raw IPs are never stored. GDPR right-to-erasure: handled via crypto-shredding — user_id is encrypted with a per-user key; deleting the key renders all that user's events unreadable without modifying the immutable log.

Deployment Strategy

Rolling deployment for the Command Service (stateless, simple pod replacement). Projectors use a blue-green pattern: deploy the new projector version alongside the old one, both reading from the Event Log. Verify the new version produces identical outputs for a 10-minute window, then decommission the old version. Event Log schema changes go through the Schema Registry with compatibility checks. Projection rebuilds are triggered post-deployment if the projection logic changed. Rollback: revert Command Service to previous image; for projectors, revert to previous version and trigger a rebuild from the latest snapshot.

Real-World Examples

•Snap Ads uses an event-sourced architecture for ad engagement tracking where the immutable event log enables full auditability for advertiser billing disputes
•LinkedIn Marketing Solutions uses CQRS-inspired separation between their ad event ingestion (command side) and analytics projections (query side) for independent scaling
•EventStoreDB (the database) was originally built for financial trading systems with the same event sourcing + projection pattern used here for ad billing accuracy
•Greg Young's canonical CQRS/ES examples (inventory management, trading) map directly to ad click aggregation — high write throughput with multiple derived read models

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Single Service + SQL)	T1	~25ms click ingestion	~500 RPS	$385/month	Low (4 components)	99% (single DB)
V1: Stream Processing (Kafka + Windowed)	T2	<5ms click ack	~50K RPS	$2,200/month	Medium (8 components)	99.9% (multi-AZ)
V2: Lambda Architecture (Speed + Batch)	T3	<8ms click ack	~200K RPS	$5,800/month	High (12 components)	99.99% (dual-path)
V3: CQRS + Event Sourcing	T4	5-8ms write	~100K RPS	$4,500/month	Very High (10 components)	99.99% (independent paths)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

What is the difference between CQRS and Event Sourcing?

CQRS (Command Query Responsibility Segregation) is a pattern that separates the write model from the read model. You can do CQRS without event sourcing — e.g., a write database and a read replica. Event Sourcing is a pattern that stores state changes as immutable events rather than current state. You can do event sourcing without CQRS — e.g., one event log with one projection. In practice, they are often combined because event sourcing naturally produces an append-only write model and derived read models (projections), which is exactly the CQRS separation.

How long does a projection rebuild take?

Projection rebuild time depends on event volume and projection complexity. For a simple counter projection (sum clicks by campaign), the bottleneck is event log read throughput — approximately 100K events/second per consumer, so 100M events takes ~17 minutes. For a complex projection with deduplication (maintaining a set of seen click_ids), memory becomes the bottleneck — at 100M unique click_ids, the set requires ~3.2 GB of memory. Snapshots (periodic checkpoints of projection state) can reduce rebuild time to minutes by starting from the latest snapshot rather than the beginning.

How does event sourcing handle schema changes?

Events in the log are versioned. When the ClickReceived event schema changes (e.g., adding a device_type field), new events use v2 while old events remain v1. Projections include a schema registry that maps event versions to processing logic. The v2 projection handler extracts device_type directly; the v1 handler applies a default or heuristic. This forward-compatible approach means old events never need to be migrated — they are processed as-is by version-aware handlers.

What are the consistency guarantees between command and query sides?

The command side is strongly consistent — once a write is acknowledged, the event is durably persisted and will never be lost. The query side is eventually consistent — projections process events with a lag of 1-30 seconds depending on the projection's processing speed. This means a user might click an ad and not see the count increment for a few seconds on the dashboard. For billing, this lag is irrelevant because billing reports are generated hourly/daily. For dashboards, 1-5 second lag is acceptable.

When is CQRS + Event Sourcing overkill for ad click aggregation?

CQRS + Event Sourcing is overkill when: (1) you do not need full audit trails or billing dispute resolution, (2) your team is not experienced with event-driven architectures (the learning curve is steep), (3) traffic is under 10K RPS (stream processing is simpler and sufficient), or (4) you do not anticipate needing new aggregation dimensions or projection changes over time. The Lambda Architecture variant provides similar billing accuracy with a more familiar batch processing model.

Related Templates

Ad Click Aggregator — Lambda Architecture Ad Click Aggregator — Stream Processing Ad Click Aggregator — Naive

Discussion

Ready to design your own Ad Click Aggregator?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator