Vetora logo
Medium4 componentsInterview: Very High

Ad Click Aggregator — Naive (Single Service + SQL)

The simplest ad click aggregation system: one Click Service writing synchronously to PostgreSQL. Demonstrates why async ingestion, caching, and stream processing become essential as click volume grows.

Data PipelineBeginnerBottleneck AnalysisAdTech
Problem Statement

Ad click aggregation is a deceptively simple problem that hides enormous scale challenges. At its core, the requirement is straightforward: when a user clicks an ad on a publisher website, record that click and aggregate counts per campaign so advertisers can see how their ads are performing. The naive approach — a single service writing directly to a SQL database — works fine for a prototype, but it fails spectacularly at production scale for reasons that are deeply instructive.

The fundamental tension in ad click aggregation is between write throughput and query freshness. Advertisers want real-time dashboards showing click counts per campaign, per ad, per hour. Publishers want to know which ads are generating revenue. And the billing system needs exact click counts — not approximate, not eventually consistent, but precisely correct — because every click translates to money changing hands. A single miscounted click across millions of campaigns can result in billing disputes, lost advertiser trust, and regulatory issues.

The naive architecture uses synchronous writes: every click event triggers an INSERT into PostgreSQL. The dashboard reads aggregate counts via GROUP BY queries on the same table. This means every read competes with every write for database resources. At 100 clicks per second, this works fine — PostgreSQL handles the mixed workload comfortably. At 1,000 clicks per second, write latency starts climbing as the WAL (write-ahead log) becomes a bottleneck. At 10,000 clicks per second, the system falls over: the connection pool is exhausted, GROUP BY queries take seconds instead of milliseconds, and the write path starts dropping clicks.

This template exists to make these failure modes visible and measurable. By running the simulation at increasing traffic levels, you can observe exactly when the database becomes the bottleneck, how write amplification from indexes degrades INSERT performance, and why the synchronous write path creates a hard ceiling on throughput. The comparison with stream processing and lambda architecture variants quantifies the dramatic improvement that async ingestion provides.

Ad click aggregation appears frequently in system design interviews at Google, Meta, Amazon, Twitter, and LinkedIn — any company with an advertising platform. Interviewers expect candidates to identify the naive approach's limitations, propose async ingestion via message queues, discuss deduplication strategies for click fraud prevention, and reason about consistency models for billing accuracy. This template provides the concrete simulation data to support those discussions.

Architecture Overview

The naive ad click aggregator is a linear four-component architecture: Client, Load Balancer, Click Service, and PostgreSQL database. There is no message queue, no cache, no stream processor, and no separation between the write path and the read path.

All click events arrive at the Load Balancer, which distributes them across Click Service pods using round-robin. The Load Balancer adds approximately 1.5ms of routing latency and acts as the single entry point. It handles up to 5,000 RPS before becoming a bottleneck itself, but the database will saturate long before the load balancer does.

The Click Service is a stateless API server running multiple pods with configurable thread pools. Each click event triggers two operations: (1) an INSERT into the clicks table with the click metadata (campaign_id, ad_id, user_id, timestamp, publisher_id, ip_address), and (2) an UPDATE to the campaign_aggregates table incrementing the click count for the relevant campaign. Both operations happen synchronously within the same request — the service does not return a 200 OK until both the INSERT and UPDATE have been committed to the database.

This synchronous dual-write pattern is the root cause of the architecture's scaling limitations. Each click event requires two round trips to PostgreSQL: one for the raw event INSERT (~10ms) and one for the aggregate UPDATE (~15ms with row-level locking). At 1,000 RPS, the service spends 25ms per click on database operations alone, requiring at least 25 concurrent threads just to keep up — not counting CPU processing time for request parsing, validation, and response serialization.

PostgreSQL stores two tables. The clicks table receives every raw click event with a B-tree index on (campaign_id, created_at) for time-range queries. As the table grows, INSERT performance degrades because each new row requires updating the index. The campaign_aggregates table maintains running totals with a row-level lock on UPDATE to prevent lost updates from concurrent increments. Under high concurrency, this row-level locking creates contention — multiple threads waiting to increment the same campaign's counter.

Dashboard queries run directly against the same database instance. A typical query aggregates clicks by campaign and hour using GROUP BY with a WHERE clause on the time range. At 1 million rows, this query takes approximately 50ms. At 100 million rows, it takes several seconds — competing with the write path for I/O bandwidth and buffer pool memory. There is no materialized view, no pre-computed aggregate, and no read replica to offload query traffic.

The architecture has zero redundancy at the data layer. A single PostgreSQL instance handles all reads and writes. If the database fails, both the write path (click ingestion) and the read path (dashboard queries) go down simultaneously. There is no deduplication mechanism — if a user's browser retries a click request (network timeout, double-click), the system records two clicks for one actual interaction. This is a billing accuracy risk that more sophisticated architectures address with idempotency keys or stream-level deduplication.

Architecture Preview
Loading architecture preview...
Request Flow — Synchronous Dual-Write

This sequence diagram traces the life of a single ad click from browser to database and back. The critical insight is the synchronous dual-write pattern: the Click Service cannot respond to the client until both the INSERT (raw event storage) and UPDATE (aggregate counter increment) have committed to PostgreSQL. Every click's end-to-end latency includes two full database round trips, and the system's maximum throughput is bounded by the database's write capacity — not the service's CPU or the network.

Loading diagram...

Step-by-Step Walkthrough

  1. 1Browser sends POST /click with campaign_id, ad_id, and user metadata to the Load Balancer
  2. 2Load Balancer routes the request to a Click Service pod via round-robin (~1.5ms routing overhead)
  3. 3Click Service executes INSERT INTO clicks to persist the raw click event as a new row (~10ms with index update)
  4. 4Click Service executes UPDATE campaign_aggregates SET click_count = click_count + 1, acquiring a row-level lock on the target campaign (~15ms + potential lock wait if another thread is updating the same campaign)
  5. 5Only after both writes commit does the Click Service return 200 OK — total ~28ms at low load, but degrades non-linearly under contention
  6. 6Dashboard queries (GET /dashboard) hit the same PostgreSQL instance, competing for the connection pool and buffer cache
  7. 7The GROUP BY aggregation scans the B-tree index on clicks — ~50ms at 1M rows, scaling to seconds at 100M rows as the index depth grows

Pseudocode

// Click handler — synchronous dual-write (the bottleneck)
async function handleClick(campaign_id, ad_id, user_id):
    // Step 1: Persist raw event (append to clicks table)
    await db.execute(
        "INSERT INTO clicks (id, campaign_id, ad_id, user_id, created_at)
         VALUES (gen_random_uuid(), $1, $2, $3, now())",
        [campaign_id, ad_id, user_id]
    )   // ~10ms — includes B-tree index update (write amplification)

    // Step 2: Increment aggregate counter (row-level lock)
    hour = date_trunc('hour', now())
    await db.execute(
        "INSERT INTO campaign_aggregates (campaign_id, hour_bucket, click_count)
         VALUES ($1, $2, 1)
         ON CONFLICT (campaign_id, hour_bucket)
         DO UPDATE SET click_count = click_count + 1, updated_at = now()",
        [campaign_id, hour]
    )   // ~15ms — row lock contention under concurrency

    return 200  // Total: ~28ms (low load) → 200ms+ (high contention)

// Dashboard query — competes with writes for same DB resources
async function getDashboard(campaign_id, start_time, end_time):
    rows = await db.execute(
        "SELECT hour_bucket, click_count, unique_users
         FROM campaign_aggregates
         WHERE campaign_id = $1
           AND hour_bucket BETWEEN $2 AND $3
         ORDER BY hour_bucket",
        [campaign_id, start_time, end_time]
    )   // ~50ms at 1M rows → seconds at 100M rows
    return rows
Database Schema (ER Diagram)

The schema reveals the fundamental tension of the naive approach: two tables serving opposite access patterns on the same database instance. The clicks table is write-optimized (append-only INSERTs at high frequency), while campaign_aggregates is read-optimized (pre-computed sums for fast dashboard queries). Both compete for PostgreSQL's shared buffer pool, WAL bandwidth, and connection pool — creating mutual degradation as traffic increases.

Loading diagram...

Step-by-Step Walkthrough

  1. 1The clicks table stores every raw ad click as a new row — it is append-only and never updated after insertion
  2. 2A B-tree index on (campaign_id, created_at) enables fast time-range queries but causes write amplification: each INSERT must also update the index tree structure
  3. 3At ~3 GB/day growth (1K clicks/sec), the index tree depth increases over time, gradually degrading INSERT performance
  4. 4campaign_aggregates stores pre-computed running totals per campaign per hour — avoiding full-table scans for dashboard queries
  5. 5Each click triggers an UPDATE with row-level locking: multiple concurrent threads trying to increment the same campaign's counter will serialize on the lock
  6. 6With no deduplication mechanism, a retried click (network timeout, double-click) inserts a second row and increments the aggregate twice — a billing accuracy risk

Pseudocode

-- Write path: every click triggers BOTH operations
-- (1) Append raw event
INSERT INTO clicks (id, campaign_id, ad_id, user_id, publisher_id, ip_address, created_at)
VALUES (gen_random_uuid(), 'camp_123', 'ad_456', 'user_789', 'pub_abc', '10.0.0.1', now());
-- Cost: ~10ms + B-tree index update on (campaign_id, created_at)
-- Write amplification: 1 data page + 1-2 index pages per INSERT

-- (2) Increment aggregate (row lock acquired here)
UPDATE campaign_aggregates
SET click_count = click_count + 1, updated_at = now()
WHERE campaign_id = 'camp_123'
  AND hour_bucket = date_trunc('hour', now());
-- Cost: ~15ms + row lock wait if contended
-- At 500+ concurrent writes to same campaign: lock wait > 50ms

-- Read path: dashboard query (competes with writes)
SELECT campaign_id, hour_bucket, SUM(click_count) as total
FROM campaign_aggregates
WHERE campaign_id = 'camp_123'
  AND hour_bucket BETWEEN '2025-01-01' AND '2025-01-02'
GROUP BY campaign_id, hour_bucket
ORDER BY hour_bucket;
-- Cost: ~50ms at 1M rows → seconds at 100M rows
Key Design Decisions
Synchronous Dual-Write

Choice

INSERT raw event + UPDATE aggregate in same request

Rationale

This is the simplest consistency model: the response is not sent until both the event and the aggregate are persisted. There is no eventual consistency lag, no out-of-sync state between raw data and aggregates. The trade-off is throughput — each request requires two database round trips, doubling the per-click latency and halving the maximum throughput compared to write-behind or async patterns.

No Message Queue

Choice

Direct synchronous writes to PostgreSQL

Rationale

A message queue (Kafka, SQS) adds operational complexity: topic management, consumer groups, offset tracking, dead letter queues. For a prototype handling under 500 RPS, the direct-write approach avoids this complexity entirely. The cost is that the write path cannot absorb traffic spikes — if the database slows down, back-pressure propagates directly to the client as increased latency.

No Deduplication

Choice

Accept all click events as-is

Rationale

Deduplication requires maintaining a set of recently seen event IDs (in Redis, Bloom filters, or database unique constraints with ON CONFLICT). For a naive prototype, skipping deduplication keeps the architecture simple but creates a billing accuracy risk: duplicate clicks from network retries or malicious actors inflate counts.

Single Database Instance

Choice

One PostgreSQL for reads and writes

Rationale

A single database eliminates replication lag, split-brain concerns, and failover complexity. The trade-off is that read and write workloads compete for the same resources. Dashboard GROUP BY queries and click INSERTs share the buffer pool, I/O bandwidth, and connection pool — creating mutual degradation under load.

Scale & Performance

Target RPS

~500 sustained (ceiling)

Latency (p99)

~25ms per click (low load)

Storage

~100 GB/year at 1K clicks/sec

Availability

~99% (single instance, no redundancy)

Time & Space Complexity
OperationTimeSpaceNotes
Click ingestion (POST /click)O(1) — single INSERT + single UPDATE per clickO(1) — one row in clicks table, one counter incrementTwo synchronous DB round trips (~25ms total). Connection pool contention adds non-linear latency above 300 concurrent requests.
Aggregate update (UPDATE campaign_aggregates)O(1) — row-level lock + counter incrementO(C) — where C is the number of active campaigns × hour bucketsRow-level locking serializes concurrent updates to the same campaign. At 500+ concurrent writes to one campaign, lock wait exceeds 50ms.
Dashboard query (GROUP BY campaign + hour)O(log N) — B-tree index scan on (campaign_id, hour_bucket)O(W) — where W is the number of hour windows in the query rangeIndex-only scan at small scale (~50ms at 1M rows). Degrades to full table scan at 100M+ rows if index does not fit in buffer pool.
Database Schema (HLD)
clicks

Raw click event log — every ad click is an INSERT. This is the append-only source of truth for billing and audit. At 1K clicks/sec, the table grows ~3 GB/day. B-tree index on (campaign_id, created_at) supports time-range dashboard queries but degrades INSERT performance as the table grows — each new row updates the index, causing write amplification.

id UUID PKcampaign_id TEXTad_id TEXTuser_id TEXTpublisher_id TEXTip_address INETcreated_at TIMESTAMPTZ

Indexes: idx_clicks_campaign_time ON (campaign_id, created_at)

No deduplication — duplicate clicks from retries are recorded as separate rows. At 2% dupe rate, ~2K phantom clicks/day at 100K clicks/day volume.

campaign_aggregates

Pre-computed running totals per campaign per hour. Updated via UPDATE ... SET click_count = click_count + 1 on every click — row-level locking creates contention when multiple threads increment the same campaign simultaneously. Dashboard GROUP BY queries read from this table for fast aggregation without scanning the full clicks table.

campaign_id TEXT PKhour_bucket TIMESTAMPTZ PKclick_count BIGINTunique_users BIGINTupdated_at TIMESTAMPTZ

Indexes: PK on (campaign_id, hour_bucket)

Row-level locks on UPDATE cause thread contention under high concurrency. At 500+ concurrent writes to the same campaign, lock wait time exceeds 50ms.

What-If Scenarios

Traffic spike during a major ad campaign launch (10x normal volume for 30 minutes)

Impact

PostgreSQL connection pool exhausts within seconds. Write latency jumps from 25ms to 500ms+, then the service starts returning 503 errors. All clicks during the spike are lost because there is no buffer or queue to absorb the burst. Dashboard queries also fail since they share the same connection pool.

Mitigation

The Stream Processing variant (V1) absorbs spikes via Kafka buffering — the ingestion service acks in 3ms regardless of downstream load. For V0, the only mitigation is vertical scaling (larger RDS instance) or adding PgBouncer for connection pooling, which delays but does not prevent saturation.

PostgreSQL primary fails with no replica configured

Impact

Complete outage: zero click ingestion, zero dashboard reads. Every in-flight request fails. Depending on recovery time (15-60 minutes for RDS automated recovery), thousands to millions of clicks are permanently lost. There is no event log or queue to replay missed events.

Mitigation

Add a Multi-AZ RDS standby for automatic failover (~30 seconds). For true resilience, migrate to V1 where Kafka durably buffers events during DB outages and replays them on recovery — zero data loss.

Bot traffic generates 50K fraudulent clicks per hour on a single campaign

Impact

No deduplication means every bot click is counted and billed. The campaign_aggregates counter inflates by 50K/hour. The advertiser is overbilled. The write path is also stressed since all bot clicks target the same campaign_id, creating severe row-level lock contention on that campaign's aggregate row.

Mitigation

Add IP-based rate limiting at the Load Balancer (max 10 clicks/minute per IP). For production fraud prevention, V2 Lambda's batch layer performs exact deduplication using click_id sets, catching all duplicates in hourly reconciliation.

Dashboard reporting query scans a full year of data (365 × 24 hour buckets)

Impact

The GROUP BY query on campaign_aggregates becomes a sequential scan across ~8,760 rows per campaign. If the query spans multiple campaigns, it locks shared buffer pages. Write latency spikes as the dashboard query consumes I/O bandwidth and buffer pool memory for 5-15 seconds.

Mitigation

Add a read replica for reporting queries. Implement materialized views for common time ranges (last 7 days, last 30 days). The V1 variant separates read and write paths entirely, so dashboard queries never affect ingestion performance.

Failure Modes & Resilience
ComponentFailureImpactMitigation
PostgreSQL (single instance)Connection pool exhaustionAll requests fail simultaneously — both click ingestion and dashboard reads return errors. No graceful degradation since both paths share one database. Revenue-impacting outage.Deploy PgBouncer for connection pooling (transaction mode). Set max_connections to 200+ with appropriate shared_buffers. Alert at 70% utilization. For structural fix, migrate to V1 where Kafka decouples ingestion from DB writes.
Click ServicePod crash during synchronous writeIn-flight requests are lost — the click was not yet committed to the DB. The Load Balancer routes subsequent requests to remaining pods, so availability degrades proportionally. If all pods crash, 100% of clicks are lost.Run minimum 3 pods across 2 AZs. Configure health checks with 10-second intervals. The Load Balancer automatically drains traffic from unhealthy pods within 15 seconds.
Load BalancerMisconfigured health check marks all pods unhealthy502 Bad Gateway for all traffic. Zero click ingestion, zero dashboard reads. The system appears completely down even though the Click Service pods are healthy.Use multi-criteria health checks (TCP + HTTP /health endpoint). Set deregistration delay to 30 seconds. Configure CloudWatch alarm on UnHealthyHostCount > 0. Multi-AZ ALB provides infrastructure-level redundancy.
Scaling Strategy

Vertical scaling for PostgreSQL: db.r7g.large (500 RPS ceiling) → db.r7g.xlarge (~1K RPS) → db.r7g.2xlarge (~1.5K RPS, diminishing returns). Horizontal scaling for Click Service: 2 → 4 → 8 pods via ECS auto-scaling (CPU >70% for 3 minutes). The database is always the bottleneck — adding more service pods only helps until the connection pool is saturated. Read replicas can offload dashboard queries but do not help the write path. Hard ceiling: ~1.5K RPS with maximum vertical scaling. Beyond this, architectural changes are required (Kafka buffering, async writes) — see V1.

Monitoring & Alerting

Key metrics: (1) PostgreSQL active connections — alert at 70% of max_connections (~140/200), critical at 85%. (2) Click ingestion p99 latency — alert if >100ms sustained for 5 minutes (normal is ~25ms). (3) Write throughput (clicks/sec) — alert if >400 RPS (80% of the ~500 ceiling). (4) campaign_aggregates row lock wait time — alert if p95 >30ms (indicates hot-campaign contention). (5) Dashboard query latency — alert if p99 >500ms. (6) Disk I/O utilization — alert at 75% (WAL write saturation precedes connection pool issues). Dashboard: Grafana with panels for live RPS, DB connection pool gauge, write latency heatmap, lock contention per campaign, and disk IOPS. SLIs: click ingestion p99 <100ms, dashboard p99 <500ms, error rate <0.1%.

Cost Analysis

At ~500 RPS target: PostgreSQL RDS db.r7g.large ($185/month), 500 GB gp3 storage ($45/month), ECS Fargate Click Service 2 pods ($130/month), Application Load Balancer ($25/month). Total: ~$385/month. Cost per million clicks: ~$0.30. This is the cheapest variant but has a hard ceiling at ~500 RPS. Scaling to db.r7g.xlarge ($370/month) extends the ceiling to ~1K RPS but doubles the DB cost. Beyond 1K RPS, vertical scaling has diminishing returns — the V1 Stream variant at 50K RPS costs ~$2,200/month but handles 100x more traffic, reducing cost per million clicks to ~$0.017.

Security Considerations

Authentication: API key validation on every /click request (~1ms overhead). Rate limiting: per-IP rate limit of 100 clicks/minute at the Load Balancer to prevent basic bot attacks. Click fraud prevention: limited to IP-based heuristics — no deduplication, no click_id tracking, no device fingerprinting. IP hashing: store SHA-256 hash of IP addresses instead of raw IPs to comply with GDPR pseudonymization requirements. Data privacy: click data contains user_id and hashed IP — implement 90-day retention policy with automated deletion for GDPR Article 17 compliance. TLS termination at the Load Balancer ensures in-transit encryption.

Deployment Strategy

Rolling deployment for the Click Service — replace one pod at a time while the ALB drains connections from the old pod (30-second deregistration delay). Database schema migrations run during low-traffic windows (2-4 AM) using pg_migrate with transactional DDL. Rollback: revert the Click Service to the previous container image via ECS task definition update (~60 seconds). No blue-green needed at this scale — the simplicity of rolling deployments matches the simplicity of the architecture.

Real-World Examples
  • Early-stage ad networks (pre-Series A) commonly use a single-service + SQL architecture for speed of development before traffic justifies Kafka infrastructure
  • Google AdSense's earliest iteration (2003) used a synchronous PHP + MySQL architecture before migrating to Sawzall and then Millwheel for stream processing
  • Small DTC (direct-to-consumer) brands running self-hosted ad tracking often use this pattern with PostgreSQL — adequate for <1M clicks/day
  • WordPress ad plugins (e.g., Advanced Ads) use synchronous MySQL writes for click tracking — the V0 pattern at its simplest form
Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
V0: Naive (Single Service + SQL)T1~25ms click ingestion~500 RPS$385/monthLow (4 components)99% (single DB)
V1: Stream Processing (Kafka + Windowed)T2<5ms click ack~50K RPS$2,200/monthMedium (8 components)99.9% (multi-AZ)
V2: Lambda Architecture (Speed + Batch)T3<8ms click ack~200K RPS$5,800/monthHigh (12 components)99.99% (dual-path)
V3: CQRS + Event SourcingT45-8ms write~100K RPS$4,500/monthVery High (10 components)99.99% (independent paths)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
Why is ad click aggregation a common interview question?

Ad click aggregation combines several hard distributed systems problems in one scenario: high write throughput (millions of events/day), real-time aggregation (dashboard freshness), billing accuracy (exactly-once counting), and fraud detection (deduplication). It forces candidates to reason about message queues, stream processing, consistency models, and cost trade-offs. Companies with ad platforms (Google, Meta, Amazon, Twitter) ask it because it directly maps to their production systems.

What happens when the database reaches its connection limit?

When all PostgreSQL connections are in use (default 50-100), new requests wait in a connection queue. If the queue fills up, the Click Service receives connection errors and must either retry or drop the click. This manifests as a latency spike followed by errors — the system transitions from slow to failing. Connection poolers like PgBouncer can help but only delay the inevitable if the underlying database cannot keep up with write throughput.

How does the synchronous write path compare to async ingestion?

Synchronous writes add 20-30ms per click (two DB round trips). Async ingestion via Kafka adds 3-5ms per click (one Kafka produce call). The downstream consumer processes events in batches, reducing per-event database cost by 10-100x through batch INSERTs and pre-computed aggregates. The Stream Processing variant demonstrates this: same click volume, 80% lower database utilization.

At what scale should you migrate from naive to stream processing?

Migrate when database utilization consistently exceeds 70% during peak hours, or when dashboard query latency exceeds your SLO. In this simulation, that inflection point is around 500-1,000 RPS. Below 500 RPS, the naive approach is simpler and cheaper. Above 1,000 RPS, you need either Kafka-based async ingestion or database vertical scaling — and vertical scaling has diminishing returns and increasing cost.

How does missing deduplication affect billing accuracy?

Without deduplication, duplicate clicks from browser retries (1-3% of traffic), network timeouts, or bot traffic directly inflate campaign click counts. At 100K clicks/day with a 2% duplicate rate, that is 2,000 phantom clicks per day — potentially thousands of dollars in overbilled advertiser spend. Ad platforms handle this with idempotency keys (click_id + timestamp), Bloom filters for approximate dedup, or event-stream dedup windows.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Ad Click Aggregator?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator