Vetora logo
🔄Data Engineering

Change Data Capture

Change Data Capture (CDC) tracks row-level changes (inserts, updates, deletes) in a source database and propagates them downstream in real time. Log-based CDC reads the database's transaction log (WAL, binlog) to capture every change without impacting source performance, enabling real-time data replication, event-driven architectures, and streaming ETL.

Overview

Most data in organizations lives in operational databases (PostgreSQL, MySQL, MongoDB). Getting that data to analytical systems (data warehouses, search indexes, caches) has traditionally required periodic batch extraction -- a nightly SQL query that dumps the table. This approach is slow (up to 24 hours of latency), incomplete (misses intermediate states and deletes), and burdensome on the source database (full table scans compete with production traffic).

**Change Data Capture** solves all three problems by reading the database's own transaction log. Every relational database maintains a log of all committed changes for crash recovery and replication: PostgreSQL's Write-Ahead Log (WAL), MySQL's binary log (binlog), MongoDB's oplog, SQL Server's transaction log. CDC tools read this log and emit a stream of change events: each event describes what changed (the before and after state of a row), when it changed, and in what transaction.

**Debezium** is the most widely adopted open-source CDC platform. It runs as a set of Kafka Connect source connectors -- one per source database. Debezium reads the transaction log, converts each change into a structured event (JSON or Avro), and publishes it to a Kafka topic (one topic per table by default). Downstream consumers (Flink, Spark, Kafka Connect sinks, custom applications) subscribe to these topics to react to changes in real time.

CDC has become foundational for modern data architectures. It replaces batch ETL extraction with real-time streaming extraction. It enables the **outbox pattern** for reliable microservice event publishing. It powers real-time cache invalidation (invalidate only the changed records), search index updates (update Elasticsearch without polling), and materialized view maintenance. CDC turns your database into an event stream without changing application code.

Key Points
  • 1Log-based CDC reads the database's transaction log (WAL, binlog, oplog) rather than polling the table with queries. This captures every change (including deletes and intermediate updates), maintains transaction ordering, and adds zero load to the source database's query engine.
  • 2Debezium is the standard open-source CDC tool. It supports PostgreSQL, MySQL, MongoDB, SQL Server, Oracle, Cassandra, and more. It runs as Kafka Connect connectors, producing change events to Kafka topics with exactly-once delivery via Kafka Connect's offset management.
  • 3CDC events include the before and after state of the row, the operation type (c=create, u=update, d=delete, r=read/snapshot), the source transaction ID, and the timestamp. This enables downstream consumers to reconstruct the exact state of the source at any point in time.
  • 4Initial snapshot: when CDC is first configured, Debezium reads the entire current state of each table (a consistent snapshot) before switching to log-based streaming. This ensures the downstream consumer starts with a complete baseline.
  • 5The outbox pattern uses CDC to solve the dual-write problem in microservices. Instead of writing to both the database and Kafka (which cannot be done atomically), the application writes to an 'outbox' table in the same database transaction. Debezium captures the outbox inserts and publishes them to Kafka, ensuring exactly-once event publishing without distributed transactions.
  • 6Schema evolution in CDC requires coordination: when the source schema changes (e.g., adding a column), the CDC events must include the new schema. Debezium integrates with the Confluent Schema Registry (Avro/Protobuf) to handle schema evolution compatibly.
Simple Example

Real-Time Product Search Index

An e-commerce platform has a products table in PostgreSQL with 10 million rows. Whenever a merchant updates a product's price, description, or availability, the change must be reflected in the Elasticsearch search index within seconds. Instead of polling the products table every minute (which loads the database and misses rapid changes), Debezium reads the PostgreSQL WAL and publishes product change events to a Kafka topic. A Kafka Connect Elasticsearch sink connector consumes these events and updates the search index in real time. When a merchant marks a product as out-of-stock, the Elasticsearch document is updated within 2 seconds, and customers immediately stop seeing it in search results. No application code changes were needed -- CDC reads the log transparently.

Real-World Examples

Airbnb

Airbnb uses Debezium-based CDC to stream changes from their MySQL databases to their Hudi data lake on S3. Previously, nightly batch extraction meant the data lake was always 12-24 hours stale. With CDC, the data lake reflects source changes within minutes. This enabled real-time pricing analytics: hosts can see how their pricing compares to similar listings in near-real-time rather than waiting for yesterday's batch.

Shopify

Shopify processes over 10 billion CDC events daily using their internal CDC platform built on MySQL binlog readers. Every change to a shop, product, order, or customer record is captured and streamed to their event bus. This powers real-time search indexing (product updates reflected in Shopify's storefront search within seconds), real-time analytics, and data warehouse replication. CDC replaced hundreds of custom ETL jobs that polled MySQL tables.

Wepay (JPMorgan Chase)

Wepay pioneered the use of Debezium in production for payment data replication. They stream CDC events from their MySQL payment databases to BigQuery for real-time fraud analytics and compliance reporting. Their CDC pipeline processes millions of payment events daily with exactly-once semantics, ensuring that financial records in BigQuery are always consistent with the source. They contributed extensively to the Debezium open-source project.

Trade-Offs
AspectDescription
Log-Based CDC vs Query-Based CDCLog-based CDC (Debezium reading WAL/binlog) captures every change including deletes, maintains ordering, and adds zero query load to the source. Query-based CDC (periodic SELECT WHERE updated_at > last_run) is simpler to set up but misses deletes, misses rapid updates between polls, loads the source database, and requires a reliable timestamp column on every table. Log-based is strictly superior for production use.
Latency vs ThroughputCDC can deliver changes within milliseconds of the commit, but very high-frequency sources (millions of writes/second) may cause Kafka topic lag if the CDC connector cannot keep up. Debezium's throughput depends on the source database's log format and the connector's batch size. Tuning batch size trades latency (larger batches = more delay) for throughput (larger batches = fewer Kafka produce calls).
Source Database Compatibility vs Operational ComplexityEnabling CDC requires source database configuration: PostgreSQL needs logical replication slots (wal_level=logical), MySQL needs binlog enabled (binlog_format=ROW). Replication slots in PostgreSQL can cause WAL accumulation if the consumer falls behind, potentially filling the disk. Monitoring replication slot lag and setting max_slot_wal_keep_size are essential operational tasks.
Full Row vs Changed Columns OnlyDebezium can emit the full row state (before and after) or just the changed columns. Full row state simplifies downstream processing (each event is self-contained) but increases event size and Kafka storage. Changed-columns-only reduces bandwidth but requires the consumer to maintain state and apply deltas. Most production deployments use full row state for simplicity.
Case Study

Zalando's Real-Time Data Platform with CDC

Scenario

Zalando, Europe's largest online fashion platform, had over 200 microservices writing to individual PostgreSQL databases. Their data warehouse ingestion relied on batch extraction jobs running every 6 hours. This 6-hour staleness meant fraud detection missed real-time patterns, inventory dashboards showed stale stock levels, and marketing campaigns targeted users based on yesterday's behavior.

Solution

Zalando deployed Debezium CDC connectors for their top 50 PostgreSQL databases, streaming changes to Kafka topics. They built Nakadi, an event bus abstraction over Kafka, that consumed CDC events and applied schema validation. A Flink pipeline consumed these events, applied transformations, and loaded them into their Exasol data warehouse with sub-minute latency. They also used CDC to feed their Elasticsearch product search index.

Outcome

Data warehouse latency dropped from 6 hours to under 5 minutes. Fraud detection accuracy improved by 35% because the ML model now scored transactions against real-time user behavior patterns rather than stale daily snapshots. Inventory accuracy on the website improved, reducing 'out of stock after adding to cart' incidents by 40%. The CDC platform processed over 2 billion events per day.

Common Mistakes
  • Not monitoring PostgreSQL replication slot lag. When the CDC consumer falls behind (e.g., during a Kafka outage), the replication slot prevents PostgreSQL from deleting WAL segments, potentially filling the disk and crashing the database. Set max_slot_wal_keep_size and alert on slot lag exceeding a threshold.
  • Using query-based CDC (polling with WHERE updated_at > X) in production. This approach misses deletes, misses rapid intermediate updates, requires a reliable timestamp column on every table (which many legacy schemas lack), and adds query load proportional to table size. Use log-based CDC (Debezium) instead.
  • Treating CDC events as a reliable event bus without the outbox pattern. CDC captures all database changes, including those from bug fixes, migrations, and manual admin queries. If your microservices consume CDC events as domain events, they will process unintended changes. Use the outbox pattern: write explicit domain events to a dedicated outbox table, and let CDC capture only those.
  • Ignoring schema evolution compatibility. When a source column is renamed or its type changes, downstream consumers that deserialize CDC events with a fixed schema will break. Use Avro or Protobuf with a schema registry and enforce backward/forward compatibility rules.
Related Concepts

See Change Data Capture in action

Explore system design templates that use change data capture and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Stream database changes via CDC to downstream consumers

Metrics to watch
cdc_lag_msevent_throughput_rpsschema_evolution_errorsconsumer_lag
Run Simulation
Test Your Understanding

1Why is log-based CDC (reading the WAL/binlog) preferred over query-based CDC (polling with SQL)?

2What is the outbox pattern, and what problem does it solve with CDC?

Deeper Reading