1Why is log-based CDC (reading the WAL/binlog) preferred over query-based CDC (polling with SQL)?
Change Data Capture (CDC) tracks row-level changes (inserts, updates, deletes) in a source database and propagates them downstream in real time. Log-based CDC reads the database's transaction log (WAL, binlog) to capture every change without impacting source performance, enabling real-time data replication, event-driven architectures, and streaming ETL.
Most data in organizations lives in operational databases (PostgreSQL, MySQL, MongoDB). Getting that data to analytical systems (data warehouses, search indexes, caches) has traditionally required periodic batch extraction -- a nightly SQL query that dumps the table. This approach is slow (up to 24 hours of latency), incomplete (misses intermediate states and deletes), and burdensome on the source database (full table scans compete with production traffic).
**Change Data Capture** solves all three problems by reading the database's own transaction log. Every relational database maintains a log of all committed changes for crash recovery and replication: PostgreSQL's Write-Ahead Log (WAL), MySQL's binary log (binlog), MongoDB's oplog, SQL Server's transaction log. CDC tools read this log and emit a stream of change events: each event describes what changed (the before and after state of a row), when it changed, and in what transaction.
**Debezium** is the most widely adopted open-source CDC platform. It runs as a set of Kafka Connect source connectors -- one per source database. Debezium reads the transaction log, converts each change into a structured event (JSON or Avro), and publishes it to a Kafka topic (one topic per table by default). Downstream consumers (Flink, Spark, Kafka Connect sinks, custom applications) subscribe to these topics to react to changes in real time.
CDC has become foundational for modern data architectures. It replaces batch ETL extraction with real-time streaming extraction. It enables the **outbox pattern** for reliable microservice event publishing. It powers real-time cache invalidation (invalidate only the changed records), search index updates (update Elasticsearch without polling), and materialized view maintenance. CDC turns your database into an event stream without changing application code.
Real-Time Product Search Index
An e-commerce platform has a products table in PostgreSQL with 10 million rows. Whenever a merchant updates a product's price, description, or availability, the change must be reflected in the Elasticsearch search index within seconds. Instead of polling the products table every minute (which loads the database and misses rapid changes), Debezium reads the PostgreSQL WAL and publishes product change events to a Kafka topic. A Kafka Connect Elasticsearch sink connector consumes these events and updates the search index in real time. When a merchant marks a product as out-of-stock, the Elasticsearch document is updated within 2 seconds, and customers immediately stop seeing it in search results. No application code changes were needed -- CDC reads the log transparently.
Airbnb
Airbnb uses Debezium-based CDC to stream changes from their MySQL databases to their Hudi data lake on S3. Previously, nightly batch extraction meant the data lake was always 12-24 hours stale. With CDC, the data lake reflects source changes within minutes. This enabled real-time pricing analytics: hosts can see how their pricing compares to similar listings in near-real-time rather than waiting for yesterday's batch.
Shopify
Shopify processes over 10 billion CDC events daily using their internal CDC platform built on MySQL binlog readers. Every change to a shop, product, order, or customer record is captured and streamed to their event bus. This powers real-time search indexing (product updates reflected in Shopify's storefront search within seconds), real-time analytics, and data warehouse replication. CDC replaced hundreds of custom ETL jobs that polled MySQL tables.
Wepay (JPMorgan Chase)
Wepay pioneered the use of Debezium in production for payment data replication. They stream CDC events from their MySQL payment databases to BigQuery for real-time fraud analytics and compliance reporting. Their CDC pipeline processes millions of payment events daily with exactly-once semantics, ensuring that financial records in BigQuery are always consistent with the source. They contributed extensively to the Debezium open-source project.
| Aspect | Description |
|---|---|
| Log-Based CDC vs Query-Based CDC | Log-based CDC (Debezium reading WAL/binlog) captures every change including deletes, maintains ordering, and adds zero query load to the source. Query-based CDC (periodic SELECT WHERE updated_at > last_run) is simpler to set up but misses deletes, misses rapid updates between polls, loads the source database, and requires a reliable timestamp column on every table. Log-based is strictly superior for production use. |
| Latency vs Throughput | CDC can deliver changes within milliseconds of the commit, but very high-frequency sources (millions of writes/second) may cause Kafka topic lag if the CDC connector cannot keep up. Debezium's throughput depends on the source database's log format and the connector's batch size. Tuning batch size trades latency (larger batches = more delay) for throughput (larger batches = fewer Kafka produce calls). |
| Source Database Compatibility vs Operational Complexity | Enabling CDC requires source database configuration: PostgreSQL needs logical replication slots (wal_level=logical), MySQL needs binlog enabled (binlog_format=ROW). Replication slots in PostgreSQL can cause WAL accumulation if the consumer falls behind, potentially filling the disk. Monitoring replication slot lag and setting max_slot_wal_keep_size are essential operational tasks. |
| Full Row vs Changed Columns Only | Debezium can emit the full row state (before and after) or just the changed columns. Full row state simplifies downstream processing (each event is self-contained) but increases event size and Kafka storage. Changed-columns-only reduces bandwidth but requires the consumer to maintain state and apply deltas. Most production deployments use full row state for simplicity. |
Zalando's Real-Time Data Platform with CDC
Scenario
Zalando, Europe's largest online fashion platform, had over 200 microservices writing to individual PostgreSQL databases. Their data warehouse ingestion relied on batch extraction jobs running every 6 hours. This 6-hour staleness meant fraud detection missed real-time patterns, inventory dashboards showed stale stock levels, and marketing campaigns targeted users based on yesterday's behavior.
Solution
Zalando deployed Debezium CDC connectors for their top 50 PostgreSQL databases, streaming changes to Kafka topics. They built Nakadi, an event bus abstraction over Kafka, that consumed CDC events and applied schema validation. A Flink pipeline consumed these events, applied transformations, and loaded them into their Exasol data warehouse with sub-minute latency. They also used CDC to feed their Elasticsearch product search index.
Outcome
Data warehouse latency dropped from 6 hours to under 5 minutes. Fraud detection accuracy improved by 35% because the ML model now scored transactions against real-time user behavior patterns rather than stale daily snapshots. Inventory accuracy on the website improved, reducing 'out of stock after adding to cart' incidents by 40%. The CDC platform processed over 2 billion events per day.
See Change Data Capture in action
Explore system design templates that use change data capture and run traffic simulations to see how these concepts perform under real load.
Browse Templates1Why is log-based CDC (reading the WAL/binlog) preferred over query-based CDC (polling with SQL)?
2What is the outbox pattern, and what problem does it solve with CDC?