
Outbox Pattern

The Outbox Pattern ensures reliable message publishing by writing the message to an 'outbox' table in the same database transaction as the business operation. A separate process reads the outbox table and publishes messages to the message broker. This eliminates the dual-write problem where a database commit succeeds but the message publish fails (or vice versa), ensuring atomicity between state changes and event publication.

Overview

Consider a common scenario: an order service creates an order in its database and publishes an 'OrderCreated' event to Kafka. These are two separate operations -- a database write and a message broker publish. What happens if the database write succeeds but the Kafka publish fails? The order exists but downstream services never learn about it. What if the Kafka publish succeeds but the database write fails? Downstream services process a phantom order that does not exist.

This is the **dual-write problem**: writing to two separate systems (database + broker) cannot be made atomic without distributed transactions, which are slow, complex, and often unavailable. The Outbox Pattern solves this by reducing the dual-write to a single-write:

1. **Write phase**: In a single database transaction, the service writes the business entity (orders table) AND the outgoing event (outbox table). The outbox row contains the event type, payload, timestamp, and a status flag ('pending').

2. **Relay phase**: A separate process reads pending outbox rows and publishes them to the message broker. After successful publication, it marks the row as 'published' (or deletes it). If the relay crashes, it restarts and re-reads pending rows -- the outbox table is the source of truth.
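
The write phase can be sketched as follows. This is a minimal illustration that uses an in-memory SQLite database as a stand-in for the service's real database; the table layout, the `create_order` function, and the `simulate_failure` flag are assumptions made for the example, not part of any particular implementation.

```python
import json
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id TEXT PRIMARY KEY, total REAL NOT NULL);
    CREATE TABLE outbox (
        id         TEXT PRIMARY KEY,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending',
        created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
    );
""")


def create_order(conn, order_id, total, simulate_failure=False):
    """Insert the order and its OrderCreated event in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "INSERT INTO orders (id, total) VALUES (?, ?)", (order_id, total)
        )
        if simulate_failure:
            # Hypothetical failure between the two writes.
            raise RuntimeError("simulated crash before the outbox insert")
        conn.execute(
            "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
            (
                str(uuid.uuid4()),
                "OrderCreated",
                json.dumps({"order_id": order_id, "total": total}),
            ),
        )


create_order(conn, "order-1", 42.0)

# The failed call leaves neither an order row nor an outbox row behind.
try:
    create_order(conn, "order-2", 99.0, simulate_failure=True)
except RuntimeError:
    pass

orders_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
pending_count = conn.execute(
    "SELECT COUNT(*) FROM outbox WHERE status = 'pending'"
).fetchone()[0]
print(orders_count, pending_count)  # 1 1
```

Because both inserts share one transaction, there is no state in which an order exists without its pending event, or the reverse.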

Two relay strategies exist:

- **Polling publisher**: A background thread or cron job queries the outbox table for pending rows at regular intervals (e.g., every 100ms). Simple, but adds latency of up to one polling interval.
- **Change Data Capture (CDC)**: A tool like Debezium tails the database's transaction log (WAL in PostgreSQL, binlog in MySQL) and publishes new outbox rows to Kafka in near-real-time. Lower latency, no polling overhead, but requires CDC infrastructure.
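
A polling publisher might look roughly like the sketch below. It assumes an SQLite outbox table with a hypothetical auto-incrementing `seq` column to preserve insertion order, and it takes `publish` as a plain callable standing in for a real broker client; none of these names come from a specific library.

```python
import sqlite3


def relay_once(conn, publish, batch_size=100):
    """Publish pending outbox rows in insertion order; return how many were sent."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox "
        "WHERE status = 'pending' ORDER BY seq LIMIT ?",
        (batch_size,),
    ).fetchall()
    for event_id, event_type, payload in rows:
        publish(event_id, event_type, payload)  # if this raises, the row stays pending
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'published' WHERE id = ?", (event_id,)
            )
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE outbox (
        seq        INTEGER PRIMARY KEY AUTOINCREMENT,
        id         TEXT UNIQUE NOT NULL,
        event_type TEXT NOT NULL,
        payload    TEXT NOT NULL,
        status     TEXT NOT NULL DEFAULT 'pending'
    )
""")
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, event_type, payload) VALUES (?, ?, ?)",
        [
            ("e1", "OrderCreated", '{"order_id": "o1"}'),
            ("e2", "OrderCreated", '{"order_id": "o2"}'),
        ],
    )

sent = []  # in-memory stand-in for the broker
first_pass = relay_once(conn, lambda *event: sent.append(event))
second_pass = relay_once(conn, lambda *event: sent.append(event))
print(first_pass, second_pass, [event[0] for event in sent])  # 2 0 ['e1', 'e2']
```

In a real service, `relay_once` would run in a loop with a sleep equal to the polling interval, which is where the added latency comes from.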

The outbox table is a lightweight transactional log within your existing database. It leverages the database's ACID guarantees to ensure that the business operation and the event publication are atomic. The trade-off is eventual delivery: the event reaches the broker after a short delay (milliseconds with CDC, up to seconds with polling).

This pattern is fundamental to event-driven microservices. It is described by Chris Richardson in his book Microservices Patterns, and CDC tools such as Debezium (with its outbox event router SMT) and AWS DMS can serve as the relay.

Key Points
  1. The dual-write problem: writing to a database AND a message broker is not atomic. If one succeeds and the other fails, the system is inconsistent. The Outbox Pattern eliminates this by using a single database transaction.
  2. The outbox table stores pending events alongside business data. Both are written in the same transaction. The database's ACID guarantees ensure atomicity.
  3. Two relay strategies: polling (simple, adds latency) and CDC (near-real-time, requires Debezium or similar). CDC tails the database's transaction log and publishes rows inserted into the outbox table.
  4. Events are published at-least-once: if the relay crashes after publishing but before marking as published, it republishes on restart. Consumers must be idempotent.
  5. The outbox table should be cleaned up periodically: delete published rows older than the retention period. Without cleanup, the table grows unbounded.
  6. Debezium's outbox event router is the most popular CDC-based implementation. It reads the PostgreSQL WAL or MySQL binlog, extracts outbox inserts, and publishes them to Kafka topics with configurable routing.
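
The at-least-once behaviour can be reproduced in a few lines. The sketch below is illustrative only: the `RelayCrash` exception and the `crash_before_mark` flag are hypothetical devices for simulating a relay that dies after publishing but before updating the row, and a Python list stands in for the broker.

```python
import sqlite3


class RelayCrash(Exception):
    """Stands in for the relay process dying between publish and mark."""


def run_relay(conn, broker, crash_before_mark=False):
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE status = 'pending' ORDER BY rowid"
    ).fetchall()
    for event_id, payload in rows:
        broker.append((event_id, payload))  # step 1: publish to the broker
        if crash_before_mark:
            raise RelayCrash(event_id)  # process dies before step 2
        with conn:  # step 2: mark the row as published
            conn.execute(
                "UPDATE outbox SET status = 'published' WHERE id = ?", (event_id,)
            )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, payload TEXT, status TEXT DEFAULT 'pending')"
)
with conn:
    conn.execute(
        "INSERT INTO outbox (id, payload) VALUES (?, ?)", ("e1", '{"order_id": "o1"}')
    )

broker = []  # in-memory stand-in for a Kafka topic
try:
    run_relay(conn, broker, crash_before_mark=True)
except RelayCrash:
    pass
run_relay(conn, broker)  # restart: the row is still pending, so it is sent again

delivered_ids = [event_id for event_id, _ in broker]
print(delivered_ids)  # ['e1', 'e1']
```

The event is never lost, but it reaches the broker twice, which is why the consumer side has to tolerate duplicates.
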
Simple Example

Order Service Publishes Events Reliably

The order service receives a CreateOrder request. In a single database transaction, it: (1) INSERTs a row into the orders table, and (2) INSERTs a row into the outbox table with {event_type: 'OrderCreated', payload: {order_id, items, total}, status: 'pending'}. The transaction commits atomically. A Debezium connector tails the PostgreSQL WAL, detects the outbox insert, publishes the event to the 'orders' Kafka topic, and the outbox row is eventually deleted. If Debezium is down, the outbox rows accumulate safely in the database and are published when Debezium recovers.

Real-World Examples

Zalando

Zalando's open-source Nakadi event bus uses the outbox pattern across hundreds of microservices. Each service writes events to a local outbox table in the same transaction as business updates. A per-service publisher reads the outbox and publishes to Nakadi (their Kafka-based event bus). This ensures that every database change is reflected as an event, enabling reliable event-driven architecture across the company.

Debezium (Red Hat)

Debezium is the most widely used CDC platform for implementing the outbox pattern. Its 'outbox event router' SMT (Single Message Transform) extracts events from the outbox table, routes them to Kafka topics based on event type, and supports event deduplication via event IDs. Debezium tails PostgreSQL WAL, MySQL binlog, MongoDB oplog, and SQL Server CDC to achieve near-real-time event publishing.

Wix

Wix processes millions of website updates daily using the outbox pattern. When a user edits their website, the change is saved to the database with an outbox entry in the same transaction. A CDC pipeline publishes the change event to Kafka, which triggers CDN invalidation, search reindexing, and real-time collaboration updates. The outbox pattern ensures no edit is ever lost, even during broker outages.

Trade-Offs
| Aspect | Description |
|---|---|
| Consistency vs Complexity | The outbox pattern adds a table, a relay process, and cleanup logic. This is more complex than a naive dual-write (publish after commit). But the naive approach is fundamentally unreliable -- the complexity of the outbox pattern is the price of correctness. |
| Latency vs Simplicity | Polling adds latency of up to one polling interval (100ms-1s). CDC adds infrastructure complexity (Debezium cluster, WAL configuration, connector management) but achieves near-real-time delivery. Choose based on your latency requirements and operational maturity. |
| Database Load | Every business transaction includes an extra INSERT into the outbox table. Under high throughput, this adds write amplification. The outbox table must be indexed for the relay query (status + created_at). Periodic cleanup (DELETE published rows) adds maintenance I/O. For most workloads, this overhead is negligible. |
| At-Least-Once vs Exactly-Once | The outbox pattern guarantees at-least-once publication: if the relay crashes after publishing but before marking as published, it republishes. Consumers must handle duplicates. For effectively exactly-once processing end-to-end, combine the outbox pattern with consumer-side idempotency keys. |
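
One way to handle duplicates on the consumer side is to record each processed event ID in the same transaction as the side effect. The sketch below assumes a hypothetical `processed_events` table and a simple counter as the side effect, with SQLite standing in for the consumer's database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE processed_events (event_id TEXT PRIMARY KEY);
    CREATE TABLE stats (name TEXT PRIMARY KEY, value INTEGER NOT NULL);
    INSERT INTO stats (name, value) VALUES ('orders_created', 0);
""")


def handle_order_created(conn, event_id):
    """Apply the event once; return False if this event ID was already seen."""
    with conn:  # dedup record and side effect commit together
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_events (event_id) VALUES (?)",
            (event_id,),
        )
        if cur.rowcount == 0:
            return False  # duplicate delivery: skip the side effect
        conn.execute(
            "UPDATE stats SET value = value + 1 WHERE name = 'orders_created'"
        )
    return True


# "e1" is delivered twice, as an at-least-once relay may do.
results = [handle_order_created(conn, event_id) for event_id in ("e1", "e1", "e2")]
orders_created = conn.execute(
    "SELECT value FROM stats WHERE name = 'orders_created'"
).fetchone()[0]
print(results, orders_created)  # [True, False, True] 2
```

Keeping the dedup insert and the side effect in one transaction means a consumer crash cannot record the event as processed without also applying it.
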
Case Study

How Debezium Solved the Dual-Write Problem for Microservices

Scenario

A large e-commerce platform migrated from a monolith to microservices. Each service owned its database but needed to publish domain events for inter-service communication. Teams implemented the naive approach: commit to database, then publish to Kafka. Under load, 0.1% of publishes failed (Kafka timeouts, producer errors), creating 'ghost state' -- database records with no corresponding Kafka events. Downstream services (search, recommendations, analytics) had missing data. Reconciliation jobs ran nightly but could not fix real-time inconsistencies.

Solution

The platform adopted Debezium's outbox pattern. Each service writes to a local outbox table in the same transaction as the business update. Debezium connectors (one per database) tail the WAL and publish outbox events to Kafka. The outbox table schema: (id UUID, aggregateType VARCHAR, aggregateId VARCHAR, type VARCHAR, payload JSONB, created_at TIMESTAMP). Debezium's outbox event router SMT extracts and routes events based on aggregateType.
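
A connector registration for such a setup might resemble the following sketch, written as a Python dictionary that could be serialized and sent to the Kafka Connect REST API. The connector name, host, credentials, and database name are hypothetical placeholders. The property names are given as recalled from the Debezium documentation and should be checked against the Debezium version in use (for example, `topic.prefix` applies to Debezium 2.x). PostgreSQL folds unquoted column names to lowercase, which is why the routing field is written `aggregatetype`.

```python
import json

# Hypothetical connector definition; all connection values are placeholders.
connector = {
    "name": "orders-outbox-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",
        "database.hostname": "orders-db",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "change-me",
        "database.dbname": "orders",
        "topic.prefix": "orders",
        # Capture only the outbox table, not the business tables.
        "table.include.list": "public.outbox",
        # Outbox event router SMT: route each row by its aggregate type.
        "transforms": "outbox",
        "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
        "transforms.outbox.route.by.field": "aggregatetype",
        "transforms.outbox.route.topic.replacement": "outbox.event.${routedByValue}",
    },
}

request_body = json.dumps(connector, indent=2)
print(request_body)
```

With this routing, a row whose aggregate type is `order` would be expected to land on a topic named `outbox.event.order`.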

Outcome

Ghost state dropped from 0.1% to effectively zero. Nightly reconciliation jobs were eliminated. Event publishing latency was under 100ms (WAL tail to Kafka delivery). The platform processes 50M events/day through the outbox pipeline with 99.99% reliability. The Debezium connector cluster requires minimal operational overhead (3 tasks, automated offset tracking). The key lesson: 'If your service writes to a database and a broker, you have a dual-write problem. The outbox pattern is the standard solution.'

Common Mistakes
  • Publishing to the broker first, then writing to the database. If the database write fails, the broker has a phantom event that does not correspond to any real state. Always write to the database (with outbox) first.
  • Not cleaning up the outbox table. Without periodic deletion of published rows, the outbox table grows unbounded, degrading query performance and consuming disk space. Use a cleanup job or a TTL-based partition strategy.
  • Using the outbox pattern without idempotent consumers. The relay may publish duplicates (crash after publish, before marking as sent). Consumers must deduplicate using the event ID or an idempotency key.
  • Putting too much data in the outbox payload. The outbox row is written in the same transaction as the business data. Large payloads (e.g., full document blobs) slow down the transaction. Keep payloads small and reference large data by ID.
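
A cleanup job for the second mistake could be as small as the sketch below. It assumes `created_at` is stored as an ISO 8601 UTC string, so string comparison matches time order, and uses a hypothetical `purge_published` function with SQLite as a stand-in database. Pending rows are never deleted, regardless of age.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def purge_published(conn, retention, now=None):
    """Delete published rows older than `retention`; return the number removed."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - retention).isoformat()
    with conn:
        cur = conn.execute(
            "DELETE FROM outbox WHERE status = 'published' AND created_at < ?",
            (cutoff,),
        )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id TEXT PRIMARY KEY, status TEXT NOT NULL, created_at TEXT NOT NULL)"
)
now = datetime(2024, 1, 15, tzinfo=timezone.utc)  # fixed clock for a repeatable demo
with conn:
    conn.executemany(
        "INSERT INTO outbox (id, status, created_at) VALUES (?, ?, ?)",
        [
            ("old-published", "published", (now - timedelta(days=10)).isoformat()),
            ("recent-published", "published", (now - timedelta(hours=1)).isoformat()),
            ("old-pending", "pending", (now - timedelta(days=10)).isoformat()),
        ],
    )

removed = purge_published(conn, retention=timedelta(days=7), now=now)
remaining = [row[0] for row in conn.execute("SELECT id FROM outbox ORDER BY id")]
print(removed, remaining)  # 1 ['old-pending', 'recent-published']
```

On large tables, deleting in bounded batches or dropping time-based partitions tends to be gentler on the database than a single large DELETE.
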

Metrics to watch

event_delivery_rate, outbox_lag_ms, duplicate_event_pct, transaction_success_rate
Test Your Understanding

1. What is the core problem the Outbox Pattern solves?

2. Why must consumers be idempotent when using the Outbox Pattern?

Deeper Reading