Vetora logo
📋Storage Engines

Write-Ahead Log (WAL)

The Write-Ahead Log (WAL) is the fundamental crash-recovery mechanism used by virtually every database. The core rule is simple: log the change to durable storage BEFORE applying it to the actual data pages. This ensures that committed transactions can always be recovered after a crash, even if the data pages were not yet written to disk.

Overview

The Write-Ahead Log is the single most important mechanism for ensuring database durability and crash recovery. The principle is deceptively simple: before any change is applied to data pages (the actual tables and indexes on disk), a log record describing that change must first be written to a durable, append-only log file. If the system crashes at any point -- during the write, after the log but before the data page update, or even during recovery itself -- the WAL contains enough information to reconstruct the correct state. This guarantee, known as the WAL protocol, is the foundation upon which every transactional database builds its ACID durability promise.

A WAL record typically contains a Log Sequence Number (LSN) that uniquely orders all log entries, the transaction ID identifying which transaction generated this change, the page or row identifier being modified, and before-images and after-images of the changed data. The before-image is the data as it existed before the change (used for undo/rollback), and the after-image is the data after the change (used for redo/recovery). Some systems (like PostgreSQL) use physiological logging, which combines physical page identifiers with logical operation descriptions, reducing log volume while maintaining the ability to redo operations correctly even on pages that have been reorganized. The WAL is strictly append-only and written sequentially, making it extremely fast on both HDDs and SSDs because sequential writes avoid seek overhead.

Crash recovery using the WAL follows a well-defined protocol. When the database restarts after a crash, the recovery process has two phases: redo and undo. The redo phase scans the WAL forward from the last checkpoint and replays all logged changes, whether committed or not, to bring data pages up to date. This is necessary because committed transactions may have had their WAL records fsynced to disk but their dirty data pages may not have been flushed yet (the 'steal' policy allows dirty pages to be evicted). The undo phase then rolls back any changes from transactions that were active at the time of the crash and never committed, using the before-images in the WAL records. After recovery, the database is in the exact state it would have been in if all committed transactions' changes had been applied and all uncommitted transactions had been rolled back.

Checkpoints are the mechanism that bounds recovery time. Without checkpoints, recovery would need to replay the entire WAL from the beginning of time. A checkpoint forces all dirty data pages to be flushed to disk, then records a checkpoint record in the WAL with the LSN of the oldest dirty page. Recovery only needs to start from this checkpoint LSN, not from the beginning. Fuzzy checkpoints (used by most modern databases) allow the system to continue processing transactions during the checkpoint flush, recording the set of dirty pages at checkpoint start and ensuring they are all flushed by checkpoint end. The trade-off is clear: more frequent checkpoints reduce recovery time but increase I/O overhead during normal operation. Most production databases configure checkpoint intervals of 5-15 minutes, limiting recovery time to that window.

Beyond crash recovery, the WAL serves as the backbone for database replication. PostgreSQL's streaming replication works by shipping WAL segments (typically 16MB files) from the primary to standby servers, which replay them to maintain an identical copy of the database. MySQL's binary log serves a similar purpose, recording logical or row-level changes that replicas apply in order. This dual use of WAL -- for both crash recovery and replication -- makes it the most critical component in any database deployment. Group commit is the key optimization for WAL throughput: instead of issuing an fsync for every individual transaction commit, the database batches multiple concurrent commits into a single fsync call, amortizing the expensive disk flush across dozens or hundreds of transactions. This can improve commit throughput by 10-50x under concurrent workloads.

Key Points
  • 1The WAL rule: log the change to durable storage BEFORE modifying data pages. This ensures committed transactions survive any crash, because the log can always be replayed to reconstruct the correct state.
  • 2WAL records contain an LSN (ordering), transaction ID, page/row identifier, and before/after images. Before-images support undo (rollback), after-images support redo (recovery). The log is strictly append-only and written sequentially.
  • 3Recovery has two phases: redo (replay all WAL records from the last checkpoint to bring pages up to date) and undo (roll back changes from transactions that were active but uncommitted at crash time). This ARIES-style protocol handles any crash scenario.
  • 4Checkpoints flush dirty data pages to disk and record a WAL position, bounding the amount of WAL that must be replayed during recovery. More frequent checkpoints mean faster recovery but higher I/O overhead during normal operation.
  • 5Group commit batches multiple transaction commits into a single fsync, amortizing the expensive disk flush. This can improve throughput by 10-50x under concurrent workloads, making it essential for high-throughput OLTP systems.
  • 6WAL enables replication: PostgreSQL streams WAL segments to standbys, MySQL ships binary log events to replicas. The same sequential log used for crash recovery becomes the replication stream, providing a unified mechanism for durability and high availability.
Simple Example

The Accountant's Double-Entry Journal

Imagine an accountant managing a large ledger with thousands of account balances. Before changing any balance in the ledger, the accountant first writes the change in a sequential journal: 'Transfer $500 from Account A to Account B -- Account A was $2000, now $1500; Account B was $3000, now $3500.' Only after the journal entry is written in ink does the accountant update the actual ledger pages. If the accountant is interrupted mid-update (a crash), the journal shows exactly which changes were recorded but not yet applied to the ledger. A colleague can read the journal, check which ledger pages match the journal entries, and complete any unfinished updates or undo any partial changes. The journal is the WAL; the ledger pages are the data files.

Real-World Examples

PostgreSQL

PostgreSQL's WAL (stored in pg_wal/ directory) uses physiological logging with 8KB WAL records that reference pages by block number and describe logical operations within each page. WAL segments are 16MB files by default. PostgreSQL uses WAL for both crash recovery and streaming replication -- standby servers receive WAL segments and replay them continuously. The wal_level setting controls how much detail is logged: 'minimal' for crash recovery only, 'replica' for streaming replication, and 'logical' for logical replication with row-level change decoding.

SQLite

SQLite offers WAL mode as an alternative to its default rollback journal. In WAL mode, changes are appended to a separate WAL file while readers continue reading from the original database file, enabling concurrent readers and a single writer without blocking each other. WAL mode checkpoints periodically transfer changes from the WAL file back to the main database file. This design makes SQLite WAL mode ideal for read-heavy applications because readers never block writers and writers never block readers.

etcd

etcd, the distributed key-value store used by Kubernetes for cluster state, uses a WAL as the persistent log for its Raft consensus protocol. Every proposed state change is first written to the WAL, then applied to the in-memory BoltDB key-value store. The WAL entries are the Raft log entries -- they are replicated to a quorum of nodes before being committed. This makes etcd's WAL both a crash-recovery mechanism and a consensus/replication mechanism, combining the two roles that databases traditionally handle with separate subsystems.

Trade-Offs
AspectDescription
Durability vs Write LatencyEach commit requires an fsync of the WAL to guarantee durability, and fsync latency is typically 1-10ms on SSDs and 5-15ms on HDDs. Group commit amortizes this across multiple transactions, but a single-threaded workload with one transaction at a time will be limited to 100-1000 commits per second. Relaxing durability (fsync every N seconds instead of every commit) increases throughput but risks losing the last N seconds of data on crash.
Recovery Time vs Checkpoint FrequencyMore frequent checkpoints reduce recovery time (less WAL to replay) but increase I/O overhead during normal operation because dirty pages must be flushed more often. A 5-minute checkpoint interval means recovery replays at most 5 minutes of WAL, but the periodic flush of dirty pages can cause latency spikes. Some systems use asynchronous or fuzzy checkpoints to spread the I/O load.
WAL Size vs Replication LagRetaining more WAL segments allows replicas to fall further behind and catch up without requiring a full base backup, but consumes disk space on the primary. PostgreSQL's wal_keep_size and replication slots manage this trade-off. Setting retention too low causes replicas to lose their replication position; setting it too high wastes primary disk space.
Full-Page Writes vs Log VolumeAfter a checkpoint, the first modification to each page writes the entire page image to the WAL (full-page write) to protect against torn pages -- partial page writes caused by a crash during a multi-sector page write. This can double WAL volume but is essential for correctness on file systems that do not guarantee atomic page writes. Some file systems (ZFS) or hardware (battery-backed write cache) make this unnecessary.
Case Study

PostgreSQL Streaming Replication -- WAL as the Replication Protocol

Scenario

A SaaS platform running PostgreSQL needed high availability with automatic failover and read replicas for scaling read traffic. The primary database handled 10,000 writes per second, and any data loss during failover was unacceptable. The team needed a replication solution that was tightly integrated with the database's crash recovery mechanism to ensure that replicas maintained exactly the same state as the primary without complex application-level change capture.

Solution

The team deployed PostgreSQL streaming replication, which reuses the WAL as the replication protocol. The primary server's WAL sender process streams WAL records to standby servers in real time as they are generated, rather than waiting for complete WAL segments. Standbys run in continuous recovery mode, applying received WAL records to maintain an up-to-date copy. Synchronous replication was enabled for one standby (synchronous_commit = on) to guarantee zero data loss -- the primary waits for the synchronous standby to acknowledge each WAL write before confirming the commit to the client. Additional asynchronous standbys served read traffic with sub-second replication lag.

Outcome

The system achieved zero data loss during primary failover by promoting the synchronous standby, which had received all committed WAL records. Recovery time during failover was under 30 seconds, limited by DNS propagation and connection pool drainage. Read replicas handled 80% of read queries, reducing primary load by 4x. WAL-based replication consumed minimal primary CPU overhead (less than 5%) because WAL records are generated anyway for crash recovery -- streaming them to standbys is nearly free. The unified WAL mechanism meant the team did not need separate crash-recovery and replication systems, simplifying operations significantly.

Common Mistakes
  • Disabling fsync for performance. Setting fsync=off in PostgreSQL or innodb_flush_log_at_trx_commit=0 in MySQL dramatically increases throughput but means committed transactions can be lost during a crash. This is acceptable only for ephemeral data that can be reconstructed, never for data-of-record.
  • Ignoring WAL disk performance. The WAL is on the critical write path -- every commit must wait for the WAL fsync. Placing WAL on a slow disk or sharing it with data files creates a bottleneck. Best practice is to place WAL on a dedicated, fast SSD or NVMe volume with no other I/O contention.
  • Setting checkpoint intervals too long to reduce I/O. While infrequent checkpoints reduce foreground I/O, they increase crash recovery time (more WAL to replay) and WAL disk usage (more segments to retain). A 60-minute checkpoint interval means up to 60 minutes of WAL replay during recovery, which may violate RTO requirements.
  • Forgetting to size WAL retention for replication needs. If WAL segments are recycled before a replica can consume them, the replica loses its replication position and requires a full base backup to resync. Use replication slots (PostgreSQL) or binary log retention (MySQL) to prevent premature WAL deletion when replicas exist.
Related Concepts

See Write-Ahead Log (WAL) in action

Explore system design templates that use write-ahead log (wal) and run traffic simulations to see how these concepts perform under real load.

Browse Templates
Test Your Understanding

1What is the fundamental rule of Write-Ahead Logging?

2What is the purpose of a checkpoint in WAL-based recovery?

3How does group commit improve WAL-based transaction throughput?

Deeper Reading