Production-grade S3-equivalent: Reed-Solomon (10+4) erasure coding for 11-nines durability at 1.4x storage overhead, Raft consensus for strongly consistent metadata, multipart upload, copy-on-write versioning, lifecycle tiering (standard/IA/glacier), and multi-region replication for disaster recovery.
The erasure-coded multi-region architecture represents the production-grade solution to distributed file storage — the architecture behind Amazon S3, Google Cloud Storage, and Azure Blob Storage. It solves two critical limitations of the metadata-plus-replication approach (v1): the 3x storage overhead of replication and the single-region failure domain.
Reed-Solomon erasure coding is the key innovation. Instead of storing 3 complete copies of every block (3x overhead), the system splits each block into 10 data shards and computes 4 parity shards using Reed-Solomon coding — 14 total shards at 1.4x storage overhead. Any 10 of the 14 shards can reconstruct the original data. This means the system tolerates up to 4 simultaneous shard losses (node failures, disk corruption) while maintaining full data availability — superior fault tolerance to 3x replication (which tolerates only 2 failures) at less than half the storage cost.
The numbers make the case compelling. For 1 PB of data: 3x replication requires 3 PB of raw storage at approximately $69,000/month (at $0.023/GB). Erasure coding (10+4) requires only 1.4 PB at approximately $32,200/month — a 53% cost reduction with better fault tolerance. At exabyte scale, this difference is hundreds of millions of dollars annually. Every major cloud provider uses erasure coding for their object storage tier.
The metadata layer uses a 5-node Raft consensus cluster instead of a single database. Raft provides strongly consistent metadata operations even during node failures: the leader processes all writes, replicating to at least 2 followers before acknowledging. This survives up to 2 metadata node failures while maintaining strong consistency. The previous variant's single MetadataDB was a potential single point of failure for metadata; Raft eliminates this vulnerability.
Multipart upload solves the large-object problem. Instead of uploading a 10 GB file as a single monolithic request (fragile, not resumable), the client splits it into parts (64 MB each) that are uploaded independently. Each part is erasure-coded and stored on ChunkServers. Parts can be uploaded in parallel and resumed on failure — only the failed part needs retransmission. The server stitches parts into a complete object only after all parts are verified.
Copy-on-write versioning provides data protection and audit compliance. When an object is overwritten, the previous version's shards are not deleted — a new version with new shards is created, and metadata tracks all versions. This enables instant rollback, audit trails, and protection against accidental deletion. Lifecycle policies automatically delete old versions after a configurable retention period to control storage growth.
Lifecycle tiering automatically moves cold data to cheaper storage. Objects not accessed for 30 days transition to infrequent-access (IA) tier at 40% lower cost. After 90 days without access, they move to cold storage (S3 Glacier equivalent) at 90% lower cost. A 1 PB dataset with typical access patterns costs approximately $6,000/month with lifecycle tiering vs $23,000/month on standard tier — a 74% reduction. The LifecycleWorker continuously evaluates policies and executes transitions.
Multi-region replication provides disaster recovery with RPO under 1 minute. Every write publishes a replication event to Kafka, consumed by a cross-region replicator that copies shards and metadata to the DR region. A full region outage (rare but possible) is survivable with automatic DNS failover to the DR region in approximately 5 minutes.
This is the architecture interviewers expect senior candidates to arrive at after discussing the naive and replication approaches. The progression from single-server to distributed replication to erasure coding demonstrates deepening understanding of storage trade-offs.
The erasure-coded system uses 11 components organized into metadata, encoding, storage, replication, and lifecycle management tiers. The key additions over the replication-based v1 variant are the ErasureEncoder (Reed-Solomon codec), Raft consensus MetadataDB, IndexCache (replacing MetadataCache), ReplicationStream (cross-region Kafka), LifecycleWorker, and ColdStore (glacier-equivalent archive).
The request path starts at the ApiGateway, which authenticates IAM signatures (approximately 3ms) and routes through MainLB to ObjectService. The ObjectService is the orchestration hub: it coordinates metadata operations (IndexCache and MetadataDB), encoding operations (ErasureEncoder), storage operations (ChunkServers), and replication (ReplicationStream). It runs on 50 pods with 200 threads each for 625K sustained RPS. The ApiGateway supports 3M RPS capacity with rate limiting, providing substantial headroom above the 2M peak.
The write path is a multi-step pipeline. ObjectService receives the PUT request, sends data to ErasureEncoder which produces 14 shards (10 data + 4 parity), writes all 14 shards to 14 different ChunkServers in parallel, commits metadata to MetadataDB via Raft leader (replicated to 2+ followers before ack), updates IndexCache for fast subsequent lookups, and publishes a replication event to ReplicationStream. For objects larger than 64 MB, multipart upload is triggered automatically — each part goes through this pipeline independently. The write completes only after the Raft leader confirms the metadata commit, ensuring durability before the client receives a success response.
The read path is optimized for speed. ObjectService checks IndexCache for the shard map (94% hit rate, 2ms). On cache miss, it queries MetadataDB Raft leader for a strongly consistent read (approximately 15ms). With the shard map, it fetches any 10 of the 14 shards from ChunkServers in parallel (approximately 15ms). If all 10 data shards are available, ErasureEncoder concatenates them (approximately 1ms — no decode needed). If some data shards are missing or corrupted, parity shards enable reconstruction via Reed-Solomon decode (5-15ms depending on missing count). The system is resilient to up to 4 simultaneous shard failures per object without any data loss or read degradation.
The MetadataDB is a 5-node PostgreSQL Raft cluster with 256 partitions. The leader handles all writes, replicating to at least 2 followers before acknowledging (majority quorum). This provides strong consistency while tolerating up to 2 node failures. Leader election takes 1-5 seconds on leader failure, during which writes are unavailable but reads can be served from followers with slightly stale data.
IndexCache is a 16-node Redis cluster caching object-to-shard mappings at approximately 150 bytes per entry. With 94% hit rate, it reduces MetadataDB load by 16x. Write-through on PUT ensures read-after-write consistency. The cache uses a 3600-second TTL with LRU eviction, and the single-flight pattern prevents thundering herd on concurrent misses for the same key.
ChunkServers store erasure-coded shards across 256 partitions distributed via consistent hashing. Shards are immutable — overwrites create new shard sets (copy-on-write). Two storage tiers are supported: standard (io2 EBS, high IOPS for frequently accessed data) and infrequent-access (gp3, lower IOPS at reduced cost for data not accessed in 30+ days). Per-shard SHA-256 checksums are verified on every read to detect silent data corruption (bit rot).
ReplicationStream is a 128-partition Kafka cluster handling cross-region replication events. Every write publishes object metadata and shard references. A cross-region consumer replicates to the DR region for disaster recovery (RPO < 1 minute). The Kafka topic is partitioned by object_key to maintain per-object ordering of operations.
LifecycleWorker continuously evaluates lifecycle policies with 20 worker instances processing approximately 5K transitions per second. It transitions cold objects from standard to IA (30 days without access) and from IA to ColdStore (90 days). ColdStore is an S3 Glacier-equivalent archive with retrieval times ranging from minutes (expedited) to hours (bulk). The worker also handles garbage collection of orphaned shards from deleted objects and abandoned multipart uploads, reclaiming storage that would otherwise leak indefinitely.
Choice
Split each block into 10 data shards + 4 parity shards instead of 3x replication
Rationale
Erasure coding achieves 11-nines durability at 1.4x storage overhead versus 3x with replication — a 53% storage cost reduction. For 1 PB of data, this saves approximately $36,800/month. The trade-off is decode latency: reads require fetching 10 shards from different nodes and performing Reed-Solomon decode (5-15ms), compared to reading a single replica (2ms). For cost-sensitive storage at scale, this trade-off is overwhelmingly favorable.
Choice
5-node PostgreSQL Raft cluster instead of single DynamoDB instance
Rationale
The v1 variant's single MetadataDB is a potential single point of failure for metadata operations. Raft consensus replicates every metadata write to at least 2 followers before acknowledging, providing strong consistency and surviving up to 2 node failures. Write latency increases by approximately 10ms for the consensus overhead, but metadata writes are already I/O-bound, so this is acceptable for the fault tolerance gained.
Choice
Split large objects into parts, upload independently, stitch server-side
Rationale
A 10 GB single-request upload is fragile — any network interruption requires full restart. Multipart upload splits it into 64 MB parts: each uploaded independently, resumable on failure, uploadable in parallel. Only the failed part needs retransmission. The server stitches parts after all are verified. This is essential for production file systems handling terabyte-scale objects over unreliable networks.
Choice
Overwrites create new shard sets; previous versions preserved until lifecycle expiry
Rationale
When an object is overwritten, the previous version's shards remain intact. A new version with new shards is created, and metadata tracks the version chain. This enables instant rollback (point to previous version's shards), audit compliance (every version is immutable), and protection against accidental deletion. Storage cost grows with version count, managed by lifecycle policies that auto-delete old versions.
Choice
Automatically move cold data to cheaper storage tiers based on access patterns
Rationale
80% of stored data is rarely accessed after 30 days. Lifecycle tiering moves it from standard storage ($0.023/GB/month) to infrequent-access ($0.014/GB/month) and eventually to glacier ($0.004/GB/month). A 1 PB dataset with 80% cold data costs approximately $6,000/month with tiering versus $23,000/month on standard — a 74% cost reduction. The LifecycleWorker evaluates policies continuously and executes transitions asynchronously.
Choice
Every write publishes to ReplicationStream; cross-region consumer replicates to DR region
Rationale
A full region outage (rare but possible) would cause data unavailability in the single-region v1 approach. Cross-region replication maintains a hot standby in the DR region with RPO under 1 minute (Kafka consumer lag). RTO is approximately 5 minutes (DNS failover plus Raft leader election in DR region). The cost is substantial — doubled storage plus $0.02/GB data transfer — but necessary for mission-critical data.
Target RPS
2M+ ops/sec (metadata + data I/O)
Latency (p99)
<50ms small object reads, <150ms writes (Raft + erasure)
Storage
Exabytes (erasure-coded, lifecycle-tiered)
Availability
99.999% (multi-region, Raft consensus)
| Operation | Time | Space | Notes |
|---|---|---|---|
| GET object (small, cache hit, no decode needed) | O(1) cache + O(10) parallel shard reads + O(1) concatenation | O(1.4 * B) shard data fetched, O(B) reconstructed object | Best case: all 10 data shards available, no parity decode needed. Cache hit (2ms) + 10 parallel reads (15ms) + concat (1ms) = approximately 18ms. |
| GET object (cache miss, parity decode needed) | O(1) DB read + O(10) parallel shard reads + O(S) Reed-Solomon decode | O(1.4 * B) shard data + O(B) decode buffer | Worst case: cache miss (15ms) + 10 shard reads (15ms) + RS decode (15ms) = approximately 45ms. Still sub-50ms. |
| PUT object (erasure encode + Raft commit) | O(S) encode + O(14) parallel shard writes + O(1) Raft commit | O(1.4 * B) total shard storage | Encode (10ms) + 14 parallel writes (20ms) + Raft (15ms) + cache update (2ms) = approximately 50ms. Multi-region add 5ms Kafka publish. |
| Lifecycle transition (standard to glacier) | O(14) shard reads + O(1) archive write + O(1) metadata update | O(1.4 * B) temporary during transition | Background operation by LifecycleWorker. Not latency-sensitive. Approximately 100ms per object transition. |
Source of truth for object-to-shard mappings in a 5-node Raft consensus cluster. Each row maps an object key + version to its 14 shard locations, size, checksum, storage class, and lifecycle state. Strong consistency via Raft ensures read-after-write guarantees. Versioned: each PUT creates a new row with a new version_id.
256 partitions across 5 Raft nodes. Raft leader handles writes; reads go to leader for strong consistency. Log-normal latency distribution.
Stores erasure-coded shards across distributed storage nodes. Each object produces 14 shards (10 data + 4 parity) via Reed-Solomon coding. Shards are immutable — overwrites create new shard sets. Per-shard SHA-256 checksums detect bit rot on every read.
256 partitions via consistent hashing. Immutable shards — no in-place updates. Two storage tiers: standard (io2) and IA (gp3).
Cold storage archive for lifecycle-tiered objects. Shards are merged into archive bundles for efficient long-term storage. Retrieval requires an explicit restore request with configurable speed (bulk: 12 hours, standard: 5 hours, expedited: 5 minutes).
Storage cost approximately $0.004/GB/month — 83% cheaper than standard tier. Retrieval incurs additional per-GB charges.
Cached object-to-shard mappings for fast metadata lookups. 16-node Redis cluster with 94% hit rate, reducing MetadataDB Raft leader load by 16x. Write-through on PUT ensures read-after-write consistency.
Approximately 150 bytes per entry. 2B cached entries is approximately 300 GB. 3600-second TTL with LRU eviction.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| Naive (Single NFS-Like Server) | T1 | 50-300ms small files, minutes for large files | ~400 ops/sec (single disk I/O ceiling) | $300/month (single server + RDS) | Minimal — 3 components, no distribution | 99% (single disk, no redundancy) |
| Metadata + Block Storage (HDFS/S3 Pattern) | T2 | <100ms small objects, seconds for large objects | 650K ops/sec (distributed I/O) | $5,000/month (8 components, 3x replication) | Medium — metadata/data separation, repair pipeline | 99.99% (3x replication, 11-nines durability) |
| Erasure-Coded Multi-Region Storage | T3 | <50ms small objects (erasure decode adds 5-15ms) | 2M ops/sec (erasure + multi-region) | $8,000/month (11 components, 1.4x storage + cross-region) | High — Raft, erasure coding, lifecycle tiering | 99.999% (erasure coding, multi-region DR) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Reed-Solomon (10+4) splits data into 10 data shards and 4 parity shards. Any 10 of 14 shards can reconstruct the original data — the system tolerates 4 simultaneous shard losses. With 3x replication, data survives only 2 simultaneous losses. Erasure coding actually provides better fault tolerance (4 vs 2 failures) at 1.4x storage versus 3x. The mathematical guarantee comes from the Reed-Solomon algebraic structure: 4 parity shards encode enough redundant information to recover any 4 missing shards from the remaining 10.
With 3x replication, a read fetches data from the nearest single replica in approximately 2ms (one network hop, one disk read). With erasure coding, a read fetches 10 shards from 10 different ChunkServers in parallel (approximately 15ms for the slowest shard) and decodes them (1-15ms depending on whether parity reconstruction is needed). Total: approximately 2ms for replication versus 20-30ms for erasure coding. This 10-15x latency difference is why production systems use replication for hot data and erasure coding for warm or cold data.
Every metadata write must be replicated to at least 2 of 4 followers before the Raft leader acknowledges — adding approximately 10ms to write latency (two network round-trips within the cluster). During normal operation, the leader batches multiple writes into a single Raft log entry to amortize the consensus overhead. During leader elections (triggered by leader failure, lasting 1-5 seconds), writes are unavailable. The 5-node cluster tolerates up to 2 simultaneous node failures while maintaining majority quorum.
Standard storage costs $0.023/GB/month. Infrequent-access (IA) costs $0.014/GB/month (40% savings). Glacier Deep Archive costs $0.004/GB/month (83% savings). For a 1 PB dataset where 10% is hot (standard), 10% is warm (IA), and 80% is cold (glacier): standard-only cost is $23,000/month; tiered cost is $2,300 + $1,400 + $3,200 = $6,900/month — a 70% reduction. With typical access patterns and aggressive lifecycle policies, savings reach 74% or more.
A 10 GB single-request upload has a failure probability proportional to the transfer duration (minutes). Any interruption requires restarting the entire upload. Multipart upload splits the object into 160 parts of 64 MB each. Each part is uploaded independently in approximately 1 second. If a part fails, only that 64 MB part is retried — not the entire 10 GB. Parts can be uploaded in parallel (e.g., 10 at a time) for 10x throughput improvement. The server tracks in-progress multipart uploads and stitches parts after all are verified. Abandoned uploads are cleaned up by LifecycleWorker.
Every write is replicated to the DR region via ReplicationStream (Kafka) with RPO under 1 minute. On region failure: (1) monitoring detects the outage within 30 seconds, (2) DNS is updated to point to the DR region's ApiGateway within 2 minutes, (3) the DR region's Raft cluster elects a new leader within 5 seconds, (4) clients transparently failover via DNS. Total RTO is approximately 5 minutes. Data written to the primary region within the last minute of the outage (RPO window) may not be available in DR until the primary region recovers.
Sign in to join the discussion.
Ready to design your own Distributed File System?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator