Vetora logo
Hard11 componentsInterview: High

Distributed File System — Erasure-Coded Multi-Region Storage

Production-grade S3-equivalent: Reed-Solomon (10+4) erasure coding for 11-nines durability at 1.4x storage overhead, Raft consensus for strongly consistent metadata, multipart upload, copy-on-write versioning, lifecycle tiering (standard/IA/glacier), and multi-region replication for disaster recovery.

StorageErasure CodingRaft ConsensusMulti-RegionProduction-Grade
Problem Statement

The erasure-coded multi-region architecture represents the production-grade solution to distributed file storage — the architecture behind Amazon S3, Google Cloud Storage, and Azure Blob Storage. It solves two critical limitations of the metadata-plus-replication approach (v1): the 3x storage overhead of replication and the single-region failure domain.

Reed-Solomon erasure coding is the key innovation. Instead of storing 3 complete copies of every block (3x overhead), the system splits each block into 10 data shards and computes 4 parity shards using Reed-Solomon coding — 14 total shards at 1.4x storage overhead. Any 10 of the 14 shards can reconstruct the original data. This means the system tolerates up to 4 simultaneous shard losses (node failures, disk corruption) while maintaining full data availability — superior fault tolerance to 3x replication (which tolerates only 2 failures) at less than half the storage cost.

The numbers make the case compelling. For 1 PB of data: 3x replication requires 3 PB of raw storage at approximately $69,000/month (at $0.023/GB). Erasure coding (10+4) requires only 1.4 PB at approximately $32,200/month — a 53% cost reduction with better fault tolerance. At exabyte scale, this difference is hundreds of millions of dollars annually. Every major cloud provider uses erasure coding for their object storage tier.

The metadata layer uses a 5-node Raft consensus cluster instead of a single database. Raft provides strongly consistent metadata operations even during node failures: the leader processes all writes, replicating to at least 2 followers before acknowledging. This survives up to 2 metadata node failures while maintaining strong consistency. The previous variant's single MetadataDB was a potential single point of failure for metadata; Raft eliminates this vulnerability.

Multipart upload solves the large-object problem. Instead of uploading a 10 GB file as a single monolithic request (fragile, not resumable), the client splits it into parts (64 MB each) that are uploaded independently. Each part is erasure-coded and stored on ChunkServers. Parts can be uploaded in parallel and resumed on failure — only the failed part needs retransmission. The server stitches parts into a complete object only after all parts are verified.

Copy-on-write versioning provides data protection and audit compliance. When an object is overwritten, the previous version's shards are not deleted — a new version with new shards is created, and metadata tracks all versions. This enables instant rollback, audit trails, and protection against accidental deletion. Lifecycle policies automatically delete old versions after a configurable retention period to control storage growth.

Lifecycle tiering automatically moves cold data to cheaper storage. Objects not accessed for 30 days transition to infrequent-access (IA) tier at 40% lower cost. After 90 days without access, they move to cold storage (S3 Glacier equivalent) at 90% lower cost. A 1 PB dataset with typical access patterns costs approximately $6,000/month with lifecycle tiering vs $23,000/month on standard tier — a 74% reduction. The LifecycleWorker continuously evaluates policies and executes transitions.

Multi-region replication provides disaster recovery with RPO under 1 minute. Every write publishes a replication event to Kafka, consumed by a cross-region replicator that copies shards and metadata to the DR region. A full region outage (rare but possible) is survivable with automatic DNS failover to the DR region in approximately 5 minutes.

This is the architecture interviewers expect senior candidates to arrive at after discussing the naive and replication approaches. The progression from single-server to distributed replication to erasure coding demonstrates deepening understanding of storage trade-offs.

Architecture Overview

The erasure-coded system uses 11 components organized into metadata, encoding, storage, replication, and lifecycle management tiers. The key additions over the replication-based v1 variant are the ErasureEncoder (Reed-Solomon codec), Raft consensus MetadataDB, IndexCache (replacing MetadataCache), ReplicationStream (cross-region Kafka), LifecycleWorker, and ColdStore (glacier-equivalent archive).

The request path starts at the ApiGateway, which authenticates IAM signatures (approximately 3ms) and routes through MainLB to ObjectService. The ObjectService is the orchestration hub: it coordinates metadata operations (IndexCache and MetadataDB), encoding operations (ErasureEncoder), storage operations (ChunkServers), and replication (ReplicationStream). It runs on 50 pods with 200 threads each for 625K sustained RPS. The ApiGateway supports 3M RPS capacity with rate limiting, providing substantial headroom above the 2M peak.

The write path is a multi-step pipeline. ObjectService receives the PUT request, sends data to ErasureEncoder which produces 14 shards (10 data + 4 parity), writes all 14 shards to 14 different ChunkServers in parallel, commits metadata to MetadataDB via Raft leader (replicated to 2+ followers before ack), updates IndexCache for fast subsequent lookups, and publishes a replication event to ReplicationStream. For objects larger than 64 MB, multipart upload is triggered automatically — each part goes through this pipeline independently. The write completes only after the Raft leader confirms the metadata commit, ensuring durability before the client receives a success response.

The read path is optimized for speed. ObjectService checks IndexCache for the shard map (94% hit rate, 2ms). On cache miss, it queries MetadataDB Raft leader for a strongly consistent read (approximately 15ms). With the shard map, it fetches any 10 of the 14 shards from ChunkServers in parallel (approximately 15ms). If all 10 data shards are available, ErasureEncoder concatenates them (approximately 1ms — no decode needed). If some data shards are missing or corrupted, parity shards enable reconstruction via Reed-Solomon decode (5-15ms depending on missing count). The system is resilient to up to 4 simultaneous shard failures per object without any data loss or read degradation.

The MetadataDB is a 5-node PostgreSQL Raft cluster with 256 partitions. The leader handles all writes, replicating to at least 2 followers before acknowledging (majority quorum). This provides strong consistency while tolerating up to 2 node failures. Leader election takes 1-5 seconds on leader failure, during which writes are unavailable but reads can be served from followers with slightly stale data.

IndexCache is a 16-node Redis cluster caching object-to-shard mappings at approximately 150 bytes per entry. With 94% hit rate, it reduces MetadataDB load by 16x. Write-through on PUT ensures read-after-write consistency. The cache uses a 3600-second TTL with LRU eviction, and the single-flight pattern prevents thundering herd on concurrent misses for the same key.

ChunkServers store erasure-coded shards across 256 partitions distributed via consistent hashing. Shards are immutable — overwrites create new shard sets (copy-on-write). Two storage tiers are supported: standard (io2 EBS, high IOPS for frequently accessed data) and infrequent-access (gp3, lower IOPS at reduced cost for data not accessed in 30+ days). Per-shard SHA-256 checksums are verified on every read to detect silent data corruption (bit rot).

ReplicationStream is a 128-partition Kafka cluster handling cross-region replication events. Every write publishes object metadata and shard references. A cross-region consumer replicates to the DR region for disaster recovery (RPO < 1 minute). The Kafka topic is partitioned by object_key to maintain per-object ordering of operations.

LifecycleWorker continuously evaluates lifecycle policies with 20 worker instances processing approximately 5K transitions per second. It transitions cold objects from standard to IA (30 days without access) and from IA to ColdStore (90 days). ColdStore is an S3 Glacier-equivalent archive with retrieval times ranging from minutes (expedited) to hours (bulk). The worker also handles garbage collection of orphaned shards from deleted objects and abandoned multipart uploads, reclaiming storage that would otherwise leak indefinitely.

Architecture Preview
Loading architecture preview...
Key Design Decisions
Reed-Solomon (10+4) Erasure Coding

Choice

Split each block into 10 data shards + 4 parity shards instead of 3x replication

Rationale

Erasure coding achieves 11-nines durability at 1.4x storage overhead versus 3x with replication — a 53% storage cost reduction. For 1 PB of data, this saves approximately $36,800/month. The trade-off is decode latency: reads require fetching 10 shards from different nodes and performing Reed-Solomon decode (5-15ms), compared to reading a single replica (2ms). For cost-sensitive storage at scale, this trade-off is overwhelmingly favorable.

Raft Consensus for MetadataDB

Choice

5-node PostgreSQL Raft cluster instead of single DynamoDB instance

Rationale

The v1 variant's single MetadataDB is a potential single point of failure for metadata operations. Raft consensus replicates every metadata write to at least 2 followers before acknowledging, providing strong consistency and surviving up to 2 node failures. Write latency increases by approximately 10ms for the consensus overhead, but metadata writes are already I/O-bound, so this is acceptable for the fault tolerance gained.

Multipart Upload Pipeline

Choice

Split large objects into parts, upload independently, stitch server-side

Rationale

A 10 GB single-request upload is fragile — any network interruption requires full restart. Multipart upload splits it into 64 MB parts: each uploaded independently, resumable on failure, uploadable in parallel. Only the failed part needs retransmission. The server stitches parts after all are verified. This is essential for production file systems handling terabyte-scale objects over unreliable networks.

Copy-on-Write Versioning

Choice

Overwrites create new shard sets; previous versions preserved until lifecycle expiry

Rationale

When an object is overwritten, the previous version's shards remain intact. A new version with new shards is created, and metadata tracks the version chain. This enables instant rollback (point to previous version's shards), audit compliance (every version is immutable), and protection against accidental deletion. Storage cost grows with version count, managed by lifecycle policies that auto-delete old versions.

Lifecycle Tiering (Standard, IA, Glacier)

Choice

Automatically move cold data to cheaper storage tiers based on access patterns

Rationale

80% of stored data is rarely accessed after 30 days. Lifecycle tiering moves it from standard storage ($0.023/GB/month) to infrequent-access ($0.014/GB/month) and eventually to glacier ($0.004/GB/month). A 1 PB dataset with 80% cold data costs approximately $6,000/month with tiering versus $23,000/month on standard — a 74% cost reduction. The LifecycleWorker evaluates policies continuously and executes transitions asynchronously.

Multi-Region Replication via Kafka

Choice

Every write publishes to ReplicationStream; cross-region consumer replicates to DR region

Rationale

A full region outage (rare but possible) would cause data unavailability in the single-region v1 approach. Cross-region replication maintains a hot standby in the DR region with RPO under 1 minute (Kafka consumer lag). RTO is approximately 5 minutes (DNS failover plus Raft leader election in DR region). The cost is substantial — doubled storage plus $0.02/GB data transfer — but necessary for mission-critical data.

Scale & Performance

Target RPS

2M+ ops/sec (metadata + data I/O)

Latency (p99)

<50ms small object reads, <150ms writes (Raft + erasure)

Storage

Exabytes (erasure-coded, lifecycle-tiered)

Availability

99.999% (multi-region, Raft consensus)

Time & Space Complexity
OperationTimeSpaceNotes
GET object (small, cache hit, no decode needed)O(1) cache + O(10) parallel shard reads + O(1) concatenationO(1.4 * B) shard data fetched, O(B) reconstructed objectBest case: all 10 data shards available, no parity decode needed. Cache hit (2ms) + 10 parallel reads (15ms) + concat (1ms) = approximately 18ms.
GET object (cache miss, parity decode needed)O(1) DB read + O(10) parallel shard reads + O(S) Reed-Solomon decodeO(1.4 * B) shard data + O(B) decode bufferWorst case: cache miss (15ms) + 10 shard reads (15ms) + RS decode (15ms) = approximately 45ms. Still sub-50ms.
PUT object (erasure encode + Raft commit)O(S) encode + O(14) parallel shard writes + O(1) Raft commitO(1.4 * B) total shard storageEncode (10ms) + 14 parallel writes (20ms) + Raft (15ms) + cache update (2ms) = approximately 50ms. Multi-region add 5ms Kafka publish.
Lifecycle transition (standard to glacier)O(14) shard reads + O(1) archive write + O(1) metadata updateO(1.4 * B) temporary during transitionBackground operation by LifecycleWorker. Not latency-sensitive. Approximately 100ms per object transition.
Database Schema (HLD)
object_metadata (Raft PostgreSQL)

Source of truth for object-to-shard mappings in a 5-node Raft consensus cluster. Each row maps an object key + version to its 14 shard locations, size, checksum, storage class, and lifecycle state. Strong consistency via Raft ensures read-after-write guarantees. Versioned: each PUT creates a new row with a new version_id.

object_key VARCHAR (bucket/key composite, partition key)version_id VARCHAR (UUID, unique per version)shard_ids ARRAY (14 shard IDs: 10 data + 4 parity)shard_locations ARRAY (ChunkServer node IDs per shard)size_bytes BIGINT (original object size)checksum VARCHAR (SHA-256 of original data)storage_class VARCHAR (STANDARD, IA, or GLACIER)is_latest BOOLEAN (true for current version)deleted BOOLEAN (tombstone flag for GC)lifecycle_state VARCHAR (current tier)created_at TIMESTAMPTZ (version creation timestamp)last_accessed_at TIMESTAMPTZ (for lifecycle evaluation)

256 partitions across 5 Raft nodes. Raft leader handles writes; reads go to leader for strong consistency. Log-normal latency distribution.

erasure_shards (ChunkServers / EBS io2)

Stores erasure-coded shards across distributed storage nodes. Each object produces 14 shards (10 data + 4 parity) via Reed-Solomon coding. Shards are immutable — overwrites create new shard sets. Per-shard SHA-256 checksums detect bit rot on every read.

shard_id VARCHAR (composite: object_key:version:shard_index)object_key VARCHAR (parent object key)version_id VARCHAR (object version)shard_index NUMBER (0-13, where 0-9 are data, 10-13 are parity)shard_type VARCHAR (DATA or PARITY)data BLOB (shard data, 1/10th of object size for data shards)checksum VARCHAR (SHA-256 of shard data)storage_tier VARCHAR (STANDARD or IA)created_at TIMESTAMPTZ (write timestamp)

256 partitions via consistent hashing. Immutable shards — no in-place updates. Two storage tiers: standard (io2) and IA (gp3).

archived_shards (ColdStore / S3 Glacier)

Cold storage archive for lifecycle-tiered objects. Shards are merged into archive bundles for efficient long-term storage. Retrieval requires an explicit restore request with configurable speed (bulk: 12 hours, standard: 5 hours, expedited: 5 minutes).

archive_id VARCHAR (archive bundle ID)object_key VARCHAR (original object key)version_id VARCHAR (object version)shard_bundle BLOB (merged shard data)shard_mapping ARRAY (original shard index mapping)archived_at TIMESTAMPTZ (archive timestamp)retention_until TIMESTAMPTZ (retention lock expiry)

Storage cost approximately $0.004/GB/month — 83% cheaper than standard tier. Retrieval incurs additional per-GB charges.

idx:{bucket}:{key} (IndexCache / Redis)

Cached object-to-shard mappings for fast metadata lookups. 16-node Redis cluster with 94% hit rate, reducing MetadataDB Raft leader load by 16x. Write-through on PUT ensures read-after-write consistency.

version_id VARCHAR (current version ID)shard_ids ARRAY (14 shard IDs)shard_locations ARRAY (ChunkServer node IDs)size_bytes NUMBER (object size)checksum VARCHAR (SHA-256)storage_class VARCHAR (storage tier)created_at VARCHAR (ISO timestamp)

Approximately 150 bytes per entry. 2B cached entries is approximately 300 GB. 3600-second TTL with LRU eviction.

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
Naive (Single NFS-Like Server)T150-300ms small files, minutes for large files~400 ops/sec (single disk I/O ceiling)$300/month (single server + RDS)Minimal — 3 components, no distribution99% (single disk, no redundancy)
Metadata + Block Storage (HDFS/S3 Pattern)T2<100ms small objects, seconds for large objects650K ops/sec (distributed I/O)$5,000/month (8 components, 3x replication)Medium — metadata/data separation, repair pipeline99.99% (3x replication, 11-nines durability)
Erasure-Coded Multi-Region StorageT3<50ms small objects (erasure decode adds 5-15ms)2M ops/sec (erasure + multi-region)$8,000/month (11 components, 1.4x storage + cross-region)High — Raft, erasure coding, lifecycle tiering99.999% (erasure coding, multi-region DR)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does erasure coding achieve the same durability as 3x replication with less storage?

Reed-Solomon (10+4) splits data into 10 data shards and 4 parity shards. Any 10 of 14 shards can reconstruct the original data — the system tolerates 4 simultaneous shard losses. With 3x replication, data survives only 2 simultaneous losses. Erasure coding actually provides better fault tolerance (4 vs 2 failures) at 1.4x storage versus 3x. The mathematical guarantee comes from the Reed-Solomon algebraic structure: 4 parity shards encode enough redundant information to recover any 4 missing shards from the remaining 10.

What is the read latency trade-off between replication and erasure coding?

With 3x replication, a read fetches data from the nearest single replica in approximately 2ms (one network hop, one disk read). With erasure coding, a read fetches 10 shards from 10 different ChunkServers in parallel (approximately 15ms for the slowest shard) and decodes them (1-15ms depending on whether parity reconstruction is needed). Total: approximately 2ms for replication versus 20-30ms for erasure coding. This 10-15x latency difference is why production systems use replication for hot data and erasure coding for warm or cold data.

How does Raft consensus affect metadata write performance?

Every metadata write must be replicated to at least 2 of 4 followers before the Raft leader acknowledges — adding approximately 10ms to write latency (two network round-trips within the cluster). During normal operation, the leader batches multiple writes into a single Raft log entry to amortize the consensus overhead. During leader elections (triggered by leader failure, lasting 1-5 seconds), writes are unavailable. The 5-node cluster tolerates up to 2 simultaneous node failures while maintaining majority quorum.

How does lifecycle tiering save 74% on storage costs?

Standard storage costs $0.023/GB/month. Infrequent-access (IA) costs $0.014/GB/month (40% savings). Glacier Deep Archive costs $0.004/GB/month (83% savings). For a 1 PB dataset where 10% is hot (standard), 10% is warm (IA), and 80% is cold (glacier): standard-only cost is $23,000/month; tiered cost is $2,300 + $1,400 + $3,200 = $6,900/month — a 70% reduction. With typical access patterns and aggressive lifecycle policies, savings reach 74% or more.

How does multipart upload improve reliability for large objects?

A 10 GB single-request upload has a failure probability proportional to the transfer duration (minutes). Any interruption requires restarting the entire upload. Multipart upload splits the object into 160 parts of 64 MB each. Each part is uploaded independently in approximately 1 second. If a part fails, only that 64 MB part is retried — not the entire 10 GB. Parts can be uploaded in parallel (e.g., 10 at a time) for 10x throughput improvement. The server tracks in-progress multipart uploads and stitches parts after all are verified. Abandoned uploads are cleaned up by LifecycleWorker.

What happens during a full region outage?

Every write is replicated to the DR region via ReplicationStream (Kafka) with RPO under 1 minute. On region failure: (1) monitoring detects the outage within 30 seconds, (2) DNS is updated to point to the DR region's ApiGateway within 2 minutes, (3) the DR region's Raft cluster elects a new leader within 5 seconds, (4) clients transparently failover via DNS. Total RTO is approximately 5 minutes. Data written to the primary region within the last minute of the outage (RPO window) may not be available in DR until the primary region recovers.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Distributed File System?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator