Hard9 componentsInterview: High

Distributed Filesystem — Metadata + Block Storage

Q: What does 11-nines durability mean and how is it achieved?

Eleven nines of durability (99.999999999%) means that if you store 10 billion objects, you would statistically expect to lose fewer than one object over the course of a year. This level of durability is achieved through 3x replication across independent fault domains combined with continuous background repair. The RepairWorker detects under-replicated blocks (from node failures or data corruption) and creates new replicas before a second failure can cause data loss. The probability of losing all three replicas before repair completes is astronomically low.

Q: Why are metadata and data stored separately instead of together?

Metadata (object key, block list, checksums) is roughly 100 bytes per object, while the actual data can range from bytes to terabytes. Even for trillions of objects, all metadata fits in about 100TB and can be served from memory-backed caches. Storing metadata alongside data would mean that lightweight operations like listing objects or checking permissions would compete with heavy data I/O for disk bandwidth and network resources. Separating them allows each plane to be optimized, scaled, and operated independently.

Q: How does the system guarantee read-after-write consistency?

When MetadataService writes a new object, it stores the metadata in both MetadataDB (using DynamoDB's strong consistency mode) and MetadataCache (Redis, via write-through). Any subsequent read hits either the cache (which already has the entry) or the database (which has it via strong consistency). On the data plane, the write API returns success only after all three block replicas acknowledge receipt. This means a client that receives a successful PUT response is guaranteed to read back the same data immediately.

Q: What happens when a storage node fails?

When a storage node fails, all blocks stored on that node drop from 3 replicas to 2. The RepairWorker detects this through missing heartbeats and begins re-replication: it reads each affected block from one of the two healthy replicas and writes a new copy to a different healthy node, then updates MetadataDB with the new replica location. Repair is asynchronous and prioritized, so blocks with only 1 remaining replica are repaired first. During the repair window, reads continue to be served from the remaining healthy replicas with no user-visible impact.

Q: Why is Kafka used for audit events instead of direct repair triggers?

Kafka provides a durable, ordered event log that serves a dual purpose. First, every write operation is recorded for compliance auditing (who stored what, when, and where). Second, the RepairWorker consumes these events to track newly written blocks that need health verification. Using Kafka instead of direct triggers decouples the write path from the repair system, meaning a slow or failing repair process never blocks client-facing writes. The event stream also provides natural buffering during write traffic spikes.

Design an S3-like distributed object storage system with separated metadata and data planes, 3x block replication across fault domains, and continuous repair for 11-nines durability at exabyte scale.

StorageReplicationHDFS PatternDurability

Try in Simulator

Problem Statement

Distributed file storage is a cornerstone of modern cloud infrastructure, underpinning services from Amazon S3 to Google Cloud Storage to HDFS-based data lakes. It is a frequent system design interview topic at companies that operate large-scale storage platforms, because it tests a candidate's understanding of data durability, replication strategies, metadata management, and the trade-offs between consistency and availability at exabyte scale.

The central challenge is achieving extreme durability (11 nines, or 99.999999999%) while maintaining high throughput and low latency for diverse object sizes ranging from tiny configuration files to multi-terabyte data lake partitions. A single bit flip or disk failure must never result in data loss, which requires replicating every block of data across multiple independent fault domains and continuously monitoring for and repairing any degradation.

The architecture must separate the metadata plane from the data plane because they have fundamentally different characteristics. Metadata is small (roughly 100 bytes per object: key, block list, checksums) and fits in memory even for trillions of objects. Data is massive (exabytes) and must be distributed across hundreds or thousands of storage nodes. Coupling these two planes would mean that metadata operations (list, head, permission checks) compete with bulk data I/O, degrading performance for both. This separation is the same pattern used by HDFS (NameNode vs DataNode) and Amazon S3 internally.

At production scale, the system must handle 500K metadata reads per second, 100K block reads per second, and 50K writes per second at peak. Objects are split into 64MB blocks, each replicated three times across different failure domains. A background repair worker continuously scans for under-replicated blocks and re-replicates them to maintain durability guarantees. The interview version of this problem expects candidates to address block placement strategy, consistency guarantees (read-after-write), garbage collection of deleted objects, and the operational complexity of running a storage system at this scale.

Architecture Overview

The architecture follows the HDFS/S3 pattern of separating the lightweight metadata plane from the heavy data plane. All client requests enter through the API Gateway (IAM-style authentication, per-bucket rate limiting) and load balancer, then route to the MetadataService, which serves as the coordination layer for all storage operations.

For object reads, MetadataService first checks MetadataCache (a 12-node Redis cluster) for the object-to-block mapping. With a 95% cache hit rate, most metadata lookups complete in approximately 2ms. On cache miss, MetadataService falls back to MetadataDB (DynamoDB with strong consistency, 128 partitions, 3 replicas), which responds in approximately 15ms. The response includes block IDs and replica locations, and the client then reads block data directly from the nearest BlockStore replica, bypassing MetadataService for the data plane transfer.

For object writes, MetadataService allocates block IDs, selects three storage nodes across different fault domains for replicas, writes the object-to-block mapping to both MetadataDB (durable) and MetadataCache (for immediate read-after-write consistency), and returns the block locations to the client. The client then writes 64MB block data directly to the three BlockStore replicas in parallel. The API returns success only after all three replicas acknowledge, ensuring durability before the caller proceeds. Every write operation also generates an audit event in the Kafka-based AuditStream for compliance and repair tracking.

The RepairWorker is the durability guarantee engine. Running continuously with 30 worker instances, it consumes audit events from AuditStream, periodically checks block health via checksum verification (SHA-256), and re-replicates any under-replicated blocks. When a storage node fails, the worker reads a healthy replica from BlockStore and writes a new replica to a different node, then updates MetadataDB with the new location. Repair is prioritized by urgency: blocks with only one remaining replica are repaired before blocks with two replicas. This asynchronous approach spreads repair load over hours rather than creating an I/O burst that would degrade client-facing performance.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

Separated Metadata and Data Planes

Choice

MetadataService + MetadataDB for mappings; BlockStore for data

Rationale

Metadata is small (roughly 100 bytes per object) and fits in memory even for trillions of objects, while data is massive (exabytes) and distributed across storage nodes. Separating the planes means metadata operations like list, head, and permission checks are fast and never compete with bulk data I/O. This is the HDFS NameNode/DataNode pattern that has been proven at exabyte scale.

3x Block Replication over Erasure Coding

Choice

Every 64MB block stored as 3 full replicas across different fault domains

Rationale

Replication provides the fastest reads (any replica, no decode step) and fastest writes (parallel replication, no encoding). For hot, frequently accessed data, the 3x storage overhead is justified by superior read performance. Erasure coding (1.4x overhead) is better suited for cold data tiers where storage cost matters more than access latency.

Asynchronous Repair over Synchronous Re-Replication

Choice

RepairWorker continuously monitors and re-replicates in the background

Rationale

When a storage node fails, it may hold 100TB or more of block data. Synchronously re-replicating all blocks immediately would create a massive I/O burst that degrades client-facing performance. The RepairWorker spreads repair load over hours, prioritizing critically under-replicated blocks (1 remaining replica) over those with 2 replicas, preventing repair storms.

Redis Metadata Cache with Write-Through

Choice

MetadataCache (Redis) in front of MetadataDB with write-through on PUT

Rationale

At 500K metadata reads per second, hitting DynamoDB directly would require massive read IOPS. Redis caches the hot working set with a 95% hit rate, reducing database load to roughly 25K reads per second. Write-through on PUT ensures strong read-after-write consistency: a newly stored object is immediately visible in the cache without waiting for cache population.

Scale & Performance

Target RPS

650,000 peak (500K metadata reads, 100K block reads, 50K writes)

Latency (p99)

< 100ms p99 for small object reads; < 200ms p99 for writes (3x replica ack)

Storage

Exabytes of block data (3x replicated); 100+ TB metadata in DynamoDB

Availability

99.99% — 11-nines durability via 3x replication and continuous repair

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

What does 11-nines durability mean and how is it achieved?

Eleven nines of durability (99.999999999%) means that if you store 10 billion objects, you would statistically expect to lose fewer than one object over the course of a year. This level of durability is achieved through 3x replication across independent fault domains combined with continuous background repair. The RepairWorker detects under-replicated blocks (from node failures or data corruption) and creates new replicas before a second failure can cause data loss. The probability of losing all three replicas before repair completes is astronomically low.

Why are metadata and data stored separately instead of together?

Metadata (object key, block list, checksums) is roughly 100 bytes per object, while the actual data can range from bytes to terabytes. Even for trillions of objects, all metadata fits in about 100TB and can be served from memory-backed caches. Storing metadata alongside data would mean that lightweight operations like listing objects or checking permissions would compete with heavy data I/O for disk bandwidth and network resources. Separating them allows each plane to be optimized, scaled, and operated independently.

How does the system guarantee read-after-write consistency?

When MetadataService writes a new object, it stores the metadata in both MetadataDB (using DynamoDB's strong consistency mode) and MetadataCache (Redis, via write-through). Any subsequent read hits either the cache (which already has the entry) or the database (which has it via strong consistency). On the data plane, the write API returns success only after all three block replicas acknowledge receipt. This means a client that receives a successful PUT response is guaranteed to read back the same data immediately.

What happens when a storage node fails?

When a storage node fails, all blocks stored on that node drop from 3 replicas to 2. The RepairWorker detects this through missing heartbeats and begins re-replication: it reads each affected block from one of the two healthy replicas and writes a new copy to a different healthy node, then updates MetadataDB with the new replica location. Repair is asynchronous and prioritized, so blocks with only 1 remaining replica are repaired first. During the repair window, reads continue to be served from the remaining healthy replicas with no user-visible impact.

Why is Kafka used for audit events instead of direct repair triggers?

Kafka provides a durable, ordered event log that serves a dual purpose. First, every write operation is recorded for compliance auditing (who stored what, when, and where). Second, the RepairWorker consumes these events to track newly written blocks that need health verification. Using Kafka instead of direct triggers decouples the write path from the repair system, meaning a slow or failing repair process never blocks client-facing writes. The event stream also provides natural buffering during write traffic spikes.

Related Templates

Dropbox Video Streaming Logging Pipeline

Discussion

Ready to design your own Distributed Filesystem?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator