Hard8 componentsInterview: Very High

Distributed File System — Metadata + Distributed Block Storage

Q: Why is the MetadataService a potential bottleneck?

All operations route through MetadataService for metadata lookup or mutation. At extreme scale (100M+ objects/sec), the metadata layer becomes the bottleneck even with Redis caching. HDFS's original NameNode had exactly this limitation — a single process managing the entire namespace. The solution is metadata federation: multiple MetadataService instances, each owning a namespace partition (e.g., by bucket). The v2 variant uses a Raft consensus cluster for distributed metadata management.

Q: Why 3x replication instead of erasure coding?

3x replication provides the fastest reads (any single replica, no decode) and fastest writes (parallel replicate, no encode computation). For hot data accessed frequently, replication's 3x storage overhead is justified by read performance — a single replica read in 2ms vs erasure decode in 15ms. Erasure coding (Reed-Solomon 10+4 at 1.4x overhead) is better for cold data where the storage savings outweigh the decode latency cost. The v2 variant implements erasure coding with tiered storage.

Q: How does the RepairWorker maintain 11-nines durability?

The RepairWorker runs continuously, checking block health via heartbeats from storage nodes. When a node fails or a block's checksum does not match (bit rot), the worker reads a healthy replica from a surviving node, writes a new replica to a different node in a different fault domain, and updates MetadataDB with the new replica location. Blocks with only 1 remaining replica are repaired with highest priority. With a mean time to repair of hours and a mean time between failures of months per node, the probability of losing all 3 replicas before repair is approximately 10^-11 — hence 11-nines durability.

Q: What happens during a node failure?

When a BlockStore node fails, blocks on that node drop from 3 replicas to 2. Client reads continue from the surviving 2 replicas with zero downtime — the MetadataService returns replica locations, and the client reads from any available replica. Meanwhile, RepairWorker detects the under-replication (via missing heartbeats and audit stream events) and begins re-replicating affected blocks to new nodes. The repair process is gradual to avoid I/O storms: approximately 1K blocks/sec, completing a full node's repair in hours.

Q: How does the block model enable parallel I/O?

A 640 MB object is split into 10 blocks of 64 MB each, distributed across 10 different storage nodes. On download, the client can read all 10 blocks in parallel from 10 different nodes — achieving 10x the throughput of a single-node sequential read. On upload, the client writes each block to 3 different nodes in parallel. This parallelism scales linearly with the number of blocks and storage nodes, which is why distributed file systems can achieve terabytes/sec of aggregate throughput.

Industry-standard HDFS/S3 pattern: a MetadataService maps object keys to block locations while distributed BlockStore nodes hold the actual data. Objects split into 64 MB blocks, each replicated 3x across fault domains. MetadataCache (Redis) provides 95% hit rate. RepairWorker maintains 11-nines durability via continuous background re-replication.

StorageHDFSBlock StorageReplicationProduction

Try in Simulator

Problem Statement

The metadata-plus-block-storage architecture is the industry-standard approach to distributed file systems, pioneered by Google File System (GFS) and adopted by HDFS, Amazon S3, and virtually every production object storage system. It solves the fundamental limitations of the single-server naive approach by separating the metadata plane from the data plane and distributing data across many storage nodes.

The key architectural insight is that metadata and data have fundamentally different characteristics. Metadata is small (approximately 100 bytes per object: key, block list, checksums, permissions) and accessed frequently — every read, write, list, and delete operation starts with a metadata lookup. Data is massive (objects range from bytes to terabytes) and accessed less frequently. By separating them, each plane can be optimized independently: metadata fits in memory and can be cached aggressively (Redis at 95% hit rate), while data is distributed across hundreds of storage nodes for parallel I/O.

The data plane uses a block-based storage model. Objects are split into fixed-size blocks (64 MB), each stored with 3x replication across different fault domains (availability zones). When a client uploads a 640 MB file, it is split into 10 blocks. Each block is written to 3 different BlockStore nodes in different AZs — 30 block writes total, many of which happen in parallel. This provides both durability (any 2 replicas can fail without data loss) and read performance (the client reads from the nearest replica). The 64 MB block size is a carefully chosen trade-off: smaller blocks would mean more metadata rows per object (increasing metadata DB load), while larger blocks would reduce parallelism opportunities and waste space for small objects.

The metadata plane consists of a MetadataService backed by MetadataDB (DynamoDB) and MetadataCache (Redis). On write, MetadataService allocates block IDs, selects storage nodes across fault domains, writes the object-to-block mapping to both DB and cache, and returns block locations to the client for direct data upload. On read, MetadataService looks up the block list in cache (95% hit rate, 2ms) or DB (15ms on miss), and returns block locations for direct data download. This two-phase approach — metadata lookup followed by direct data transfer — means the MetadataService handles only lightweight metadata operations, not the heavy data I/O.

A RepairWorker runs continuously in the background, monitoring block health via heartbeats from storage nodes. When a node fails or a block is corrupted (checksum mismatch), the worker reads a healthy replica, writes a new replica to a different node, and updates the metadata. This maintains 3x replication for 11-nines durability (99.999999999%). The repair process is asynchronous and prioritized — blocks with only 1 remaining replica are repaired before those with 2 replicas, preventing data loss during cascading failures. It does not impact client-facing performance because repair I/O is rate-limited to avoid saturating storage nodes.

The storage overhead of 3x replication is the primary cost concern at scale. For 1 PB of user data, the system stores 3 PB of raw block data. This overhead motivates the erasure-coded v2 variant, which achieves the same durability at only 1.4x overhead — a 53% reduction in raw storage cost.

This architecture appears in every system design interview for storage roles. Interviewers expect candidates to explain the metadata/data separation, articulate the block replication strategy, reason about the RepairWorker's role in durability, and identify the MetadataService as a potential bottleneck at extreme scale — which motivates the erasure-coded variant's use of Raft consensus for distributed metadata.

Architecture Overview

The distributed file system uses 8 components organized into separate metadata and data planes with an audit and repair pipeline. The components are: StorageClient, ApiGateway, MainLB, MetadataService, MetadataCache (Redis), MetadataDB (DynamoDB), BlockStore (S3-modeled distributed storage), AuditStream (Kafka), and RepairWorker.

All traffic enters through the ApiGateway, which performs IAM-style signature authentication (approximately 3ms), enforces per-bucket rate limits (800K RPS capacity), and routes to the MainLB. The Load Balancer distributes requests across 25 MetadataService pods using round-robin — metadata operations are stateless. The ApiGateway adds approximately 1.5ms of transformation latency for request normalization and provides a centralized point for rate limiting, authentication, and routing.

The MetadataService is the brain of the system. It handles four operations: GET (resolve object to block mapping), PUT (allocate blocks, select replica nodes, write mapping), DELETE (mark tombstone, schedule garbage collection), and LIST (prefix scan). For GET, it checks MetadataCache first — with a 95% hit rate, most lookups return in 2ms. On cache miss, it falls back to MetadataDB at approximately 15ms. For PUT, it writes to both MetadataDB (durable) and MetadataCache (fast subsequent reads), ensuring read-after-write consistency. The service runs 25 pods with 100 threads each for 312K sustained RPS.

MetadataCache is a 12-node Redis cluster caching object-to-block mappings. Each entry is approximately 100 bytes (key, block IDs, replica locations, size, checksum). With billions of cached entries and a 95% hit rate, the cache reduces MetadataDB load by 20x. Write-through on PUT ensures newly stored objects are immediately readable from cache. LRU eviction with 1-hour TTL balances memory usage with freshness. The cache uses a single-flight pattern to prevent thundering herd on cache misses — if multiple concurrent requests miss the same key, only one query hits the database while others wait for the result.

MetadataDB is DynamoDB on-demand with 128 partitions and 3 replicas. It stores the authoritative object-to-block mappings with strong consistency mode for read-after-write guarantees. The partition key is the bucket/key hash, distributed across partitions for uniform load. It handles approximately 25K reads/sec (cache misses) and 50K writes/sec. Latency follows a log-normal distribution with a tail at approximately 30ms for reads and 60ms for writes, reflecting the realistic behavior of distributed databases under production load.

BlockStore represents the distributed data storage fleet — modeled as S3 Standard with 128 partitions and 3 replicas. Objects are split into 64 MB blocks, each stored with 3x replication across different fault domains. Clients read and write blocks directly (bypassing MetadataService for data plane operations), providing horizontal scalability: adding more storage nodes increases both capacity and throughput linearly. Each block has a SHA-256 checksum stored in MetadataDB, verified on every read to detect silent data corruption (bit rot).

AuditStream is a 64-partition Kafka cluster logging every write operation (PUT and DELETE). RepairWorker consumes these events to track newly written blocks and monitor their health. It continuously scans for under-replicated blocks (fewer than 3 healthy replicas due to node failures or corruption) and creates new replicas from healthy copies. The worker runs 30 instances processing approximately 1K repair operations per second, prioritizing blocks by remaining replica count. This background repair process is the durability guarantee engine — it maintains 3x replication despite hardware failures, achieving 11-nines durability.

Architecture Preview

Loading architecture preview...

Open in Simulator

Key Design Decisions

Metadata/Data Plane Separation

Choice

Separate MetadataService + cache from distributed BlockStore nodes

Rationale

Metadata is small (100 bytes per object) and fits in memory. Data is massive (exabytes) and requires distributed disk storage. Separating them means metadata operations (list, head, permissions) are fast and do not compete with data I/O. This is the HDFS NameNode/DataNode pattern. MetadataService handles 500K+ metadata reads/sec while BlockStore handles terabytes/sec of data throughput independently.

64 MB Block Size with 3x Replication

Choice

Split objects into fixed 64 MB blocks, each replicated 3 times across fault domains

Rationale

64 MB balances metadata overhead (smaller blocks mean more metadata rows) against parallelism (larger blocks reduce parallel download opportunities). 3x replication provides 11-nines durability: data survives any 2 simultaneous node failures. The storage overhead is 3x — 1 PB of data requires 3 PB of raw storage. The Erasure-Coded variant reduces this to 1.4x.

MetadataCache (Redis) for Fast Lookups

Choice

12-node Redis cluster caching object-to-block mappings with 95% hit rate

Rationale

At 500K metadata reads/sec, hitting DynamoDB directly would require massive read IOPS. Redis absorbs 95% of reads at 2ms latency, reducing DB load to approximately 25K reads/sec. Metadata entries are small (100 bytes), so billions fit in 32 GB of Redis memory. Write-through on PUT ensures read-after-write consistency.

Async Repair via RepairWorker

Choice

Background worker continuously monitors and re-replicates under-replicated blocks

Rationale

When a storage node fails, synchronously re-replicating all its blocks (approximately 100 TB per node) would create a massive I/O burst degrading client-facing performance. Async repair spreads the load over hours, prioritizing blocks with only 1 remaining replica (critical) over those with 2 replicas (warning). This prevents repair storms.

Kafka Audit Stream for Compliance and Repair

Choice

Every write publishes an audit event to Kafka, consumed by RepairWorker

Rationale

The audit stream serves dual purposes: compliance logging (who stored what, when) and repair triggering (RepairWorker tracks newly written blocks that need health verification). This eliminates the need for a separate health check system. 64 partitions with 500K msg/sec capacity handle peak write traffic.

Strong Read-After-Write Consistency

Choice

Write-through cache plus strong consistency mode on DynamoDB

Rationale

When MetadataService writes a new object's metadata, it writes to both MetadataDB (durable) and MetadataCache (fast). Subsequent reads hit the cache for strong consistency on that key. The data plane write completes only when all 3 block replicas acknowledge. This ensures a successful PUT is immediately readable via GET.

Scale & Performance

Target RPS

650K ops/sec (metadata + data I/O)

Latency (p99)

<100ms small object reads, <200ms small object writes

Storage

Petabytes (horizontally scalable data plane)

Availability

99.99% (replicated components, 11-nines durability)

Time & Space Complexity

Operation	Time	Space	Notes
GET object (small, cache hit)	O(1) cache lookup + O(1) block read from nearest replica	O(B) where B = object size (block data transfer)	Cache hit at 2ms + single block read at 20ms = approximately 22ms total. 95% of reads take this path.
GET object (large, N blocks)	O(1) metadata + O(N) parallel block reads	O(B) aggregate block data	N blocks read in parallel from N different nodes. Throughput scales linearly with block count.
PUT object (N blocks, 3x replication)	O(1) metadata write + O(N) parallel block writes x 3 replicas	O(3B) total storage (3x replication overhead)	Each block written to 3 nodes in parallel. Metadata commit includes DB + cache write-through.
LIST objects by prefix	O(log N + K) range scan on MetadataDB/Cache	O(K) result set, K = matching objects	Prefix scan on DynamoDB partition key range. Cache hit for hot prefixes.

Database Schema (HLD)

object_metadata (DynamoDB)

Source of truth for object-to-block mappings. Partition key is the composite bucket/key hash distributed across 128 DynamoDB partitions with 3 replicas. Strong consistency mode ensures read-after-write guarantees. Each row maps an object key to its block IDs, replica locations, size, and SHA-256 checksum.

object_key VARCHAR (bucket/key composite, partition key)version NUMBER (object version for ordering)block_ids ARRAY (list of 64 MB block UUIDs)replicas ARRAY (replica locations per block, 3 per block)size_bytes NUMBER (original object size)checksum VARCHAR (SHA-256 of complete object)created_at VARCHAR (ISO 8601 timestamp)deleted BOOLEAN (tombstone flag for GC)

Growth proportional to total object count. At trillions of objects, table reaches approximately 100 TB. Strong consistency mode adds approximately 5ms to read latency vs eventual consistency.

data_blocks (BlockStore / S3)

Block-level data storage representing distributed DataNodes. Each 64 MB block is stored with 3x replication across different fault domains. Blocks are immutable once written. Clients read and write blocks directly via block ID.

block_id VARCHAR (UUID, primary key)data BLOB (block data, 64 MB max)checksum VARCHAR (SHA-256 of block data)replica_index NUMBER (replica number: 0, 1, or 2)

128 partitions distributed across storage nodes. Checksums verified on every read to detect bit rot. Immutable — overwrites create new block IDs.

meta:{bucket}:{key} (MetadataCache / Redis)

Cached object-to-block mappings for fast metadata lookups. Write-through on PUT ensures read-after-write consistency. LRU eviction with 1-hour TTL. 95% hit rate reduces DynamoDB load by 20x.

block_ids LIST (block UUID list)replicas LIST (replica locations per block)size_bytes NUMBERchecksum VARCHAR (SHA-256)created_at VARCHAR (ISO timestamp)

100 bytes per entry. 1B cached entries is approximately 100 GB. 12-node Redis cluster with 32 GB per node.

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Single NFS-Like Server)	T1	50-300ms small files, minutes for large files	~400 ops/sec (single disk I/O ceiling)	$300/month (single server + RDS)	Minimal — 3 components, no distribution	99% (single disk, no redundancy)
Metadata + Block Storage (HDFS/S3 Pattern)	T2	<100ms small objects, seconds for large objects	650K ops/sec (distributed I/O)	$5,000/month (8 components, 3x replication)	Medium — metadata/data separation, repair pipeline	99.99% (3x replication, 11-nines durability)
Erasure-Coded Multi-Region Storage	T3	<50ms small objects (erasure decode adds 5-15ms)	2M ops/sec (erasure + multi-region)	$8,000/month (11 components, 1.4x storage + cross-region)	High — Raft, erasure coding, lifecycle tiering	99.999% (erasure coding, multi-region DR)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why is the MetadataService a potential bottleneck?

All operations route through MetadataService for metadata lookup or mutation. At extreme scale (100M+ objects/sec), the metadata layer becomes the bottleneck even with Redis caching. HDFS's original NameNode had exactly this limitation — a single process managing the entire namespace. The solution is metadata federation: multiple MetadataService instances, each owning a namespace partition (e.g., by bucket). The v2 variant uses a Raft consensus cluster for distributed metadata management.

Why 3x replication instead of erasure coding?

3x replication provides the fastest reads (any single replica, no decode) and fastest writes (parallel replicate, no encode computation). For hot data accessed frequently, replication's 3x storage overhead is justified by read performance — a single replica read in 2ms vs erasure decode in 15ms. Erasure coding (Reed-Solomon 10+4 at 1.4x overhead) is better for cold data where the storage savings outweigh the decode latency cost. The v2 variant implements erasure coding with tiered storage.

How does the RepairWorker maintain 11-nines durability?

The RepairWorker runs continuously, checking block health via heartbeats from storage nodes. When a node fails or a block's checksum does not match (bit rot), the worker reads a healthy replica from a surviving node, writes a new replica to a different node in a different fault domain, and updates MetadataDB with the new replica location. Blocks with only 1 remaining replica are repaired with highest priority. With a mean time to repair of hours and a mean time between failures of months per node, the probability of losing all 3 replicas before repair is approximately 10^-11 — hence 11-nines durability.

What happens during a node failure?

When a BlockStore node fails, blocks on that node drop from 3 replicas to 2. Client reads continue from the surviving 2 replicas with zero downtime — the MetadataService returns replica locations, and the client reads from any available replica. Meanwhile, RepairWorker detects the under-replication (via missing heartbeats and audit stream events) and begins re-replicating affected blocks to new nodes. The repair process is gradual to avoid I/O storms: approximately 1K blocks/sec, completing a full node's repair in hours.

How does the block model enable parallel I/O?

A 640 MB object is split into 10 blocks of 64 MB each, distributed across 10 different storage nodes. On download, the client can read all 10 blocks in parallel from 10 different nodes — achieving 10x the throughput of a single-node sequential read. On upload, the client writes each block to 3 different nodes in parallel. This parallelism scales linearly with the number of blocks and storage nodes, which is why distributed file systems can achieve terabytes/sec of aggregate throughput.

Related Templates

Distributed File System — Naive (Single NFS-Like Server)Distributed File System — Erasure-Coded Multi-Region

Discussion

Ready to design your own Distributed File System?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator