Hard | 3 components | Interview: High

Distributed File System — Naive (Single NFS-Like Server)

The simplest file storage system: a single FileService writes files to local disk and stores metadata in PostgreSQL. No chunking, no replication, no distribution. Demonstrates why single-server storage fails at even moderate scale due to capacity ceilings, I/O bottlenecks, and zero fault tolerance.

Storage | Beginner | Bottleneck Analysis | Single Server

Problem Statement

Designing a distributed file system is one of the most fundamental system design interview questions because it forces candidates to reason about data durability, fault tolerance, storage scalability, and the separation of metadata from data. The naive single-server approach is where every candidate should start — it establishes the baseline that makes the improvements in distributed architectures measurable and concrete.

The core challenge is storing and retrieving files reliably at scale. In the naive approach, a single FileService handles all operations: upload writes the file to local disk and records metadata (file ID, filename, size, SHA-256 checksum, disk path) in a PostgreSQL database. Download queries the metadata from PostgreSQL, reads the file from local disk, and streams it to the client. List queries the filename column with a LIKE prefix match. Delete removes the file from disk and the metadata row from the database.

The naive approach has four fatal limitations that interviewers expect candidates to identify. First, storage capacity is bounded by a single machine's disk, typically 16 TB with NVMe SSD. When the disk fills up, the only option is a larger disk or a new server with manual data migration; there is no horizontal scaling. Second, there is zero fault tolerance: a single disk failure, server crash, or filesystem corruption causes permanent data loss for every file on that machine. Enterprise disks have a 1-2% annual failure rate, so over a long enough period data loss should be expected. Third, all I/O flows through one server's disk controller and network interface. On a 1 Gbps network link the interface alone caps transfers at approximately 125 MB/s, so a single large file download can consume the entire bandwidth and starve other operations. Fourth, without chunking, large files cannot be uploaded or downloaded in parallel: a 10 GB upload is one sequential disk write that blocks other operations.

The consistency model in the naive approach is trivially strong: there is one disk and one database, so there is no replication lag, no eventual consistency, and no split-brain scenario. A write that returns success is guaranteed to be readable immediately. This simplicity is genuinely valuable for understanding the trade-offs introduced by distributed approaches, where strong consistency requires coordination protocols like Raft consensus (v2 variant) that add latency and operational complexity.

These limitations directly motivate the distributed architectures in the v1 and v2 variants. The Metadata + Block Storage variant (v1) separates metadata from data, splits files into 64 MB blocks distributed across storage nodes, and replicates each block 3x for durability. The Erasure-Coded variant (v2) goes further with Reed-Solomon coding for storage efficiency, Raft consensus for metadata consistency, multipart uploads for large files, and lifecycle tiering for cost optimization.

This template makes the single-server limitations visible and quantifiable. Run the simulation at increasing file sizes and watch disk I/O saturate while the service layer sits idle. Compare upload throughput with the v1 variant where the same file is split into blocks uploaded in parallel across different storage nodes — the difference is dramatic. The simulation also highlights the delete non-atomicity: the disk unlink and DB delete are two separate operations, so a crash between them leaves either an orphaned file (no metadata pointing to it) or a dangling metadata row (pointing to a deleted file).
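The two inconsistent states described above (a file with no metadata row, or a metadata row with no file) can be detected by comparing the two sources of truth. The following is a minimal sketch, not part of the template: the function name `reconcile` and the idea of passing both sides in as plain lists of file IDs are assumptions made for illustration.

```python
def reconcile(disk_file_ids, db_file_ids):
    """Compare file IDs found on disk with file IDs recorded in the database.

    Hypothetical helper: in a real sweep, disk_file_ids would come from
    listing the data directory and db_file_ids from a SELECT on the files table.
    """
    disk = set(disk_file_ids)
    db = set(db_file_ids)
    return {
        # On disk but unknown to the database: wasted space.
        "orphaned_files": sorted(disk - db),
        # In the database but missing on disk: downloads would fail.
        "dangling_rows": sorted(db - disk),
    }


report = reconcile(["a", "b", "c"], ["b", "c", "d"])
print(report)
```

A sweep like this only reports the inconsistency. Deciding whether to delete an orphan or drop a dangling row is a policy choice, and any real implementation would need to avoid racing with uploads that are still in flight.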

Distributed file system design appears in interviews at Google (GFS/Colossus), Amazon (S3), Meta (Tectonic), Microsoft (Azure Blob), Dropbox, and Netflix. Interviewers expect candidates to start with the single-server baseline, identify the capacity/durability/throughput limitations, and propose separation of metadata and data planes as the first architectural improvement.

Architecture Overview

The naive file system is a three-component linear architecture: Client, FileService, and FileDB (PostgreSQL). There is no load balancer, no cache, no replication, no event stream, and no separation between metadata and data planes. This is the absolute minimum viable file storage system.

All traffic flows directly from the Client to the FileService, which handles all four operations: upload, download, list, and delete. The FileService is a REST API that keeps no session state between requests, running as 2 pods with 50 threads each on AWS ECS Fargate. It is not stateless in the storage sense: every file lives on the service's local disk, so both pods must read and write the same disk. Service processing time is approximately 10ms, but total response time is dominated by disk I/O for file reads and writes. For a small file (under 1 MB), total upload latency is 100-300ms (disk write with fsync plus database INSERT). For large files (1 GB or more), an upload can take minutes because the entire file is transferred and written sequentially to a single disk.

FileDB is a single PostgreSQL instance (RDS db.r7g.xlarge with 4 vCPU and 32 GB RAM) storing file metadata. The files table has one row per file containing the file ID, human-readable filename, size in bytes, SHA-256 checksum, local disk path, content type, and timestamps. A B-tree index on file_id supports direct lookups, and an index on filename supports prefix-based listing queries. Metadata rows are small (approximately 200 bytes each), so millions of files fit comfortably. The database is not the bottleneck in this design — disk I/O is.

The system stores file data on the FileService's local disk at computed paths (/data/{file_id}). There is no object storage (S3), no distributed storage layer, and no RAID. Files exist in exactly one copy on one physical disk. The checksum computed on upload is verified on download to detect corruption, but there is no mechanism to recover from corruption — if the only copy is damaged, the data is lost.

The upload flow follows a strict sequence: FileService first writes the file bytes to local disk with fsync to ensure durability, then computes the SHA-256 checksum over the written data, and finally INSERTs the metadata row into FileDB. If the disk write succeeds but the DB INSERT fails, an orphaned file remains on disk consuming space with no metadata pointing to it. There is no cleanup mechanism for these orphans in the naive approach — a production system would need a garbage collection sweep comparing disk contents against database records.
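The three-step sequence above can be sketched as follows. This is an illustration under stated assumptions, not the template's implementation: SQLite from the Python standard library stands in for PostgreSQL so the example is self-contained, the function name `upload` is hypothetical, and the checksum is computed from the in-memory payload rather than by re-reading the file.

```python
import hashlib
import os
import sqlite3
import tempfile
import uuid


def upload(conn, data_dir, filename, payload):
    """Write bytes to disk with fsync, checksum them, then insert metadata."""
    file_id = str(uuid.uuid4())
    disk_path = os.path.join(data_dir, file_id)

    # Step 1: write the file and force it to stable storage.
    with open(disk_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())

    # Step 2: compute the SHA-256 checksum of the uploaded bytes.
    checksum = hashlib.sha256(payload).hexdigest()

    # Step 3: record the metadata row. If this INSERT fails, the file from
    # step 1 stays on disk with no row pointing to it (the orphan case).
    with conn:
        conn.execute(
            "INSERT INTO files (file_id, filename, size_bytes, checksum, disk_path)"
            " VALUES (?, ?, ?, ?, ?)",
            (file_id, filename, len(payload), checksum, disk_path),
        )
    return file_id


# SQLite is an assumption used only to keep the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id TEXT PRIMARY KEY, filename TEXT,"
    " size_bytes INTEGER, checksum TEXT, disk_path TEXT)"
)
data_dir = tempfile.mkdtemp()
file_id = upload(conn, data_dir, "report.txt", b"hello")
print(file_id)
```

Because the disk write and the INSERT are separate steps with no shared transaction, the sketch reproduces the same gap the text describes: nothing rolls back the file if step 3 raises.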

The download flow is equally straightforward: FileService queries FileDB for the file's metadata by primary key (an O(log N) B-tree lookup, effectively constant at this table size), reads the file from the disk path specified in the metadata row, verifies the checksum, and streams the content to the client. If the checksum does not match (indicating silent data corruption or bit rot), the download fails with an error because there is no healthy replica to fall back to.
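The checksum verification step can be sketched as below. The helper name `checksum_matches` and the 1 MiB read size are assumptions for illustration. The demo writes a small file, verifies it, then overwrites one byte to simulate corruption.

```python
import hashlib
import os
import tempfile


def checksum_matches(path, expected_sha256, read_size=1024 * 1024):
    """Hash the file in fixed-size reads and compare with the stored checksum."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(read_size)
            if not chunk:
                break
            hasher.update(chunk)
    return hasher.hexdigest() == expected_sha256


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blob")
    with open(path, "wb") as f:
        f.write(b"payload")
    expected = hashlib.sha256(b"payload").hexdigest()

    ok_before = checksum_matches(path, expected)

    # Simulate bit rot by overwriting the first byte in place.
    with open(path, "r+b") as f:
        f.write(b"X")
    ok_after = checksum_matches(path, expected)

print(ok_before, ok_after)
```

Hashing in fixed-size reads keeps memory use constant for large files. Verifying before streaming means the file is read twice (once to hash, once to send), which is one reason some systems verify per block instead of per file.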

The concrete scaling ceiling is approximately 400 mixed operations per second, limited by disk I/O bandwidth. At this point, upload and download operations compete for the same disk controller, and large file operations starve small ones. The 100-connection PostgreSQL pool is not the bottleneck at this scale — the disk is. Storage capacity is bounded at approximately 16 TB, after which the server cannot accept new uploads.

There is no redundancy at any layer. If the FileService process crashes, it restarts but loses any in-flight uploads. If the disk fails, all file data is permanently lost. If PostgreSQL fails, files on disk become inaccessible because their metadata (paths, checksums) is gone. The system has a single point of failure at every component.

Key Design Decisions

Single Server with Local Disk

Choice

Store all files on one machine's local filesystem

Rationale

Local disk storage is the simplest possible approach — no SDK, no network hop to a storage service, no bucket configuration. A standard filesystem write stores the file immediately. The cost is a hard storage ceiling (single disk capacity, typically 16 TB), zero fault tolerance (one disk failure equals total data loss), and a bandwidth bottleneck (one disk controller, one network interface). The Metadata + Block Storage variant replaces local disk with distributed storage nodes across fault domains.

No Chunking

Choice

Store files as monolithic blobs on disk

Rationale

Without chunking, a 10 GB file is one contiguous blob on disk. The implementation is trivial — a single filesystem write. But this prevents parallel uploads and downloads, makes resumable transfers impossible, and means a single large file upload blocks disk I/O for all other operations. The v1 variant splits files into 64 MB blocks that can be uploaded and downloaded in parallel from different storage nodes.

No Replication

Choice

Every file exists in exactly one copy

Rationale

Zero redundancy means zero additional storage cost and zero replication complexity. But a disk failure, server crash, or filesystem corruption results in permanent data loss. Enterprise disk annual failure rate is 1-2%, making data loss a statistical certainty over a fleet lifetime. The v1 variant replicates each block 3x across different fault domains for 11-nines durability.

Single PostgreSQL for Metadata

Choice

One database instance for all file metadata

Rationale

A single PostgreSQL instance eliminates replication lag and failover complexity. Metadata rows are small (approximately 200 bytes), so millions of files fit easily. But the single DB is also a single point of failure for metadata — if PostgreSQL goes down, files on disk become inaccessible. The v1 variant adds a metadata cache (Redis) and separates the metadata plane from the data plane.

No Load Balancer

Choice

Client connects directly to FileService

Rationale

With only 2 service pods on a single machine, a load balancer adds latency without meaningful benefit. The bottleneck is disk I/O, not request distribution. This keeps the architecture at 3 components — the absolute minimum. The v1 variant adds an API Gateway and Load Balancer for authentication, rate limiting, and multi-pod distribution.

Scale & Performance

Target RPS

~400 mixed ops/sec (disk I/O ceiling)

Latency (p99)

50-300ms small files, minutes for GB-scale files

Storage

~16 TB (single disk capacity)

Availability

~99% (single server, no redundancy)
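To make the availability figure concrete, the sketch below converts an availability percentage into expected downtime per year. The function name is hypothetical, and the 99.99% comparison value is taken from the Solution Comparison table in this template.

```python
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability):
    """Expected hours of unavailability per year for a given availability."""
    return (1 - availability) * HOURS_PER_YEAR


naive_hours = downtime_hours_per_year(0.99)           # this template
v1_minutes = downtime_hours_per_year(0.9999) * 60     # v1 variant, for contrast

print(f"99%    -> about {naive_hours:.1f} hours/year")
print(f"99.99% -> about {v1_minutes:.1f} minutes/year")
```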

Time & Space Complexity

Operation	Time	Space	Notes
Upload file (PUT /files/{file_id})	O(S) disk write + O(log N) DB INSERT, where S = file size and N = total files	O(S) disk + O(1) metadata row	Disk write dominates for large files. At the 3-5 GB/s sequential NVMe throughput assumed elsewhere in this template, raw write time is roughly 0.2-0.3s for 1 GB and 2-3s for 10 GB. A 1 MB write takes ~10ms because fixed fsync overhead dominates, not bandwidth.
Download file (GET /files/{file_id})	O(log N) DB SELECT + O(S) disk read	O(S) network transfer buffer	Metadata lookup is a primary key B-tree seek: O(log N), effectively constant in practice. Disk read speed determines total latency for large files.
List files (GET /files?prefix=...)	O(log N + K) where N = total files, K = matching files	O(K) result set	B-tree index seek (log N) plus scan of K matching entries. Lightweight, with no disk I/O for file data.
Delete file (DELETE /files/{file_id})	O(1) disk unlink + O(log N) DB DELETE	O(1) freed	Filesystem unlink is near-instant. DB DELETE removes the metadata row. Not atomic: partial failure is possible.
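The O(log N + K) cost of the list operation comes from seeking to the first matching key and then scanning forward. The sketch below imitates that access pattern with a sorted Python list and binary search; it illustrates the idea and is not how PostgreSQL implements its B-tree. The function name `list_prefix` is hypothetical.

```python
import bisect


def list_prefix(sorted_names, prefix):
    """Return all names starting with prefix from an already sorted list."""
    # Seek: O(log N) binary search to the first name >= prefix.
    i = bisect.bisect_left(sorted_names, prefix)
    matches = []
    # Scan: O(K) walk over the contiguous run of matching names.
    while i < len(sorted_names) and sorted_names[i].startswith(prefix):
        matches.append(sorted_names[i])
        i += 1
    return matches


names = sorted(["logs/a", "logs/b", "img/x", "logz"])
print(list_prefix(names, "logs/"))
```

This works because every string that starts with a given prefix sorts at or after the prefix itself, and all such strings are adjacent in sorted order.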

Database Schema (HLD)

files

Stores metadata for every file in the system. One row per file with B-tree indexes on file_id (primary key) and filename (for prefix listing queries). Rows are small (approximately 200 bytes) so millions of files fit comfortably in a single PostgreSQL instance. The disk_path column maps to the local filesystem location where the actual file data resides.

file_id VARCHAR PK (UUID, unique file identifier)
filename VARCHAR (human-readable name, indexed for prefix queries)
size_bytes BIGINT (file size in bytes)
checksum VARCHAR (SHA-256 of file content)
disk_path VARCHAR (local filesystem path: /data/{file_id})
content_type VARCHAR (MIME type, optional)
created_at TIMESTAMPTZ (upload timestamp)
updated_at TIMESTAMPTZ (last modification timestamp)

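Indexes: idx_files_filename ON (filename), used for prefix queries for LIST. In PostgreSQL, a B-tree index serves LIKE 'prefix%' only when the column uses the C collation or the index is built with a pattern operator class such as varchar_pattern_ops.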

At 25 uploads/sec base rate, table grows approximately 2M rows/day. Metadata is not the bottleneck — disk I/O for file data is.
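The growth figures can be checked with a few lines of arithmetic. The 25 uploads/sec rate, 200-byte rows, and 16 TB disk come from this template. The 5 MB average file size is an assumption borrowed from the FAQ's throughput example, and decimal units (1 TB = 10^12 bytes) are assumed throughout.

```python
SECONDS_PER_DAY = 86_400

uploads_per_sec = 25
row_bytes = 200                     # approximate metadata row size
avg_file_bytes = 5 * 10**6          # assumption: 5 MB average file
disk_capacity_bytes = 16 * 10**12   # 16 TB single disk

rows_per_day = uploads_per_sec * SECONDS_PER_DAY
metadata_bytes_per_day = rows_per_day * row_bytes
data_bytes_per_day = rows_per_day * avg_file_bytes
days_to_fill = disk_capacity_bytes / data_bytes_per_day

print(f"rows/day:      {rows_per_day:,}")
print(f"metadata/day:  {metadata_bytes_per_day / 10**6:.0f} MB")
print(f"file data/day: {data_bytes_per_day / 10**12:.1f} TB")
print(f"days to fill:  {days_to_fill:.2f}")
```

Under these assumptions metadata grows by a few hundred megabytes per day while file data grows by terabytes, which is the asymmetry behind the claim that the database is not the bottleneck.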

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
Naive (Single NFS-Like Server)	T1	50-300ms small files, minutes for large files	~400 ops/sec (single disk I/O ceiling)	$300/month (single server + RDS)	Minimal — 3 components, no distribution	99% (single disk, no redundancy)
Metadata + Block Storage (HDFS/S3 Pattern)	T2	<100ms small objects, seconds for large objects	650K ops/sec (distributed I/O)	$5,000/month (8 components, 3x replication)	Medium — metadata/data separation, repair pipeline	99.99% (3x replication, 11-nines durability)
Erasure-Coded Multi-Region Storage	T3	<50ms small objects (erasure decode adds 5-15ms)	2M ops/sec (erasure + multi-region)	$8,000/month (11 components, 1.4x storage + cross-region)	High — Raft, erasure coding, lifecycle tiering	99.999% (erasure coding, multi-region DR)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why is distributed file system design a top interview question?

Distributed file systems combine the fundamental challenges of system design: data durability (how to survive hardware failures), scalability (how to grow beyond one machine), consistency (how to keep metadata and data in sync), and performance (how to achieve high throughput for diverse file sizes). Most large tech companies operate a distributed file or object storage system: Google has GFS/Colossus, Amazon has S3, Meta has Tectonic, and Microsoft has Azure Blob Storage. Interviewers use this question to test whether candidates understand the progression from single-server to distributed architecture and can articulate the trade-offs at each step.

Why does the naive approach fail at just 400 operations per second?

All file I/O flows through one disk controller on one server. A typical NVMe SSD supports approximately 500,000 random IOPS but only 3-5 GB/s sequential throughput. At 400 mixed operations per second with an average file size of 5 MB, the server is transferring 2 GB/s — approaching the disk bandwidth ceiling. Large file operations are particularly destructive: a single 10 GB upload consumes the disk controller for 2-3 seconds, during which all other operations queue. The database itself can handle 400 RPS easily; disk I/O is the hard bottleneck.
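The numbers in this answer can be reproduced directly. The sketch below uses only the figures stated above (400 ops/sec, 5 MB average file, 3-5 GB/s sequential throughput, a 10 GB upload) with decimal units; the variable names are illustrative.

```python
MB = 10**6
GB = 10**9

ops_per_sec = 400
avg_file_bytes = 5 * MB
disk_seq_low, disk_seq_high = 3 * GB, 5 * GB   # sequential throughput range

# Aggregate bandwidth the workload demands from the disk.
required = ops_per_sec * avg_file_bytes

# Share of the disk's sequential bandwidth consumed, best and worst case.
util_best = required / disk_seq_high
util_worst = required / disk_seq_low

# How long one 10 GB upload occupies the disk at full sequential speed.
upload_bytes = 10 * GB
stall_fast = upload_bytes / disk_seq_high
stall_slow = upload_bytes / disk_seq_low

print(f"required bandwidth: {required / GB:.1f} GB/s")
print(f"disk utilization:   {util_best:.0%} to {util_worst:.0%}")
print(f"10 GB upload stall: {stall_fast:.1f}s to {stall_slow:.1f}s")
```

Sequential throughput is a best-case figure. A mixed read/write workload with many concurrent files typically achieves less, so the real headroom is likely smaller than this calculation suggests.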

What is the first optimization an interviewer expects?

Separate the metadata plane from the data plane. Instead of storing files on local disk and metadata in a co-located database, use a dedicated MetadataService backed by a fast cache (Redis) for metadata lookups, and distribute file data across multiple storage nodes (BlockStore). This separation allows each plane to scale independently: metadata operations are lightweight (approximately 200-byte rows) and can be cached aggressively, while data operations are I/O-heavy and benefit from parallelism across many disks.

How does chunking improve large file performance?

Without chunking, a 10 GB file is one sequential write on one disk, taking 2-3 seconds at 3-5 GB/s. With 64 MB chunking, the same file is split into 160 blocks (10,240 MB / 64 MB). These blocks can be written in parallel across up to 160 different storage nodes, which reduces disk write time from seconds to tens of milliseconds per block (bounded by the slowest block write), assuming the client's network link is not the limiting factor. Chunking also enables resumable uploads: if the connection drops at 80%, the blocks that already completed are kept, and only the interrupted block and the blocks not yet sent need to be uploaded instead of the entire 10 GB file. This is why systems such as HDFS and GFS store files as fixed-size blocks, and why S3 offers multipart uploads.
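A small sketch of the block arithmetic and of resuming after a dropped connection. The function names `block_ranges` and `pending_blocks` are hypothetical, and binary units are assumed (64 MiB blocks, a 10 GiB file) so that the division is exact.

```python
BLOCK_SIZE = 64 * 1024**2  # 64 MiB


def block_ranges(file_size, block_size=BLOCK_SIZE):
    """Return (start, end) byte offsets for each block; end is exclusive."""
    return [
        (start, min(start + block_size, file_size))
        for start in range(0, file_size, block_size)
    ]


def pending_blocks(ranges, completed):
    """Indexes of blocks that still need uploading after an interruption."""
    return [i for i in range(len(ranges)) if i not in completed]


ranges = block_ranges(10 * 1024**3)

# Suppose the first 80% of blocks were acknowledged before the drop.
completed = set(range(int(len(ranges) * 0.8)))
todo = pending_blocks(ranges, completed)

print(len(ranges), len(todo))
```

Tracking completion as a set of block indexes rather than a byte offset matters once blocks are uploaded in parallel, because they can finish out of order.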

What is the durability difference between no replication and 3x replication?

With no replication (naive approach), durability equals the reliability of a single disk: approximately 98-99% annually, given a 1-2% annual failure rate. Over 5 years, there is roughly a 5-10% chance of data loss. With 3x replication across independent fault domains (v1 approach), durability can reach 99.999999999% (eleven nines), provided failed replicas are detected and re-replicated promptly; data survives any 2 simultaneous disk or node failures. The probability of losing data drops from a realistic risk within a few years to effectively never in a human lifetime. This dramatic improvement is why replication is non-negotiable for production file systems.
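The single-disk figure follows from compounding the annual failure rate. The sketch below assumes failures in each year are independent and the rate is constant, which is a simplification; the function name is illustrative. It does not model the replicated case, whose durability depends on how quickly lost replicas are rebuilt.

```python
def loss_probability(annual_failure_rate, years):
    """Probability that a single disk fails at least once within `years`."""
    survive_one_year = 1 - annual_failure_rate
    return 1 - survive_one_year ** years


low = loss_probability(0.01, 5)    # 1% annual failure rate
high = loss_probability(0.02, 5)   # 2% annual failure rate

print(f"5-year loss probability: {low:.1%} to {high:.1%}")
```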
