Medium | 9 components | Interview: Very High

Dropbox / File Sync — Block-Level Chunked (SHA-256 + Kafka)

Industry-standard file sync architecture using content-addressed 4MB blocks with SHA-256 hashing. Only changed blocks are uploaded, enabling 90%+ bandwidth savings on typical edits. Kafka-based sync notifications for near-real-time propagation. Deduplication worker handles garbage collection of orphaned blocks.

Tags: Storage, Kafka, Content Addressing, Dedup, File Sync

Problem Statement

The block-level chunked approach to file sync represents the industry-standard architecture popularized by Dropbox (founded in 2007) and widely adopted by cloud storage providers. It solves the two fundamental problems with the naive approach: bandwidth waste from whole-file uploads and storage waste from lack of dedup.

The key insight is splitting files into fixed-size blocks (4MB) and computing a SHA-256 hash for each block. The hash serves as both a content fingerprint and a storage key (content addressing). When a user edits a file in place, only the blocks whose content changed have new hashes, so the client uploads only those blocks. For a typical edit to a 1GB file (250 blocks), only 1-3 blocks change, reducing upload size from 1GB to 4-12MB. This is roughly an 80-250x bandwidth improvement over the naive approach.
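The chunk-and-hash step can be sketched in a few lines. This is a minimal illustration, not Dropbox's client code: the function names are hypothetical, and the demo uses 4-byte blocks so it runs instantly while the default stays at the 4MB size from the text.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4MB fixed block size from the text


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return each block's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[offset:offset + block_size]).hexdigest()
        for offset in range(0, len(data), block_size)
    ]


def changed_block_indexes(old: list[str], new: list[str]) -> list[int]:
    """Positions in the new version whose hash differs from the old version."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


# Demo with tiny 4-byte blocks: one byte changes inside the second block.
old = block_hashes(b"aaaabbbbccccdddd", block_size=4)
new = block_hashes(b"aaaabbXbccccdddd", block_size=4)
changed = changed_block_indexes(old, new)
```

Only the second block (index 1) has a new hash, so only that block would need to be uploaded.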

Content addressing also enables dedup at multiple levels. Within a user's files: if the same block appears in multiple file versions, it is stored once. Across users: if 100 users have the same 4MB block (common for OS files, shared libraries, and identical documents), the block is stored once in S3 with a reference count of 100. Dropbox reported 30-50% total storage savings from this approach. The SHA-256 hash is collision-resistant to 2^128, making hash collisions effectively impossible.
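The "effectively impossible" claim can be made concrete with the birthday bound for a 256-bit hash. The block count n = 10^12 below is an illustrative assumption (about 4 EB of unique 4MB blocks), not a figure from the source.

```latex
P(\text{collision}) \approx \frac{n(n-1)}{2 \cdot 2^{256}} \approx \frac{n^{2}}{2^{257}},
\qquad
n = 10^{12}: \quad P \approx \frac{10^{24}}{2^{257}} \approx 4 \times 10^{-54}
```

Even at that scale, an accidental collision is far less likely than undetected hardware corruption.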

Resumability is a natural consequence of chunking. If a network failure interrupts an upload, only the currently-uploading 4MB block needs to be retried — all previously uploaded blocks are already stored and indexed. For a 50GB file, this means a failure at 99% progress loses at most 4MB of work instead of 49.5GB.
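A rough sketch of why resuming is cheap: the upload loop skips any block that was already acknowledged, so a retry only redoes the block that failed. `FlakyStore` and `upload_missing` are hypothetical names used to simulate a network failure, not part of any real client.

```python
class FlakyStore:
    """In-memory stand-in for object storage that fails on one chosen call."""

    def __init__(self, fail_on_call: int):
        self.calls = 0
        self.fail_on_call = fail_on_call
        self.objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise ConnectionError("simulated network failure")
        self.objects[key] = data


def upload_missing(blocks: dict[str, bytes], acked: set[str], put_block) -> int:
    """Upload blocks not yet acknowledged; safe to call again after a failure."""
    uploaded = 0
    for block_hash, data in blocks.items():
        if block_hash in acked:
            continue  # already stored by an earlier attempt
        put_block(block_hash, data)  # may raise on network failure
        acked.add(block_hash)
        uploaded += 1
    return uploaded


blocks = {f"hash{i}": bytes([i]) for i in range(5)}
store = FlakyStore(fail_on_call=4)
acked: set[str] = set()

try:
    upload_missing(blocks, acked, store.put)
except ConnectionError:
    pass  # the fourth block failed; the first three are safely stored

first_attempt = len(acked)
resumed = upload_missing(blocks, acked, store.put)
```

The first attempt stores three blocks before failing, and the retry uploads only the remaining two.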

Sync notifications use Kafka (SyncStream) rather than polling. When a file version is committed, a file_changed event is published to Kafka with user_id as the partition key, so all events for a user land in the same partition in order. Connected devices receive their user's events in near real time. This eliminates the roughly 95% of poll queries that the naive approach wastes on unchanged state, while keeping notification delay under 5 seconds (comparable to the 0-5 second polling delay, without the polling load).

The architecture introduces a DeduplicationWorker for garbage collection. When files are deleted or old versions expire, the worker decrements block reference counts. Blocks with zero references are marked for deletion from S3 after a 24-hour grace period. The grace period prevents race conditions where a concurrent upload references a block that is being deleted.

Interviewers expect candidates to explain why 4MB is the optimal block size (balancing dedup granularity, metadata overhead, and upload parallelism), discuss SHA-256 as a content-addressing scheme, reason about the dedup savings, analyze Kafka's role in sync notification, and explain the garbage collection challenge.

Architecture Overview

The block-level chunked architecture organizes its components into five layers: traffic entry (UploadClient, DownloadClient, ApiGateway, MainLB), application logic (SyncService), caching (MetadataCache/Redis), persistent storage (MetadataDB/PostgreSQL, ObjectStorage/S3), and async processing (SyncStream/Kafka, DeduplicationWorker).

The upload path handles file changes. UploadClient detects a file change via filesystem watcher, splits the file into 4MB blocks, computes SHA-256 hashes, and sends POST /api/v1/files/upload-init with the file path and list of block hashes. SyncService checks MetadataCache (Redis) and MetadataDB (PostgreSQL) for existing blocks — this is the dedup check. The response tells the client which blocks are new and need uploading. For typical edits, 90%+ of blocks already exist. The client then uploads only the new blocks via PUT /api/v1/files/chunks/{hash} to ObjectStorage (S3). Each block is keyed by its SHA-256 hash (content-addressed). On finalize, SyncService updates MetadataDB with the new file version (ordered list of block hashes), writes through to MetadataCache, and publishes a file_changed event to SyncStream (Kafka).

The download path handles file retrieval. DownloadClient requests file metadata from SyncService, which resolves the file path to a list of block hashes from MetadataCache or MetadataDB. The client already has most blocks locally from previous syncs — it downloads only the blocks it does not have. Block downloads come directly from ObjectStorage (S3).
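The download path boils down to a set difference followed by an ordered join. The sketch below assumes the client keeps a local map from block hash to bytes; `plan_download` and `assemble` are hypothetical helper names, and `remote` stands in for ObjectStorage.

```python
def plan_download(block_list: list[str], local_blocks: dict[str, bytes]) -> list[str]:
    """Return the hashes the client is missing, in order, without repeats."""
    missing: list[str] = []
    for block_hash in block_list:
        if block_hash not in local_blocks and block_hash not in missing:
            missing.append(block_hash)
    return missing


def assemble(block_list: list[str], local_blocks: dict[str, bytes]) -> bytes:
    """Rebuild the file by concatenating blocks in the order given by the metadata."""
    return b"".join(local_blocks[h] for h in block_list)


# Stand-in for ObjectStorage: hash -> bytes (short labels instead of real SHA-256 digests).
remote = {"h1": b"AAAA", "h2": b"BBBB", "h3": b"CCCC"}
local = {"h1": b"AAAA", "h3": b"CCCC"}  # left over from previous syncs
file_blocks = ["h1", "h2", "h3", "h2"]  # the same block can appear twice in one file

to_fetch = plan_download(file_blocks, local)
for block_hash in to_fetch:
    local[block_hash] = remote[block_hash]

content = assemble(file_blocks, local)
```

Only one block is fetched even though the file references it twice.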

MetadataCache (Redis) provides sub-2ms lookups for hot file metadata (file-to-block mappings, block existence checks). With an 85% hit rate, only 15% of reads reach MetadataDB, which cuts its read load by roughly 6-7x. Write-through updates keep the cache consistent when a new file version is committed.

SyncStream (Kafka) carries file_changed events from SyncService to connected devices and DeduplicationWorker. Partitioned by user_id for per-user ordering. 32 partitions with 7-day retention for offline device catch-up.

DeduplicationWorker handles block garbage collection. When files are deleted, the worker decrements block reference counts in MetadataDB. Blocks with ref_count=0 are deleted from ObjectStorage after a 24-hour grace period. 10 workers process GC in parallel.

Horizontal scaling: SyncService scales based on request throughput (add pods). MetadataDB scales via read replicas for reads, sharding for writes. ObjectStorage (S3) scales without manual capacity planning. Kafka scales by partition count.

Key Design Decisions

4MB Fixed Block Size

Choice

Split files into 4MB content-addressed blocks

Rationale

Block size is a critical trade-off. Too small (e.g., 64KB): excessive metadata overhead, since a 1GB file generates roughly 16,000 block entries. Too large (e.g., 64MB): poor dedup granularity, since a 1-byte edit re-uploads 64MB. 4MB is the sweet spot used by Dropbox: a 1GB file generates 250 manageable block entries, and a typical edit uploads 4-12MB (1-3 changed blocks). Note that 4MB is below S3's 5 MiB minimum part size for multipart uploads, which does not matter here because each block is written with a single PUT.
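The trade-off is easy to check with arithmetic. The sketch below assumes decimal units (1GB = 10^9 bytes), which is what the 250-block figure implies, and counts metadata as 64 hex characters per SHA-256 hash. The helper name `block_stats` is illustrative.

```python
HASH_HEX_BYTES = 64  # a SHA-256 digest written as hex is 64 characters

MB = 1000 * 1000
GB = 1000 * MB


def block_stats(file_bytes: int, block_bytes: int) -> tuple[int, int]:
    """Return (number of blocks, bytes of hash metadata) for one file."""
    blocks = -(-file_bytes // block_bytes)  # ceiling division
    return blocks, blocks * HASH_HEX_BYTES


for label, size in [("64KB", 64_000), ("4MB", 4 * MB), ("64MB", 64 * MB)]:
    blocks, meta = block_stats(1 * GB, size)
    print(f"{label:>5}: {blocks:>6} blocks, {meta / 1000:.0f} KB of hashes, "
          f"a 1-byte edit re-uploads {label}")
```

Smaller blocks multiply the metadata per file, while larger blocks multiply the bytes re-sent per edit; 4MB sits between the two extremes.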

SHA-256 Content Addressing

Choice

SHA-256 hash of block content serves as both fingerprint and S3 key

Rationale

Content addressing enables dedup without a separate dedup index. If two users upload identical blocks, the SHA-256 hash is the same, so a repeated S3 PUT rewrites the same bytes under the same key and is effectively idempotent. The hash also serves as an integrity check: the downloaded block must hash to its key. SHA-256 offers about 128 bits of collision resistance (finding any collision takes on the order of 2^128 hash evaluations), making accidental collisions impossible in practice. The trade-off is CPU cost: hashing a 4MB block takes approximately 15ms on modern hardware.
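A minimal in-memory sketch of a content-addressed store shows both properties: identical content maps to one key, and a read can verify itself. `BlockStore` is a hypothetical stand-in for ObjectStorage, not an S3 client.

```python
import hashlib


class BlockStore:
    """In-memory stand-in for an object store keyed by SHA-256 of the content."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._objects[key] = data  # writing identical bytes again changes nothing
        return key

    def get(self, key: str) -> bytes:
        data = self._objects[key]
        if hashlib.sha256(data).hexdigest() != key:
            raise ValueError(f"block {key} failed its integrity check")
        return data

    def __len__(self) -> int:
        return len(self._objects)


store = BlockStore()
key_from_user_a = store.put(b"shared library bytes")
key_from_user_b = store.put(b"shared library bytes")  # same content, different user
```

Both users end up referencing the same key, and the store holds a single object.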

Redis Metadata Cache (85% Hit Rate)

Choice

ElastiCache Redis caching file-to-block mappings with LRU eviction

Rationale

File metadata lookups are the hottest path: every upload-init, download, and sync operation needs them. Redis provides sub-2ms lookups versus 15ms for PostgreSQL. With an 85% cache hit rate, and taking the 25K peak requests/sec as an upper bound on metadata reads, Redis serves approximately 21K reads/sec and PostgreSQL sees approximately 4K reads/sec. Write-through updates on file version changes keep cached block lists in step with the database.
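The cache numbers follow from two small formulas. The sketch assumes a miss pays for both the cache lookup and the database query, which is one reasonable model rather than something the source specifies.

```python
def db_reads_per_sec(total_reads: float, hit_rate: float) -> float:
    """Reads that miss the cache and fall through to the database."""
    return total_reads * (1 - hit_rate)


def expected_latency_ms(hit_rate: float, cache_ms: float, db_ms: float) -> float:
    """Average lookup latency, assuming a miss costs cache lookup plus DB query."""
    return hit_rate * cache_ms + (1 - hit_rate) * (cache_ms + db_ms)


peak_reads = 25_000
hit_rate = 0.85

db_load = db_reads_per_sec(peak_reads, hit_rate)             # about 3,750 reads/sec
cache_load = peak_reads - db_load                            # about 21,250 reads/sec
avg_latency = expected_latency_ms(hit_rate, 2.0, 15.0)       # about 4.25 ms
reduction_factor = peak_reads / db_load                      # about 6.7x
```

The reduction factor is simply 1 / (1 - hit_rate), so every extra point of hit rate matters more as the rate approaches 100%.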

Kafka for Sync Event Streaming

Choice

MSK Kafka with user_id partition key for ordered sync events

Rationale

Sync events must fan out to all of a user's connected devices with reliable delivery and ordering. Kafka covers both requirements: events are retained for the configured retention period whether or not they have been consumed (handling offline devices), and partitioning by user_id guarantees ordering within a user's file changes. 7-day retention means a device offline for up to a week can catch up on all missed changes. This eliminates the polling overhead of the naive approach: no queries are spent checking for changes that did not happen.
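The ordering guarantee comes from key-based partitioning: the same key always maps to the same partition, and each partition is an append-only log. The sketch below imitates that with a SHA-256-derived hash as a stand-in; Kafka's own default partitioner uses a different hash function (murmur2), so the partition numbers here are illustrative only.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 32  # partition count from the text


def partition_for(user_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition deterministically (stand-in hash, not Kafka's murmur2)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# partition number -> append-only list of (user_id, version) events
log: dict[int, list[tuple[str, int]]] = defaultdict(list)


def publish(user_id: str, version: int) -> None:
    log[partition_for(user_id)].append((user_id, version))


# Interleave events from two users.
for version in (1, 2, 3):
    publish("user-a", version)
    publish("user-b", version)

user_a_versions = [v for (u, v) in log[partition_for("user-a")] if u == "user-a"]
```

However the two users' events interleave, each user's versions come back in the order they were published.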

Async Garbage Collection with Grace Period

Choice

DeduplicationWorker with 24-hour grace period before block deletion

Rationale

Deleting blocks immediately when ref_count reaches zero risks data loss. Consider: User A uploads file version N referencing block X. Before commit, User B deletes the last file referencing block X, triggering immediate deletion. User A's commit succeeds but the file is now corrupted — block X no longer exists. The 24-hour grace period ensures all in-flight uploads complete before blocks are deleted. The trade-off is approximately 24 hours of orphaned block storage cost.

S3 for Immutable Block Storage

Choice

Each 4MB block stored as a separate S3 object keyed by SHA-256 hash

Rationale

S3 is designed for 99.999999999% durability, which is critical for user files. Content-addressed blocks are immutable once written (the hash never changes), so there are no overwrite races: blocks are never updated, only created and eventually deleted. S3 has provided strong read-after-write consistency since December 2020, and immutability would keep this design safe even under the older eventual-consistency model. S3 also scales without capacity planning, which covers the multi-petabyte storage requirement and leaves room for growth beyond it.

Scale & Performance

Target RPS

25K peak (5K chunks + 12.5K reads + 5K sync + 2.5K init)

Latency (p99)

0.3s per block upload, <5s sync notification, 2ms cache hit

Storage

3-5 PB at 1M users (30-50% dedup savings)

Availability

99.9% (multi-AZ, Redis HA, Kafka replication)

Database Schema (HLD)

files

Source of truth for file metadata mapping user+path to ordered list of block hashes. Each row represents the current version of a file. Partitioned across 32 PostgreSQL shards by user_id, which keeps each user's files on one shard so the UNIQUE (user_id, path) constraint can be enforced locally. Write-heavy during uploads (new version on every sync).

file_id UUID PK
user_id UUID (shard key)
path TEXT
block_hashes TEXT[] (ordered SHA-256 list)
version INTEGER
size_bytes BIGINT
last_modified TIMESTAMPTZ

Indexes: PK on file_id, UNIQUE on (user_id, path), idx_files_user ON (user_id)

The block_hashes array is the core data structure — an ordered list of SHA-256 hashes representing the file content. For a 1GB file, this array has 250 entries (~16KB). Read on every download and dedup check. Written on every upload-finalize.

blocks

Block reference tracking table mapping SHA-256 hashes to reference counts. Used for dedup checks (does this block exist?) and garbage collection (can this block be deleted?). Cross-user — a single entry serves all users referencing that content.

block_hash TEXT PK (SHA-256)
ref_count INTEGER
size_bytes INTEGER
created_at TIMESTAMPTZ
delete_after TIMESTAMPTZ (null if ref_count > 0)

Indexes: PK on block_hash, idx_blocks_gc ON (delete_after) WHERE delete_after IS NOT NULL

Ref_count is incremented on upload-finalize and decremented on file delete. When ref_count reaches 0, delete_after is set to now + 24 hours. The GC sweep index targets only blocks pending deletion.

file-changed (Kafka topic)

Kafka topic carrying sync notification events. Partitioned by user_id (32 partitions) for per-user ordering. 7-day retention for offline device catch-up. Consumed by connected devices for sync and by DeduplicationWorker for GC.

user_id TEXT (partition key)
file_path TEXT
version INTEGER
change_type TEXT (create/update/delete)
timestamp BIGINT

Partitioning: by user_id (32 partitions)

Two consumer groups: device sync (real-time notification) and DeduplicationWorker (GC processing). Consumer lag is the key metric — lag > 5 seconds means devices receive delayed sync notifications.

Event Contracts

file_changed (topic: file-changed)

File sync events published by SyncService on every file create, update, or delete. Consumed by connected devices for sync notification and by DeduplicationWorker for block garbage collection.

Key Schema

user_id (string)

Value Schema

{ user_id: string, file_path: string, version: number, change_type: create|update|delete, block_hashes?: string[], timestamp: number }
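One way to pin the contract down in code is a small dataclass that validates change_type and serializes to the key and value shown above. This is a sketch under assumptions: JSON encoding and the class name `FileChanged` are illustrative choices, and the source does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

CHANGE_TYPES = {"create", "update", "delete"}


@dataclass
class FileChanged:
    user_id: str
    file_path: str
    version: int
    change_type: str
    timestamp: int
    block_hashes: Optional[list[str]] = None  # optional field in the value schema

    def __post_init__(self) -> None:
        if self.change_type not in CHANGE_TYPES:
            raise ValueError(f"unknown change_type: {self.change_type!r}")

    def key(self) -> bytes:
        """Message key: user_id, so all of a user's events share a partition."""
        return self.user_id.encode("utf-8")

    def value(self) -> bytes:
        """Message value as JSON, omitting block_hashes when it is not set."""
        payload = {k: v for k, v in asdict(self).items() if v is not None}
        return json.dumps(payload, sort_keys=True).encode("utf-8")


event = FileChanged(
    user_id="u1",
    file_path="/docs/report.docx",
    version=2,
    change_type="update",
    timestamp=1_700_000_000_000,
)
decoded = json.loads(event.value())
```

Rejecting unknown change_type values at construction time keeps malformed events out of the topic rather than leaving every consumer to handle them.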

Solution Comparison

Variant	Tier	Latency	Throughput	Cost	Complexity	Reliability
V0: Naive (Whole-File Upload + Polling)	T1	80s+ upload (1GB), 0-5s sync delay	~1K RPS	$500/month	Low	99% (single DB)
V1: Block-Level Chunked (SHA-256 + Kafka)	T2	0.3s upload (changed block), <5s sync	25K RPS peak	$3,000/month	Medium	99.9% (multi-AZ)
V2: Delta Sync + Dedup (WebSocket + CDN)	T3	<0.1s delta upload, <2s sync	25K RPS peak	$8,000/month	High	99.95% (multi-AZ, CDN)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions

Why 4MB block size instead of variable-size chunking (like rsync)?

Fixed 4MB blocks simplify the architecture significantly. Variable-size chunking (content-defined chunking using Rabin fingerprints) provides better dedup for content that shifts within a file (e.g., inserting text at the beginning shifts every subsequent byte, so the content of every fixed-size block changes and all blocks get new hashes). However, variable chunking adds complexity: the client must implement the Rabin fingerprint algorithm, block boundaries are unpredictable, and metadata must track both block hash and block size. Dropbox uses fixed 4MB blocks in practice and achieves 30-50% dedup savings. The marginal improvement from variable chunking (estimated 5-10% additional savings) does not justify the complexity for most workloads.

What happens when two devices edit the same file simultaneously?

The block-level architecture detects conflicts via version numbers. When Device A uploads version N+1, it includes the base version (N) in the upload-init request. If Device B also tries to upload version N+1 from base version N, SyncService detects the conflict (version N+1 already exists from Device A). Rather than silently overwriting, SyncService creates a 'conflicted copy' — Device B's version is saved as 'filename (conflicted copy - Device B - timestamp)'. The user must manually merge the two versions. This is Dropbox's actual conflict resolution strategy for non-collaborative file types.
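The version check is a form of optimistic concurrency control. The sketch below is a simplified, single-process illustration; `FileRecord` and `commit` are hypothetical names, and a real service would perform the compare-and-increment atomically in the database.

```python
from dataclasses import dataclass, field


@dataclass
class FileRecord:
    version: int = 0
    conflicted_copies: list[str] = field(default_factory=list)


def commit(record: FileRecord, base_version: int, device: str,
           timestamp: str, filename: str) -> str:
    """Accept the upload if it was based on the current version, else save a conflicted copy."""
    if base_version == record.version:
        record.version += 1
        return filename
    copy_name = f"{filename} (conflicted copy - {device} - {timestamp})"
    record.conflicted_copies.append(copy_name)
    return copy_name


record = FileRecord(version=5)
# Both devices edited version 5; Device A reaches the server first.
saved_a = commit(record, base_version=5, device="Device A",
                 timestamp="2024-01-01", filename="report.docx")
saved_b = commit(record, base_version=5, device="Device B",
                 timestamp="2024-01-01", filename="report.docx")
```

Device A's upload becomes version 6 under the original name, while Device B's upload is preserved under a conflicted-copy name instead of overwriting it.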

How does garbage collection handle the grace period safely?

The 24-hour grace period works as follows: (1) When a file is deleted, DeduplicationWorker decrements ref_count for each block in the file's block list. (2) When ref_count reaches 0, the block is marked with a 'delete_after' timestamp set to now + 24 hours. (3) A separate GC sweep (hourly) finds blocks where delete_after < now and deletes them from S3. (4) If a new upload references the block before delete_after, the ref_count is incremented back above 0, and the delete_after timestamp is cleared. The grace period must exceed the maximum expected upload duration (a 50GB file at 100 Mbps takes approximately 67 minutes).
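The four steps can be sketched as a small in-memory table. `BlockTable` is a hypothetical stand-in for the blocks table plus ObjectStorage, and times are plain seconds. In a real database, each method would need to run as a single transaction so that a sweep and a new reference cannot interleave.

```python
GRACE_SECONDS = 24 * 3600  # 24-hour grace period from the text


class BlockTable:
    def __init__(self) -> None:
        self.ref_count: dict[str, int] = {}
        self.delete_after: dict[str, int] = {}
        self.stored: set[str] = set()

    def add_reference(self, block_hash: str) -> None:
        """Step 4: a new reference bumps ref_count and cancels any pending delete."""
        self.stored.add(block_hash)
        self.ref_count[block_hash] = self.ref_count.get(block_hash, 0) + 1
        self.delete_after.pop(block_hash, None)

    def drop_reference(self, block_hash: str, now: int) -> None:
        """Steps 1-2: decrement, and schedule deletion when the count hits zero."""
        self.ref_count[block_hash] -= 1
        if self.ref_count[block_hash] == 0:
            self.delete_after[block_hash] = now + GRACE_SECONDS

    def sweep(self, now: int) -> list[str]:
        """Step 3: delete blocks whose grace period has expired."""
        expired = [h for h, t in self.delete_after.items() if t < now]
        for block_hash in expired:
            del self.delete_after[block_hash]
            del self.ref_count[block_hash]
            self.stored.discard(block_hash)
        return expired


table = BlockTable()
table.add_reference("X")
table.drop_reference("X", now=100)          # last file deleted at t=100
swept_early = table.sweep(now=3600)         # an hour later: still within grace
table.add_reference("X")                    # a new upload references X again
swept_late = table.sweep(now=100 + GRACE_SECONDS + 1)
```

Because the new reference cleared delete_after, the later sweep leaves block X alone even though its original deadline has passed.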

How does the dedup check work during upload-init?

The client sends all block SHA-256 hashes in the upload-init request. SyncService checks each hash against MetadataCache (Redis) first — cache hit in 2ms confirms the block exists. On cache miss, SyncService queries MetadataDB (PostgreSQL) blocks table — a SELECT by primary key (block_hash) in 15ms. The response includes only the hashes that are NOT found in either cache or database — these are the blocks the client must upload. For a typical edit to a 1GB file (250 blocks, 2 changed), the dedup check finds 248 existing blocks and returns only 2 hashes for upload.
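The two-tier lookup can be sketched with sets standing in for Redis and the PostgreSQL blocks table. Populating the cache after a database hit is an assumption added here for illustration; the source only says the cache is checked first and the database second.

```python
def blocks_needed(hashes: list[str], cache: set[str], db: set[str]) -> list[str]:
    """Return the hashes found in neither the cache nor the database."""
    needed: list[str] = []
    for block_hash in hashes:
        if block_hash in cache:
            continue  # cache hit: block exists
        if block_hash in db:
            cache.add(block_hash)  # assumed behavior: warm the cache on a DB hit
            continue
        needed.append(block_hash)
    return needed


# A 250-block file where the last 2 blocks are new (labels stand in for SHA-256 hashes).
hashes = [f"h{i}" for i in range(250)]
db = set(hashes[:248])       # 248 blocks already stored
cache = set(hashes[:200])    # 200 of them are hot in the cache

to_upload = blocks_needed(hashes, cache, db)
```

The client is told to upload only the two hashes that neither tier recognizes.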

Why not use S3 Transfer Acceleration or multipart upload?

S3 Transfer Acceleration routes uploads through CloudFront edge locations for faster long-distance transfers. It is orthogonal to the chunking architecture: blocks can use Transfer Acceleration for the S3 PUT. S3 multipart upload is designed for large single-object uploads (splitting one S3 object into parts). Our blocks are already 4MB, small enough for a single PUT and below the 5 MiB minimum that multipart upload imposes on every part except the last. Multipart upload would add complexity (managing upload IDs and part ETags) without benefit, since each block is already independently uploadable.

Related Templates

Dropbox / File Sync — Naive (Whole-File Upload + Polling)
Dropbox / File Sync — Delta Sync + Dedup (WebSocket + CDN)
Video Streaming — Naive Distributed Filesystem (HDFS/GFS)
