Industry-standard file sync architecture using content-addressed 4MB blocks with SHA-256 hashing. Only changed blocks are uploaded, enabling 90%+ bandwidth savings on typical edits. Kafka-based sync notifications for near-real-time propagation. Deduplication worker handles garbage collection of orphaned blocks.
The block-level chunked approach to file sync is the architecture popularized by Dropbox (founded in 2007) and now widely used by cloud storage providers. It solves the two fundamental problems with the naive approach: bandwidth waste from whole-file uploads and storage waste from the lack of deduplication.
The key insight is splitting files into fixed-size blocks (4MB) and computing a SHA-256 hash for each block. The hash serves as both a content fingerprint and a storage key (content addressing). When a user edits a file, only the blocks whose content changed get new hashes, so the client uploads only those blocks. For a typical edit to a 1GB file (250 blocks), only 1-3 blocks change, reducing the upload from 1GB to 4-12MB. This is roughly an 80-250x bandwidth improvement over the naive approach.
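As a rough illustration of the idea above, the following sketch splits a byte string into fixed-size blocks, hashes each one, and reports which block positions changed between two versions. The function names `block_hashes` and `changed_blocks` are hypothetical, and the demo uses 1 KiB blocks instead of 4MB only so that it runs instantly.

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB, matching the block size in the text


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Split data into fixed-size blocks and return each block's SHA-256 hex digest."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Indexes in `new` whose hash differs from (or has no counterpart in) `old`."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]


# Demo with 1 KiB blocks: a one-byte in-place edit touches exactly one block.
original = bytes(10 * 1024)
edited = bytearray(original)
edited[5000] = 0xFF  # byte 5000 falls in block 4 (bytes 4096..5119)
old_hashes = block_hashes(original, 1024)
new_hashes = block_hashes(bytes(edited), 1024)
print(changed_blocks(old_hashes, new_hashes))  # [4]
```

Only the block at index 4 would need to be uploaded in this case; the other nine hashes are unchanged.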
Content addressing also enables dedup at multiple levels. Within a user's files: if the same block appears in multiple file versions, it is stored once. Across users: if 100 users have the same 4MB block (common for OS files, shared libraries, and identical documents), the block is stored once in S3 with a reference count of 100. Dropbox reported 30-50% total storage savings from this approach. SHA-256 offers roughly 128 bits of collision resistance (finding a collision takes on the order of 2^128 hash evaluations), so accidental collisions are not a practical concern.
Resumability is a natural consequence of chunking. If a network failure interrupts an upload, only the currently-uploading 4MB block needs to be retried — all previously uploaded blocks are already stored and indexed. For a 50GB file, this means a failure at 99% progress loses at most 4MB of work instead of 49.5GB.
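A minimal sketch of this resume behavior, assuming an in-memory dict stands in for object storage and a hypothetical `fail_at` parameter simulates the network failure: the retry skips every block whose hash is already stored and only sends the rest.

```python
import hashlib


class NetworkError(Exception):
    """Simulated connection failure."""


def upload_blocks(blocks, store, fail_at=None):
    """Upload blocks in order, skipping any hash already in `store`.

    `store` is a dict standing in for object storage; `fail_at` is a block
    index at which a simulated failure is raised. Returns the number of
    blocks actually sent during this call.
    """
    sent = 0
    for index, block in enumerate(blocks):
        key = hashlib.sha256(block).hexdigest()
        if key in store:
            continue  # already uploaded in a previous attempt
        if index == fail_at:
            raise NetworkError(f"connection dropped at block {index}")
        store[key] = block
        sent += 1
    return sent


blocks = [bytes([i]) * 16 for i in range(5)]  # five distinct tiny blocks
store = {}
try:
    upload_blocks(blocks, store, fail_at=3)  # blocks 0..2 succeed, block 3 fails
except NetworkError:
    pass
resent = upload_blocks(blocks, store)  # retry sends only blocks 3 and 4
print(len(store), resent)  # 5 2
```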
Sync notifications use Kafka (SyncStream) rather than polling. When a file version is committed, a file_changed event is published to Kafka, partitioned by user_id. Connected devices subscribe to their user's partition and receive near-real-time notifications. This eliminates the 95% of poll queries that are wasted in the naive approach while keeping notification delay under 5 seconds (the naive approach's polling delay is 0-5 seconds).
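The effect of partitioning by user_id can be sketched without a Kafka client. In the toy model below, `partition_for` is a hypothetical stand-in for a key-hash partitioner (Kafka's default partitioner uses murmur2, not CRC32), and each partition is just a Python list. The point is only that all events for one user land in the same partition, in publish order.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 32  # partition count taken from the text


def partition_for(user_id: str) -> int:
    # Stand-in key hash; Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_PARTITIONS


partitions = defaultdict(list)  # partition number -> ordered list of events
events = [("alice", "a.txt", 1), ("bob", "b.txt", 1), ("alice", "a.txt", 2)]
for user_id, path, version in events:
    partitions[partition_for(user_id)].append((user_id, path, version))

# Every event for "alice" is in one partition, in the order it was published.
alice_events = [e for e in partitions[partition_for("alice")] if e[0] == "alice"]
print([version for _, _, version in alice_events])  # [1, 2]
```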
The architecture introduces a DeduplicationWorker for garbage collection. When files are deleted or old versions expire, the worker decrements block reference counts. Blocks with zero references are marked for deletion from S3 after a 24-hour grace period. The grace period prevents race conditions where a concurrent upload references a block that is being deleted.
Interviewers expect candidates to explain why 4MB is the optimal block size (balancing dedup granularity, metadata overhead, and upload parallelism), discuss SHA-256 as a content-addressing scheme, reason about the dedup savings, analyze Kafka's role in sync notification, and explain the garbage collection challenge.
The block-level chunked architecture uses ten components organized into five layers: traffic entry (UploadClient, DownloadClient, ApiGateway, MainLB), application logic (SyncService), caching (MetadataCache/Redis), persistent storage (MetadataDB/PostgreSQL, ObjectStorage/S3), and async processing (SyncStream/Kafka, DeduplicationWorker).
The upload path handles file changes. UploadClient detects a file change via filesystem watcher, splits the file into 4MB blocks, computes SHA-256 hashes, and sends POST /api/v1/files/upload-init with the file path and list of block hashes. SyncService checks MetadataCache (Redis) and MetadataDB (PostgreSQL) for existing blocks — this is the dedup check. The response tells the client which blocks are new and need uploading. For typical edits, 90%+ of blocks already exist. The client then uploads only the new blocks via PUT /api/v1/files/chunks/{hash} to ObjectStorage (S3). Each block is keyed by its SHA-256 hash (content-addressed). On finalize, SyncService updates MetadataDB with the new file version (ordered list of block hashes), writes through to MetadataCache, and publishes a file_changed event to SyncStream (Kafka).
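The three-step upload flow described above (upload-init, chunk PUT, finalize) can be sketched with an in-memory stand-in. `SyncServiceSketch` and `sync_file` are hypothetical names, the dicts replace MetadataDB and ObjectStorage, and the list replaces SyncStream. This is an illustration of the control flow under those assumptions, not the actual service API.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SyncServiceSketch:
    """In-memory stand-in for SyncService, MetadataDB, and ObjectStorage."""
    objects: dict = field(default_factory=dict)  # block hash -> bytes
    files: dict = field(default_factory=dict)    # path -> ordered block hashes
    events: list = field(default_factory=list)   # published file_changed events

    def upload_init(self, hashes):
        # Dedup check: report only hashes that are not already stored.
        return [h for h in dict.fromkeys(hashes) if h not in self.objects]

    def put_chunk(self, block_hash, data):
        if hashlib.sha256(data).hexdigest() != block_hash:
            raise ValueError("block content does not match its hash")
        self.objects[block_hash] = data

    def finalize(self, path, hashes):
        missing = [h for h in hashes if h not in self.objects]
        if missing:
            raise ValueError(f"{len(missing)} block(s) not uploaded")
        self.files[path] = list(hashes)
        self.events.append({"event": "file_changed", "file_path": path})


def sync_file(service, path, blocks):
    """Client side: hash blocks, upload only what the service lacks, then finalize."""
    hashes = [hashlib.sha256(b).hexdigest() for b in blocks]
    by_hash = dict(zip(hashes, blocks))
    needed = service.upload_init(hashes)
    for h in needed:
        service.put_chunk(h, by_hash[h])
    service.finalize(path, hashes)
    return len(needed)


service = SyncServiceSketch()
v1 = [b"aaaa", b"bbbb", b"cccc"]
v2 = [b"aaaa", b"BBBB", b"cccc"]  # only the middle block changed
first = sync_file(service, "/doc.txt", v1)
second = sync_file(service, "/doc.txt", v2)
print(first, second)  # 3 1
```

The second sync uploads a single block because the other two hashes are already known to the service.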
The download path handles file retrieval. DownloadClient requests file metadata from SyncService, which resolves the file path to a list of block hashes from MetadataCache or MetadataDB. The client already has most blocks locally from previous syncs — it downloads only the blocks it does not have. Block downloads come directly from ObjectStorage (S3).
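A small sketch of the download side, assuming dicts stand in for the local block store and for ObjectStorage. The helper names are hypothetical. The sketch fetches only the blocks missing locally and checks each fetched block against its hash before reassembling the file.

```python
import hashlib


def sha(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def blocks_to_download(file_hashes, local_blocks):
    """Hashes in the file's block list that are not in the local block store."""
    return [h for h in dict.fromkeys(file_hashes) if h not in local_blocks]


def assemble(file_hashes, local_blocks, remote_blocks):
    """Fetch missing blocks, verify each against its key, and rebuild the file."""
    for h in blocks_to_download(file_hashes, local_blocks):
        data = remote_blocks[h]
        if sha(data) != h:
            raise ValueError(f"corrupt block {h[:8]}")
        local_blocks[h] = data
    return b"".join(local_blocks[h] for h in file_hashes)


remote = {sha(b): b for b in (b"one-", b"two-", b"three")}
local = {sha(b"one-"): b"one-"}  # the client already has the first block
manifest = [sha(b"one-"), sha(b"two-"), sha(b"three")]
needed = blocks_to_download(manifest, local)
rebuilt = assemble(manifest, local, remote)
print(len(needed), rebuilt)  # 2 b'one-two-three'
```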
MetadataCache (Redis) provides sub-2ms lookups for hot file metadata (file-to-block mappings, block existence checks). With 85% hit rate, the cache reduces MetadataDB read load by approximately 6x. Write-through invalidation ensures consistency on file version updates.
SyncStream (Kafka) carries file_changed events from SyncService to connected devices and DeduplicationWorker. Partitioned by user_id for per-user ordering. 32 partitions with 7-day retention for offline device catch-up.
DeduplicationWorker handles block garbage collection. When files are deleted, the worker decrements block reference counts in MetadataDB. Blocks with ref_count=0 are deleted from ObjectStorage after a 24-hour grace period. 10 workers process GC in parallel.
Horizontal scaling: SyncService scales based on request throughput (add pods). MetadataDB scales via read replicas for reads and sharding for writes. ObjectStorage (S3) has no practical capacity limit. Kafka scales by partition count.
Choice
Split files into 4MB content-addressed blocks
Rationale
Block size is a critical trade-off. Too small (e.g., 64KB): excessive metadata overhead, since a 1GB file generates roughly 16,000 block entries. Too large (e.g., 64MB): poor dedup granularity, since a 1-byte edit re-uploads 64MB. 4MB is the sweet spot used by Dropbox: a 1GB file generates 250 manageable block entries, and a typical edit uploads 4-12MB (1-3 changed blocks). At 4MB, each block is also below S3's 5 MiB minimum multipart part size, so it is uploaded with a single PUT rather than as a multipart part.
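The trade-off can be checked with a few lines of arithmetic. The sketch below uses binary units (1 GiB, 4 MiB), which is why it reports 256 entries where the text rounds to 250. `block_stats` is a hypothetical helper.

```python
def block_stats(file_bytes: int, block_bytes: int, changed_blocks: int = 1):
    """Return (metadata entries for the file, bytes uploaded for the edit)."""
    entries = -(-file_bytes // block_bytes)  # ceiling division
    return entries, changed_blocks * block_bytes


GIB = 1024 ** 3
sizes = [("64 KiB", 64 * 1024), ("4 MiB", 4 * 1024 ** 2), ("64 MiB", 64 * 1024 ** 2)]
for label, size in sizes:
    entries, upload = block_stats(GIB, size)
    print(f"{label}: {entries} entries, {upload // 1024} KiB uploaded per changed block")
# 64 KiB: 16384 entries, 64 KiB uploaded per changed block
# 4 MiB: 256 entries, 4096 KiB uploaded per changed block
# 64 MiB: 16 entries, 65536 KiB uploaded per changed block
```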
Choice
SHA-256 hash of block content serves as both fingerprint and S3 key
Rationale
Content addressing enables dedup without a separate dedup index. If two users upload identical blocks, the SHA-256 hash is the same, and the second S3 PUT is idempotent: it writes identical bytes to the same key. The hash also serves as an integrity check, because the downloaded block must hash to its key. Finding a SHA-256 collision takes on the order of 2^128 hash evaluations (the birthday bound for a 256-bit hash), so accidental collisions are not a concern in practice. The trade-off is CPU cost: hashing a 4MB block takes approximately 15ms on modern hardware.
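To put the collision risk in numbers, the standard birthday approximation gives the probability of any collision among n random b-bit hashes as about n^2 / 2^(b+1). The sketch below applies it under the assumption of 1 EB of unique 4MB blocks; that scale is an illustrative assumption, not a measured figure.

```python
from math import log2

HASH_BITS = 256
n_blocks = 10**18 // (4 * 10**6)  # 1 EB of unique 4MB blocks (assumed scale)

# Birthday approximation: p ~= n^2 / 2^(b+1), computed in log2 to avoid underflow.
log2_p = 2 * log2(n_blocks) - (HASH_BITS + 1)
print(n_blocks, round(log2_p))  # 250000000000 -181
```

A probability near 2^-181 is far below the failure rates of the hardware storing the blocks.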
Choice
ElastiCache Redis caching file-to-block mappings with LRU eviction
Rationale
File metadata lookups are the hottest path — every upload-init, download, and sync operation needs them. Redis provides sub-2ms lookups versus 15ms for PostgreSQL. With 85% cache hit rate, Redis handles approximately 21K of the 25K peak metadata reads/sec, reducing PostgreSQL load to approximately 4K reads/sec. Write-through invalidation on file version updates ensures the cache never serves stale block lists.
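The cache arithmetic above can be reproduced directly. The sketch assumes, for simplicity, that a cache miss costs only the 15ms database lookup, and it uses integer percentages to avoid floating-point noise.

```python
PEAK_READS_PER_SEC = 25_000
HIT_PCT = 85
CACHE_MS, DB_MS = 2, 15

cache_reads = PEAK_READS_PER_SEC * HIT_PCT // 100
db_reads = PEAK_READS_PER_SEC - cache_reads
avg_latency_ms = (HIT_PCT * CACHE_MS + (100 - HIT_PCT) * DB_MS) / 100

print(cache_reads, db_reads, avg_latency_ms)  # 21250 3750 3.95
```

The results match the rounded figures in the text: about 21K reads/sec served by Redis and about 4K reaching PostgreSQL.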
Choice
MSK Kafka with user_id partition key for ordered sync events
Rationale
Sync events must fan out to all of a user's connected devices with guaranteed delivery and ordering. Kafka's retained log with per-consumer offsets handles both requirements: events are kept for the configured retention window regardless of whether they have been consumed (covering offline devices), and partitioning by user_id guarantees ordering within a user's file changes. With 7-day retention, a device offline for up to a week can catch up on all missed changes. This eliminates the polling overhead of the naive approach, so no queries are wasted.
Choice
DeduplicationWorker with 24-hour grace period before block deletion
Rationale
Deleting blocks immediately when ref_count reaches zero risks data loss. Consider: User A uploads file version N referencing block X. Before A's commit, User B deletes the last file referencing block X, triggering immediate deletion. User A's commit succeeds, but the file is now corrupted because block X no longer exists. The 24-hour grace period gives in-flight uploads time to commit and re-reference the block before it is deleted. The trade-off is approximately 24 hours of orphaned block storage cost.
Choice
Each 4MB block stored as a separate S3 object keyed by SHA-256 hash
Rationale
S3 is designed for 99.999999999% durability, which is critical for user files. Content-addressed blocks are immutable once written (the hash never changes), so there are no update races: blocks are only created and eventually deleted. That made the design safe even under S3's original eventual-consistency model, and since December 2020 S3 has provided strong read-after-write consistency. S3 also scales without practical capacity limits, handling the 1+ EB storage requirement without capacity planning.
| Metric | Target |
|---|---|
| Target RPS | 25K peak (5K chunks + 12.5K reads + 5K sync + 2.5K init) |
| Latency (p99) | 0.3s per block upload, <5s sync notification, 2ms cache hit |
| Storage | 3-5 PB at 1M users (30-50% dedup savings) |
| Availability | 99.9% (multi-AZ, Redis HA, Kafka replication) |
Source of truth for file metadata mapping user+path to ordered list of block hashes. Each row represents the current version of a file. Partitioned across 32 PostgreSQL shards by file_id. Write-heavy during uploads (new version on every sync).
Indexes: PK on file_id, UNIQUE on (user_id, path), idx_files_user ON (user_id)
The block_hashes array is the core data structure — an ordered list of SHA-256 hashes representing the file content. For a 1GB file, this array has 250 entries (~16KB). Read on every download and dedup check. Written on every upload-finalize.
Block reference tracking table mapping SHA-256 hashes to reference counts. Used for dedup checks (does this block exist?) and garbage collection (can this block be deleted?). Cross-user — a single entry serves all users referencing that content.
Indexes: PK on block_hash, idx_blocks_gc ON (delete_after) WHERE delete_after IS NOT NULL
Ref_count is incremented on upload-finalize and decremented on file delete. When ref_count reaches 0, delete_after is set to now + 24 hours. The GC sweep index targets only blocks pending deletion.
Kafka topic carrying sync notification events. Partitioned by user_id (32 partitions) for per-user ordering. 7-day retention for offline device catch-up. Consumed by connected devices for sync and by DeduplicationWorker for GC.
Indexes: Partitioned by user_id (32 partitions)
Two consumer groups: device sync (real-time notification) and DeduplicationWorker (GC processing). Consumer lag is the key metric — lag > 5 seconds means devices receive delayed sync notifications.
File sync events published by SyncService on every file create, update, or delete. Consumed by connected devices for sync notification and by DeduplicationWorker for block garbage collection.
Key Schema
user_id (string)
Value Schema
{ user_id: string, file_path: string, version: number, change_type: create|update|delete, block_hashes?: string[], timestamp: number }
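One way to model this event in code is a dataclass whose fields mirror the value schema, serialized to JSON with the user_id as the record key. The class name `FileChanged` and the `to_record` method are hypothetical, and JSON encoding is an assumption since the text does not specify a wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Literal, Optional


@dataclass
class FileChanged:
    user_id: str
    file_path: str
    version: int
    change_type: Literal["create", "update", "delete"]
    timestamp: int
    block_hashes: Optional[list[str]] = None  # optional in the schema

    def to_record(self) -> tuple[bytes, bytes]:
        """Return (key, value) bytes; unset optional fields are omitted from the value."""
        value = {k: v for k, v in asdict(self).items() if v is not None}
        return self.user_id.encode("utf-8"), json.dumps(value).encode("utf-8")


key, value = FileChanged("u1", "/a.txt", 3, "delete", 1700000000).to_record()
decoded = json.loads(value)
print(key, decoded["change_type"], "block_hashes" in decoded)  # b'u1' delete False
```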
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Whole-File Upload + Polling) | T1 | 80s+ upload (1GB), 0-5s sync delay | ~1K RPS | $500/month | Low | 99% (single DB) |
| V1: Block-Level Chunked (SHA-256 + Kafka) | T2 | 0.3s upload (changed block), <5s sync | 25K RPS peak | $3,000/month | Medium | 99.9% (multi-AZ) |
| V2: Delta Sync + Dedup (WebSocket + CDN) | T3 | <0.1s delta upload, <2s sync | 25K RPS peak | $8,000/month | High | 99.95% (multi-AZ, CDN) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Fixed 4MB blocks simplify the architecture significantly. Variable-size chunking (content-defined chunking using Rabin fingerprints) provides better dedup for content that shifts within a file (e.g., inserting text at the beginning pushes all subsequent bytes, changing all fixed-block boundaries). However, variable chunking adds complexity: the client must implement the Rabin fingerprint algorithm, block boundaries are unpredictable, and metadata must track both block hash and block size. Dropbox uses fixed 4MB blocks in practice and achieves 30-50% dedup savings. The marginal improvement from variable chunking (estimated 5-10% additional savings) does not justify the complexity for most workloads.
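The boundary-shift problem with fixed blocks is easy to demonstrate. The sketch below uses 512-byte blocks and seeded pseudo-random data (both arbitrary choices for the demo) and compares an in-place edit with a one-byte insertion at the start of the file.

```python
import hashlib
import random


def fixed_hashes(data: bytes, size: int) -> list[str]:
    return [hashlib.sha256(data[i:i + size]).hexdigest() for i in range(0, len(data), size)]


def count_changed(old: list[str], new: list[str]) -> int:
    """Number of positions in `new` whose hash differs from, or is absent in, `old`."""
    return sum(1 for i, h in enumerate(new) if i >= len(old) or old[i] != h)


rng = random.Random(42)
data = bytes(rng.getrandbits(8) for _ in range(4096))  # 8 blocks of 512 bytes
base = fixed_hashes(data, 512)

overwritten = bytes([data[0] ^ 0xFF]) + data[1:]  # in-place edit of the first byte
inserted = b"\x00" + data                         # insertion shifts every later byte

in_place = count_changed(base, fixed_hashes(overwritten, 512))
shifted = count_changed(base, fixed_hashes(inserted, 512))
print(in_place, shifted)  # 1 9
```

The in-place edit changes one block, while the insertion changes all eight existing blocks and adds a ninth. Content-defined chunking is meant to avoid exactly this case.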
The block-level architecture detects conflicts via version numbers. When Device A uploads version N+1, it includes the base version (N) in the upload-init request. If Device B also tries to upload version N+1 from base version N, SyncService detects the conflict (version N+1 already exists from Device A). Rather than silently overwriting, SyncService creates a 'conflicted copy' — Device B's version is saved as 'filename (conflicted copy - Device B - timestamp)'. The user must manually merge the two versions. This is Dropbox's actual conflict resolution strategy for non-collaborative file types.
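The version check can be sketched as a compare-and-increment on the file's current version. `FileState` and `commit` are hypothetical names, and a real service would need to perform the check and the increment atomically (for example inside a database transaction), which this single-threaded sketch does not show.

```python
from dataclasses import dataclass, field


@dataclass
class FileState:
    name: str
    version: int = 0
    conflicted_copies: list = field(default_factory=list)


def commit(state: FileState, base_version: int, device: str, timestamp: str) -> str:
    """Accept the upload if it was based on the current version; otherwise keep a copy."""
    if base_version == state.version:
        state.version += 1
        return f"accepted as version {state.version}"
    copy_name = f"{state.name} (conflicted copy - {device} - {timestamp})"
    state.conflicted_copies.append(copy_name)
    return f"saved as {copy_name}"


doc = FileState("notes.txt", version=7)
result_a = commit(doc, base_version=7, device="Device A", timestamp="2024-01-01")
result_b = commit(doc, base_version=7, device="Device B", timestamp="2024-01-01")
print(result_a)  # accepted as version 8
print(result_b)  # saved as notes.txt (conflicted copy - Device B - 2024-01-01)
```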
The 24-hour grace period works as follows: (1) When a file is deleted, DeduplicationWorker decrements ref_count for each block in the file's block list. (2) When ref_count reaches 0, the block is marked with a 'delete_after' timestamp set to now + 24 hours. (3) A separate GC sweep (hourly) finds blocks where delete_after < now and deletes them from S3. (4) If a new upload references the block before delete_after, the ref_count is incremented back above 0, and the delete_after timestamp is cleared. The grace period must exceed the maximum expected upload duration (a 50GB file at 100 Mbps takes approximately 67 minutes).
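The four steps above can be sketched with an in-memory index. `BlockIndex` is a hypothetical class in which dicts stand in for the blocks table and a set stands in for S3. Times are plain seconds so the demo does not depend on the wall clock.

```python
GRACE_SECONDS = 24 * 3600


class BlockIndex:
    def __init__(self):
        self.ref_count = {}     # block hash -> number of references
        self.delete_after = {}  # block hash -> earliest deletion time (seconds)
        self.storage = set()    # stand-in for objects present in S3

    def add_reference(self, block_hash):
        self.storage.add(block_hash)
        self.ref_count[block_hash] = self.ref_count.get(block_hash, 0) + 1
        self.delete_after.pop(block_hash, None)  # step 4: cancel pending deletion

    def drop_reference(self, block_hash, now):
        self.ref_count[block_hash] -= 1          # step 1
        if self.ref_count[block_hash] == 0:
            self.delete_after[block_hash] = now + GRACE_SECONDS  # step 2

    def sweep(self, now):
        """Step 3: delete blocks whose grace period has expired."""
        expired = [h for h, t in self.delete_after.items() if t < now]
        for h in expired:
            self.storage.discard(h)
            del self.delete_after[h]
            del self.ref_count[h]
        return expired


idx = BlockIndex()
idx.add_reference("x")
idx.add_reference("y")
idx.drop_reference("x", now=0)
idx.drop_reference("y", now=0)
idx.add_reference("x")            # a new upload rescues block x within the grace period
swept = idx.sweep(now=25 * 3600)  # hourly sweep, 25 hours later
print(swept, sorted(idx.storage))  # ['y'] ['x']
```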
The client sends all block SHA-256 hashes in the upload-init request. SyncService checks each hash against MetadataCache (Redis) first — cache hit in 2ms confirms the block exists. On cache miss, SyncService queries MetadataDB (PostgreSQL) blocks table — a SELECT by primary key (block_hash) in 15ms. The response includes only the hashes that are NOT found in either cache or database — these are the blocks the client must upload. For a typical edit to a 1GB file (250 blocks, 2 changed), the dedup check finds 248 existing blocks and returns only 2 hashes for upload.
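The two-tier lookup can be sketched with sets standing in for Redis and the PostgreSQL blocks table. `find_missing` is a hypothetical name, and populating the cache after a database hit is an assumption that the text does not state.

```python
def find_missing(hashes, cache: set, db: set) -> list:
    """Return the hashes found in neither the cache nor the database, in request order."""
    missing = []
    for h in dict.fromkeys(hashes):  # drop duplicate hashes, keep order
        if h in cache:
            continue                 # cache hit: block exists
        if h in db:
            cache.add(h)             # assumed: warm the cache after a DB hit
            continue
        missing.append(h)
    return missing


requested = [f"hash-{i}" for i in range(250)]  # 250 blocks, as in the 1GB example
db = set(requested[:248])                      # 248 blocks already stored
cache = set(requested[:200])                   # a subset of those is hot in the cache
missing = find_missing(requested, cache, db)
print(missing)  # ['hash-248', 'hash-249']
```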
S3 Transfer Acceleration routes uploads through CloudFront edge locations for faster cross-region transfers. It is orthogonal to the chunking architecture — blocks can use Transfer Acceleration for the S3 PUT. S3 multipart upload is designed for large single-object uploads (splitting one S3 object into parts). Our blocks are already 4MB — small enough for a single PUT. Multipart upload would add complexity (managing upload IDs and part ETags) without benefit, since each block is already independently uploadable.