Full production file sync architecture with binary delta uploads (rsync-style rolling hash), content-addressed block storage (SHA-256), cross-user dedup, WebSocket push notifications for sub-second sync, CDN-accelerated downloads, and automated garbage collection. Separate upload and download paths with independent scaling.
The delta sync approach represents the production-grade file sync architecture used by Dropbox, Google Drive, and OneDrive at scale. It builds on block-level chunking (V1) by adding three key optimizations: binary delta uploads, WebSocket push notifications, and CDN-accelerated downloads.
Binary delta sync is the most significant optimization. Even with 4MB block-level chunking, a small edit to a block still requires re-uploading the full 4MB. Delta sync computes a binary diff (using rsync-style rolling hash) between the old and new versions of a changed block, uploading only the changed bytes. For a typical text edit (100 bytes changed in a 4MB block), the delta is approximately 200 bytes — a 20,000x reduction from uploading the full block. Dropbox implemented this optimization in 2012 and reported bandwidth savings exceeding 95% for typical document editing workflows.
The delta sync protocol works as follows: (1) Client detects file change and splits into 4MB blocks with SHA-256 hashes. (2) Client compares new hashes with cached previous version to identify changed blocks. (3) For each changed block, client computes rolling-hash delta between old and new block versions. (4) Client uploads only the deltas to ChunkService. (5) ChunkService retrieves the base block from BlockStore (S3), applies the delta to reconstruct the new block, verifies SHA-256 integrity, and stores the new block. This adds one S3 read per delta upload but saves massive bandwidth.
WebSocket push replaces Kafka-based polling for sync notifications. Each connected device maintains a persistent WebSocket connection to NotificationService. When a file_changed event is published to SyncStream (Kafka), NotificationService pushes it to all connected devices for that user within 200ms. This provides sub-second sync awareness versus the 5-second polling delay in V0 or the variable Kafka consumer lag in V1. At 100M connected devices, NotificationService requires approximately 20 pods handling 5M connections each.
CDN-accelerated downloads serve content-addressed blocks from CloudFront edge caches. Since blocks are immutable (the SHA-256 hash never changes), they can be cached indefinitely at edge locations. For shared files (company documents, open-source project assets), CDN hit rate exceeds 60%, dramatically reducing both download latency (sub-10ms from edge versus 20ms+ from S3) and S3 egress costs.
The architecture separates upload and download into independent paths with their own load balancers and service pools. ChunkService handles the CPU-intensive upload path (delta reconstruction, SHA-256 verification). DownloadService handles the I/O-intensive download path (metadata lookup, pre-signed URL generation). This separation allows each path to scale independently based on its resource requirements.
Garbage collection uses a 72-hour grace period (extended from V1's 24 hours) to handle the longer tail of large delta uploads. GCWorker also manages version history cleanup — deleting file versions older than 30 days to control MetadataDB growth.
Interviewers at senior levels expect candidates to explain the delta sync protocol, discuss content-addressed block caching at CDN edge nodes, reason about WebSocket connection management at scale, analyze the trade-off between delta reconstruction overhead and bandwidth savings, and design the garbage collection strategy for sharded metadata.
The delta sync architecture uses eleven components organized into five layers: traffic entry (Client, ApiGateway), upload path (UploadLB, ChunkService), download path (DownloadLB, DownloadService, CDN), notification path (NotificationService consuming from SyncStream), data stores (MetadataDB, BlockStore), and async processing (SyncStream/Kafka, GCWorker).
The upload path handles file changes with delta optimization. Client detects a file change, splits into 4MB blocks, computes SHA-256 hashes, and compares with the cached previous version. For changed blocks, the client computes a binary delta (rsync rolling hash) between old and new block versions. The upload-init request sends all new hashes and delta sizes to ChunkService via ApiGateway and UploadLB. ChunkService checks MetadataDB for existing blocks (dedup check) and returns only the missing hashes. The client then uploads deltas for changed blocks — typically 1-10% of full block size. ChunkService retrieves the base block from BlockStore, applies the delta to reconstruct the new block, verifies SHA-256 integrity, and stores the reconstructed block in BlockStore. On finalize, ChunkService updates MetadataDB (sharded by user_id across 64 partitions) and publishes a file_changed event to SyncStream. 15 ChunkService pods handle the CPU-intensive delta reconstruction.
The download path serves file content with CDN acceleration. DownloadService (10 pods) receives requests via ApiGateway and DownloadLB, resolves file paths to block hash lists from MetadataDB, and generates pre-signed S3 URLs or CDN URLs for each block. The client downloads only blocks not in its local cache. CloudFront CDN edge caches serve popular blocks with sub-10ms latency — content-addressed blocks are immutable and cacheable indefinitely, achieving 60%+ hit rate for shared content.
The notification path provides sub-second sync awareness. NotificationService (20 pods) maintains persistent WebSocket connections with all connected devices. It consumes file_changed events from SyncStream (Kafka) and pushes them to the appropriate user's WebSocket connections. Notifications arrive within 200ms of the Kafka publish, compared to 5-second polling in V0. At 100M connected devices, NotificationService manages approximately 5M connections per pod.
MetadataDB (PostgreSQL) is sharded by user_id across 64 partitions. Each shard handles approximately 11M users at 700M total users. Sharding by user_id co-locates all of a user's file operations on the same shard, avoiding cross-shard queries. The blocks table is a global table (not sharded) since block dedup is cross-user.
BlockStore (S3) uses SHA-256 content hash as the S3 key. Blocks are immutable once written — never updated, only created and eventually deleted by GCWorker. This immutability enables aggressive CDN caching and eliminates cache invalidation complexity.
GCWorker (15 workers) consumes file_changed events from SyncStream, manages block reference counts in MetadataDB, and deletes orphaned blocks from BlockStore after a 72-hour grace period. Also handles version history cleanup (deleting versions older than 30 days).
Choice
Rsync-style rolling hash delta between old and new block versions
Rationale
A 100-byte edit to a 4MB block uploads only the delta (~200 bytes) instead of the full 4MB — a 20,000x bandwidth reduction. The rolling hash algorithm (Adler-32 or xxHash for fast computation, MD5 for strong matching) identifies matching regions between old and new block versions and encodes only the differences. The trade-off is: (1) client CPU for delta computation (~50ms per block), (2) server CPU for delta reconstruction (~30ms), and (3) one additional S3 read per delta to retrieve the base block. For typical document editing, the bandwidth savings far outweigh the compute overhead.
Choice
Independent ChunkService (upload) and DownloadService (download) with separate LBs
Rationale
Uploads are write-heavy and CPU-intensive (delta reconstruction, SHA-256 verification, S3 writes). Downloads are read-heavy and I/O-intensive (metadata lookup, S3 reads, CDN URL generation). Separating them allows independent scaling: 15 ChunkService pods for upload throughput versus 10 DownloadService pods for read throughput. A combined service would need all pods sized for the heavier upload workload, wasting resources during read-heavy periods (which dominate — downloads outnumber uploads 3:1).
Choice
Persistent WebSocket connections to NotificationService replacing HTTP polling
Rationale
Polling every 5 seconds at 100M users generates 20M QPS of mostly-empty responses. WebSocket push delivers notifications in 200ms with zero wasted traffic — events are sent only when changes occur. The trade-off is connection management complexity: 100M persistent connections require approximately 20 pods with 5M connections each, rolling deployments to avoid connection storms, and exponential backoff for reconnection. But the bandwidth savings (eliminating 20M QPS) and the 10x faster notification (200ms versus 2.5s average polling delay) justify the complexity.
Choice
CloudFront edge caches for immutable content-addressed blocks
Rationale
Content-addressed blocks are ideal for CDN caching because they are immutable — the SHA-256 hash never changes, so cache invalidation is never needed. Popular shared files (company documents, open-source repos) achieve 60%+ CDN hit rate, serving downloads from 400+ edge locations with sub-10ms latency. S3 egress costs drop proportionally to CDN hit rate. The trade-off is CDN cost (approximately $0.085/GB for CloudFront versus $0.09/GB for S3 egress), which is offset by reduced S3 request costs at high hit rates.
Choice
PostgreSQL sharded by user_id across 64 partitions with read replicas
Rationale
At 700M users, a single PostgreSQL instance cannot handle the write throughput for file metadata updates. Sharding by user_id distributes writes evenly — each shard handles approximately 11M users. All of a user's file operations (list, upload, download, version history) are co-located on the same shard, avoiding cross-shard joins. The blocks table remains global (not user-sharded) since dedup is cross-user — block lookups by hash are simple primary key queries that scale well even on a single instance.
Choice
Orphaned blocks deleted 72 hours after ref_count reaches zero
Rationale
The grace period prevents data loss from race conditions between concurrent uploads and deletes. A 50GB file upload at 100 Mbps takes approximately 67 minutes. With delta sync, the upload may span multiple sessions over hours if the user is on a slow connection. The 72-hour grace period ensures even the slowest multi-session uploads complete before any referenced blocks are garbage collected. The storage cost of 72 hours of orphaned blocks is approximately 0.1% of total storage — a negligible premium for data safety.
Target RPS
25K peak (upload + download + notifications)
Latency (p99)
<0.1s delta upload (200 bytes), <2s sync notification, <50ms CDN download
Storage
3-5 PB at 1M users (30-50% dedup), scales to 1+ EB
Availability
99.95% (multi-AZ, CDN, Kafka replication)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Delta upload (rolling hash diff + reconstruction) | O(B) — B is block size (4MB), linear scan for rolling hash | O(B) — buffer for base block + delta + reconstructed block | Client computes delta in ~50ms per block (Adler-32 rolling hash). Server reconstructs in ~30ms. Total per-block overhead: ~80ms compute + 20ms S3 read for base block. Bandwidth savings: 95%+ for typical edits. Net win when delta size < 3.9MB (almost always). |
| Dedup check (upload-init) | O(H) — H is number of block hashes in the file | O(H) — hash list in request/response | For a 1GB file: H = 250. MetadataDB lookup by primary key (block_hash) is O(1) per hash. Total dedup check: 250 x O(1) = O(250) ~ 10-15ms. Cache hit on MetadataCache: 2ms per hash. Typical result: 248 existing + 2 new hashes. |
| WebSocket notification delivery | O(D) — D is number of connected devices for the user | O(1) — single event pushed per device | Average user has 2-3 connected devices. Event routing via user_id lookup in connection registry (Redis hash): O(1). Total push latency: ~200ms from Kafka publish to WebSocket delivery. |
| CDN block download (cache hit) | O(1) — direct edge cache lookup by SHA-256 hash | O(1) — 4MB block content | Content-addressed blocks are immutable — cache hit rate increases monotonically as popular blocks propagate to edge locations. No invalidation logic needed. Cache miss falls through to S3 origin (~20ms additional latency). |
File metadata mapping user+path to ordered block hashes. Sharded by user_id across 64 PostgreSQL partitions. Each shard handles approximately 11M users. Full version history retained for 30 days for rollback capability.
Indexes: PK on file_id, UNIQUE on (user_id, path), idx_files_user ON (user_id)
Sharded by user_id so all file operations for a user hit the same shard. Block_hashes array is the core data — an ordered list of SHA-256 hashes. For a 1GB file: 250 entries (~16KB). Version history stored in file_versions table for 30-day rollback.
Global block reference tracking table. Not sharded — block dedup is cross-user, so a single table serves all users. SHA-256 hash as primary key for O(1) lookups. Ref_count manages garbage collection lifecycle.
Indexes: PK on block_hash, idx_blocks_gc ON (delete_after) WHERE delete_after IS NOT NULL
The ref_count is the critical field. Incremented on upload-finalize, decremented on file delete or version expiration. When ref_count reaches 0, delete_after is set to now + 72 hours. GCWorker sweeps hourly for blocks where delete_after < now. Race condition safety: ref_count increment is atomic (UPDATE blocks SET ref_count = ref_count + 1).
Version history for file rollback. Stores previous block_hashes arrays per version. Retained for 30 days, then cleaned up by GCWorker. Enables server-side delta computation when client lacks the base version.
Indexes: PK on version_id, idx_versions_file ON (file_id, version), idx_versions_cleanup ON (created_at) WHERE created_at < now() - interval '30 days'
Co-located with the files table on the same shard (same user_id). Growth: approximately 10 versions per file per month. GCWorker deletes rows older than 30 days and decrements ref_counts for the expired version's blocks.
Kafka topic carrying file sync events. 64 partitions, partitioned by user_id. Consumed by NotificationService (WebSocket push) and GCWorker (garbage collection). 7-day retention for offline device catch-up.
Indexes: Partitioned by user_id (64 partitions)
Two consumer groups: NotificationService (real-time WebSocket push, must keep lag < 1 second) and GCWorker (garbage collection, can tolerate lag up to 1 hour). 1M msg/sec capacity handles peak sync events.
File sync events published by ChunkService on every file create, update, or delete. Consumed by NotificationService (WebSocket push to connected devices) and GCWorker (block reference count management and garbage collection).
Key Schema
user_id (string)
Value Schema
{ user_id: string, file_path: string, version: number, change_type: create|update|delete, block_hashes?: string[], delta_sizes?: number[], timestamp: number }
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Whole-File Upload + Polling) | T1 | 80s+ upload (1GB), 0-5s sync delay | ~1K RPS | $500/month | Low | 99% (single DB) |
| V1: Block-Level Chunked (SHA-256 + Kafka) | T2 | 0.3s upload (changed block), <5s sync | 25K RPS peak | $3,000/month | Medium | 99.9% (multi-AZ) |
| V2: Delta Sync + Dedup (WebSocket + CDN) | T3 | <0.1s delta upload, <2s sync | 25K RPS peak | $8,000/month | High | 99.95% (multi-AZ, CDN) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Block-level chunking (V1) uploads the full 4MB block when any byte within it changes. Delta sync (V2) uploads only the changed bytes within the block. For a 100-byte text edit in a 4MB block: V0 (naive) uploads 1GB, V1 (chunked) uploads 4MB, V2 (delta) uploads approximately 200 bytes. The savings compound for files edited frequently — a user editing a document 10 times per day saves 40MB/day with chunking versus 2KB/day with delta sync. Dropbox reported that delta sync reduced total upload bandwidth by 95% compared to block-level chunking alone.
Delta sync for downloads is technically possible but adds complexity with diminishing returns. Downloads already benefit from block-level dedup — the client downloads only blocks it does not have locally. Applying delta on the download path would require the server to compute deltas between the client's block version and the new version, which means the server needs to know what the client already has. This per-client state management is expensive at 700M users. CDN caching provides comparable download optimization (60%+ cache hit rate) without per-client state.
NotificationService maintains persistent WebSocket connections — approximately 5M per pod across 20 pods. Each connection consumes approximately 2KB of memory (connection state + user routing). At 5M connections: 10GB memory per pod. Connections are authenticated via JWT on upgrade and mapped to user_id for event routing. On pod restart (deployment), connections drop and clients reconnect with exponential backoff (0.5s to 30s jitter). Rolling deployment restarts 2-3 pods at a time, causing at most 15M reconnections. Connection registry in Redis maps user_id to pod for cross-pod routing when a single user has devices connected to different pods.
Content-addressed blocks cannot be stale — this is the fundamental advantage of content addressing for CDN caching. A block's SHA-256 hash is derived from its content. If the content changes, the hash changes, creating a new S3 key and a new CDN cache entry. The old block and the new block coexist independently in the CDN. There is no invalidation needed because the URL for a given block's content never changes. This is why content addressing is superior to path-based addressing for CDN-backed storage.
A user with 1M small files (e.g., a developer syncing node_modules) creates 1M entries in the files table on their MetadataDB shard. Each entry is approximately 500 bytes (path + block hashes for a single block). Total metadata: 500MB on one shard. The initial sync (listing all files) requires a paginated SELECT returning 1M rows — at 1000 rows per page, this takes 1000 API calls over approximately 30 seconds. The V2 architecture mitigates this with a 'namespace tree' — syncing at the directory level with recursive hash trees (Merkle trees) so unchanged subtrees are skipped entirely.
Sign in to join the discussion.
Ready to design your own Dropbox / File Sync?
Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.
Open Simulator