Vetora logo
Hard11 componentsInterview: Very High

Dropbox / File Sync — Delta Sync + Dedup (WebSocket + CDN)

Full production file sync architecture with binary delta uploads (rsync-style rolling hash), content-addressed block storage (SHA-256), cross-user dedup, WebSocket push notifications for sub-second sync, CDN-accelerated downloads, and automated garbage collection. Separate upload and download paths with independent scaling.

StorageDelta SyncWebSocketCDNKafkaFile Sync
Problem Statement

The delta sync approach represents the production-grade file sync architecture used by Dropbox, Google Drive, and OneDrive at scale. It builds on block-level chunking (V1) by adding three key optimizations: binary delta uploads, WebSocket push notifications, and CDN-accelerated downloads.

Binary delta sync is the most significant optimization. Even with 4MB block-level chunking, a small edit to a block still requires re-uploading the full 4MB. Delta sync computes a binary diff (using rsync-style rolling hash) between the old and new versions of a changed block, uploading only the changed bytes. For a typical text edit (100 bytes changed in a 4MB block), the delta is approximately 200 bytes — a 20,000x reduction from uploading the full block. Dropbox implemented this optimization in 2012 and reported bandwidth savings exceeding 95% for typical document editing workflows.

The delta sync protocol works as follows: (1) Client detects file change and splits into 4MB blocks with SHA-256 hashes. (2) Client compares new hashes with cached previous version to identify changed blocks. (3) For each changed block, client computes rolling-hash delta between old and new block versions. (4) Client uploads only the deltas to ChunkService. (5) ChunkService retrieves the base block from BlockStore (S3), applies the delta to reconstruct the new block, verifies SHA-256 integrity, and stores the new block. This adds one S3 read per delta upload but saves massive bandwidth.

WebSocket push replaces Kafka-based polling for sync notifications. Each connected device maintains a persistent WebSocket connection to NotificationService. When a file_changed event is published to SyncStream (Kafka), NotificationService pushes it to all connected devices for that user within 200ms. This provides sub-second sync awareness versus the 5-second polling delay in V0 or the variable Kafka consumer lag in V1. At 100M connected devices, NotificationService requires approximately 20 pods handling 5M connections each.

CDN-accelerated downloads serve content-addressed blocks from CloudFront edge caches. Since blocks are immutable (the SHA-256 hash never changes), they can be cached indefinitely at edge locations. For shared files (company documents, open-source project assets), CDN hit rate exceeds 60%, dramatically reducing both download latency (sub-10ms from edge versus 20ms+ from S3) and S3 egress costs.

The architecture separates upload and download into independent paths with their own load balancers and service pools. ChunkService handles the CPU-intensive upload path (delta reconstruction, SHA-256 verification). DownloadService handles the I/O-intensive download path (metadata lookup, pre-signed URL generation). This separation allows each path to scale independently based on its resource requirements.

Garbage collection uses a 72-hour grace period (extended from V1's 24 hours) to handle the longer tail of large delta uploads. GCWorker also manages version history cleanup — deleting file versions older than 30 days to control MetadataDB growth.

Interviewers at senior levels expect candidates to explain the delta sync protocol, discuss content-addressed block caching at CDN edge nodes, reason about WebSocket connection management at scale, analyze the trade-off between delta reconstruction overhead and bandwidth savings, and design the garbage collection strategy for sharded metadata.

Architecture Overview

The delta sync architecture uses eleven components organized into five layers: traffic entry (Client, ApiGateway), upload path (UploadLB, ChunkService), download path (DownloadLB, DownloadService, CDN), notification path (NotificationService consuming from SyncStream), data stores (MetadataDB, BlockStore), and async processing (SyncStream/Kafka, GCWorker).

The upload path handles file changes with delta optimization. Client detects a file change, splits into 4MB blocks, computes SHA-256 hashes, and compares with the cached previous version. For changed blocks, the client computes a binary delta (rsync rolling hash) between old and new block versions. The upload-init request sends all new hashes and delta sizes to ChunkService via ApiGateway and UploadLB. ChunkService checks MetadataDB for existing blocks (dedup check) and returns only the missing hashes. The client then uploads deltas for changed blocks — typically 1-10% of full block size. ChunkService retrieves the base block from BlockStore, applies the delta to reconstruct the new block, verifies SHA-256 integrity, and stores the reconstructed block in BlockStore. On finalize, ChunkService updates MetadataDB (sharded by user_id across 64 partitions) and publishes a file_changed event to SyncStream. 15 ChunkService pods handle the CPU-intensive delta reconstruction.

The download path serves file content with CDN acceleration. DownloadService (10 pods) receives requests via ApiGateway and DownloadLB, resolves file paths to block hash lists from MetadataDB, and generates pre-signed S3 URLs or CDN URLs for each block. The client downloads only blocks not in its local cache. CloudFront CDN edge caches serve popular blocks with sub-10ms latency — content-addressed blocks are immutable and cacheable indefinitely, achieving 60%+ hit rate for shared content.

The notification path provides sub-second sync awareness. NotificationService (20 pods) maintains persistent WebSocket connections with all connected devices. It consumes file_changed events from SyncStream (Kafka) and pushes them to the appropriate user's WebSocket connections. Notifications arrive within 200ms of the Kafka publish, compared to 5-second polling in V0. At 100M connected devices, NotificationService manages approximately 5M connections per pod.

MetadataDB (PostgreSQL) is sharded by user_id across 64 partitions. Each shard handles approximately 11M users at 700M total users. Sharding by user_id co-locates all of a user's file operations on the same shard, avoiding cross-shard queries. The blocks table is a global table (not sharded) since block dedup is cross-user.

BlockStore (S3) uses SHA-256 content hash as the S3 key. Blocks are immutable once written — never updated, only created and eventually deleted by GCWorker. This immutability enables aggressive CDN caching and eliminates cache invalidation complexity.

GCWorker (15 workers) consumes file_changed events from SyncStream, manages block reference counts in MetadataDB, and deletes orphaned blocks from BlockStore after a 72-hour grace period. Also handles version history cleanup (deleting versions older than 30 days).

Architecture Preview
Loading architecture preview...
Key Design Decisions
Binary Delta Sync (Rolling Hash)

Choice

Rsync-style rolling hash delta between old and new block versions

Rationale

A 100-byte edit to a 4MB block uploads only the delta (~200 bytes) instead of the full 4MB — a 20,000x bandwidth reduction. The rolling hash algorithm (Adler-32 or xxHash for fast computation, MD5 for strong matching) identifies matching regions between old and new block versions and encodes only the differences. The trade-off is: (1) client CPU for delta computation (~50ms per block), (2) server CPU for delta reconstruction (~30ms), and (3) one additional S3 read per delta to retrieve the base block. For typical document editing, the bandwidth savings far outweigh the compute overhead.

Separate Upload and Download Paths

Choice

Independent ChunkService (upload) and DownloadService (download) with separate LBs

Rationale

Uploads are write-heavy and CPU-intensive (delta reconstruction, SHA-256 verification, S3 writes). Downloads are read-heavy and I/O-intensive (metadata lookup, S3 reads, CDN URL generation). Separating them allows independent scaling: 15 ChunkService pods for upload throughput versus 10 DownloadService pods for read throughput. A combined service would need all pods sized for the heavier upload workload, wasting resources during read-heavy periods (which dominate — downloads outnumber uploads 3:1).

WebSocket Push Notifications

Choice

Persistent WebSocket connections to NotificationService replacing HTTP polling

Rationale

Polling every 5 seconds at 100M users generates 20M QPS of mostly-empty responses. WebSocket push delivers notifications in 200ms with zero wasted traffic — events are sent only when changes occur. The trade-off is connection management complexity: 100M persistent connections require approximately 20 pods with 5M connections each, rolling deployments to avoid connection storms, and exponential backoff for reconnection. But the bandwidth savings (eliminating 20M QPS) and the 10x faster notification (200ms versus 2.5s average polling delay) justify the complexity.

CDN for Content-Addressed Block Downloads

Choice

CloudFront edge caches for immutable content-addressed blocks

Rationale

Content-addressed blocks are ideal for CDN caching because they are immutable — the SHA-256 hash never changes, so cache invalidation is never needed. Popular shared files (company documents, open-source repos) achieve 60%+ CDN hit rate, serving downloads from 400+ edge locations with sub-10ms latency. S3 egress costs drop proportionally to CDN hit rate. The trade-off is CDN cost (approximately $0.085/GB for CloudFront versus $0.09/GB for S3 egress), which is offset by reduced S3 request costs at high hit rates.

Sharded MetadataDB (64 Partitions)

Choice

PostgreSQL sharded by user_id across 64 partitions with read replicas

Rationale

At 700M users, a single PostgreSQL instance cannot handle the write throughput for file metadata updates. Sharding by user_id distributes writes evenly — each shard handles approximately 11M users. All of a user's file operations (list, upload, download, version history) are co-located on the same shard, avoiding cross-shard joins. The blocks table remains global (not user-sharded) since dedup is cross-user — block lookups by hash are simple primary key queries that scale well even on a single instance.

72-Hour GC Grace Period

Choice

Orphaned blocks deleted 72 hours after ref_count reaches zero

Rationale

The grace period prevents data loss from race conditions between concurrent uploads and deletes. A 50GB file upload at 100 Mbps takes approximately 67 minutes. With delta sync, the upload may span multiple sessions over hours if the user is on a slow connection. The 72-hour grace period ensures even the slowest multi-session uploads complete before any referenced blocks are garbage collected. The storage cost of 72 hours of orphaned blocks is approximately 0.1% of total storage — a negligible premium for data safety.

Scale & Performance

Target RPS

25K peak (upload + download + notifications)

Latency (p99)

<0.1s delta upload (200 bytes), <2s sync notification, <50ms CDN download

Storage

3-5 PB at 1M users (30-50% dedup), scales to 1+ EB

Availability

99.95% (multi-AZ, CDN, Kafka replication)

Time & Space Complexity
OperationTimeSpaceNotes
Delta upload (rolling hash diff + reconstruction)O(B) — B is block size (4MB), linear scan for rolling hashO(B) — buffer for base block + delta + reconstructed blockClient computes delta in ~50ms per block (Adler-32 rolling hash). Server reconstructs in ~30ms. Total per-block overhead: ~80ms compute + 20ms S3 read for base block. Bandwidth savings: 95%+ for typical edits. Net win when delta size < 3.9MB (almost always).
Dedup check (upload-init)O(H) — H is number of block hashes in the fileO(H) — hash list in request/responseFor a 1GB file: H = 250. MetadataDB lookup by primary key (block_hash) is O(1) per hash. Total dedup check: 250 x O(1) = O(250) ~ 10-15ms. Cache hit on MetadataCache: 2ms per hash. Typical result: 248 existing + 2 new hashes.
WebSocket notification deliveryO(D) — D is number of connected devices for the userO(1) — single event pushed per deviceAverage user has 2-3 connected devices. Event routing via user_id lookup in connection registry (Redis hash): O(1). Total push latency: ~200ms from Kafka publish to WebSocket delivery.
CDN block download (cache hit)O(1) — direct edge cache lookup by SHA-256 hashO(1) — 4MB block contentContent-addressed blocks are immutable — cache hit rate increases monotonically as popular blocks propagate to edge locations. No invalidation logic needed. Cache miss falls through to S3 origin (~20ms additional latency).
Database Schema (HLD)
files (PostgreSQL, sharded by user_id)

File metadata mapping user+path to ordered block hashes. Sharded by user_id across 64 PostgreSQL partitions. Each shard handles approximately 11M users. Full version history retained for 30 days for rollback capability.

file_id UUID PKuser_id UUID (shard key)path TEXTblock_hashes TEXT[] (ordered SHA-256 list)version INTEGERsize_bytes BIGINTlast_modified TIMESTAMPTZ

Indexes: PK on file_id, UNIQUE on (user_id, path), idx_files_user ON (user_id)

Sharded by user_id so all file operations for a user hit the same shard. Block_hashes array is the core data — an ordered list of SHA-256 hashes. For a 1GB file: 250 entries (~16KB). Version history stored in file_versions table for 30-day rollback.

blocks (PostgreSQL, global)

Global block reference tracking table. Not sharded — block dedup is cross-user, so a single table serves all users. SHA-256 hash as primary key for O(1) lookups. Ref_count manages garbage collection lifecycle.

block_hash TEXT PK (SHA-256)ref_count INTEGERsize_bytes INTEGERcreated_at TIMESTAMPTZdelete_after TIMESTAMPTZ (null if ref_count > 0)

Indexes: PK on block_hash, idx_blocks_gc ON (delete_after) WHERE delete_after IS NOT NULL

The ref_count is the critical field. Incremented on upload-finalize, decremented on file delete or version expiration. When ref_count reaches 0, delete_after is set to now + 72 hours. GCWorker sweeps hourly for blocks where delete_after < now. Race condition safety: ref_count increment is atomic (UPDATE blocks SET ref_count = ref_count + 1).

file_versions (PostgreSQL, sharded by user_id)

Version history for file rollback. Stores previous block_hashes arrays per version. Retained for 30 days, then cleaned up by GCWorker. Enables server-side delta computation when client lacks the base version.

version_id UUID PKfile_id UUID FKversion INTEGERblock_hashes TEXT[]created_at TIMESTAMPTZ

Indexes: PK on version_id, idx_versions_file ON (file_id, version), idx_versions_cleanup ON (created_at) WHERE created_at < now() - interval '30 days'

Co-located with the files table on the same shard (same user_id). Growth: approximately 10 versions per file per month. GCWorker deletes rows older than 30 days and decrements ref_counts for the expired version's blocks.

file-changed (Kafka topic)

Kafka topic carrying file sync events. 64 partitions, partitioned by user_id. Consumed by NotificationService (WebSocket push) and GCWorker (garbage collection). 7-day retention for offline device catch-up.

user_id TEXT (partition key)file_path TEXTversion INTEGERchange_type TEXT (create/update/delete)block_hashes TEXT[] (for create/update)timestamp BIGINT

Indexes: Partitioned by user_id (64 partitions)

Two consumer groups: NotificationService (real-time WebSocket push, must keep lag < 1 second) and GCWorker (garbage collection, can tolerate lag up to 1 hour). 1M msg/sec capacity handles peak sync events.

Event Contracts
file_changedfile-changed

File sync events published by ChunkService on every file create, update, or delete. Consumed by NotificationService (WebSocket push to connected devices) and GCWorker (block reference count management and garbage collection).

Key Schema

user_id (string)

Value Schema

{ user_id: string, file_path: string, version: number, change_type: create|update|delete, block_hashes?: string[], delta_sizes?: number[], timestamp: number }

Solution Comparison
VariantTierLatencyThroughputCostComplexityReliability
V0: Naive (Whole-File Upload + Polling)T180s+ upload (1GB), 0-5s sync delay~1K RPS$500/monthLow99% (single DB)
V1: Block-Level Chunked (SHA-256 + Kafka)T20.3s upload (changed block), <5s sync25K RPS peak$3,000/monthMedium99.9% (multi-AZ)
V2: Delta Sync + Dedup (WebSocket + CDN)T3<0.1s delta upload, <2s sync25K RPS peak$8,000/monthHigh99.95% (multi-AZ, CDN)

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.

Frequently Asked Questions
How does delta sync compare to block-level chunking in bandwidth savings?

Block-level chunking (V1) uploads the full 4MB block when any byte within it changes. Delta sync (V2) uploads only the changed bytes within the block. For a 100-byte text edit in a 4MB block: V0 (naive) uploads 1GB, V1 (chunked) uploads 4MB, V2 (delta) uploads approximately 200 bytes. The savings compound for files edited frequently — a user editing a document 10 times per day saves 40MB/day with chunking versus 2KB/day with delta sync. Dropbox reported that delta sync reduced total upload bandwidth by 95% compared to block-level chunking alone.

Why not use delta sync for downloads as well?

Delta sync for downloads is technically possible but adds complexity with diminishing returns. Downloads already benefit from block-level dedup — the client downloads only blocks it does not have locally. Applying delta on the download path would require the server to compute deltas between the client's block version and the new version, which means the server needs to know what the client already has. This per-client state management is expensive at 700M users. CDN caching provides comparable download optimization (60%+ cache hit rate) without per-client state.

How does WebSocket connection management work at 100M devices?

NotificationService maintains persistent WebSocket connections — approximately 5M per pod across 20 pods. Each connection consumes approximately 2KB of memory (connection state + user routing). At 5M connections: 10GB memory per pod. Connections are authenticated via JWT on upgrade and mapped to user_id for event routing. On pod restart (deployment), connections drop and clients reconnect with exponential backoff (0.5s to 30s jitter). Rolling deployment restarts 2-3 pods at a time, causing at most 15M reconnections. Connection registry in Redis maps user_id to pod for cross-pod routing when a single user has devices connected to different pods.

What happens when the CDN serves a stale block?

Content-addressed blocks cannot be stale — this is the fundamental advantage of content addressing for CDN caching. A block's SHA-256 hash is derived from its content. If the content changes, the hash changes, creating a new S3 key and a new CDN cache entry. The old block and the new block coexist independently in the CDN. There is no invalidation needed because the URL for a given block's content never changes. This is why content addressing is superior to path-based addressing for CDN-backed storage.

How does the system handle a user with 1 million small files?

A user with 1M small files (e.g., a developer syncing node_modules) creates 1M entries in the files table on their MetadataDB shard. Each entry is approximately 500 bytes (path + block hashes for a single block). Total metadata: 500MB on one shard. The initial sync (listing all files) requires a paginated SELECT returning 1M rows — at 1000 rows per page, this takes 1000 API calls over approximately 30 seconds. The V2 architecture mitigates this with a 'namespace tree' — syncing at the directory level with recursive hash trees (Merkle trees) so unchanged subtrees are skipped entirely.

Related Templates

Discussion

Sign in to join the discussion.

Ready to design your own Dropbox / File Sync?

Open the simulator, place components on the canvas, wire them up, and run a traffic simulation to see how your architecture performs under real load.

Open Simulator