Advanced multi-region video streaming pipeline with separate upload and playback paths, GPU-accelerated transcoding to 5 HLS resolutions (240p through 4K), Cassandra for globally replicated video metadata, multi-region CDN with automatic origin failover, and dedicated thumbnail generation workers. Designed for 100M+ concurrent viewers at 100+ Tbps egress.
The adaptive multi-region approach to video streaming represents the most advanced architecture for global VOD platforms at YouTube and Netflix scale. It extends the V1 CDN + async transcode architecture with three critical improvements: multi-region origin failover, globally replicated metadata, and separate upload/playback scaling paths.
The first advancement is multi-region CDN with origin failover. The V1 architecture uses a single S3 origin in us-east-1. If S3 experiences a regional outage (which has happened multiple times in AWS history), all CDN cache misses fail globally — even though 95% of traffic is served from edge cache, the 5% that needs origin is completely blocked. New or unpopular videos with cold caches become unavailable. The V2 architecture replicates transcoded segments to a secondary S3 bucket in eu-west-1 using cross-region replication. CloudFront is configured with an origin group: primary origin in us-east-1, failover origin in eu-west-1. When the primary returns 5xx or times out, CDN automatically routes the request to the failover origin on the next retry — transparent to the viewer. Since HLS segments are immutable (once transcoded, they never change), cross-region replication is straightforward and eventually consistent.
The second advancement is Cassandra for globally replicated metadata. The V1 architecture uses PostgreSQL for video metadata (title, manifest URL, view count, status). PostgreSQL replication across regions is complex (logical replication with conflict resolution) and introduces 50-200ms lag. Cassandra provides native multi-datacenter replication with tunable consistency. Using LOCAL_QUORUM reads, each region reads from its own datacenter in 2-5ms with no cross-region hop. Writes use LOCAL_QUORUM as well, and Cassandra's last-write-wins conflict resolution handles the rare case of concurrent updates to the same video from different regions. The trade-off is eventual consistency: a video's status may show 'processing' for 1-3 seconds after transcoding completes in a remote region. For video metadata, this is perfectly acceptable.
The third advancement is separate upload and playback paths. The V1 architecture routes both upload and catalog traffic through a single load balancer and shared API Gateway. A creator event (product launch, live announcement followed by VOD upload) can spike upload traffic 10x, degrading catalog browsing for all viewers. The V2 architecture uses dedicated UploadLB and PlaybackLB with separate service pools. UploadService (8 pods) handles presigned URL generation, multipart upload coordination, and Kafka publishing. ManifestService (20 pods) handles catalog browsing, manifest URL resolution, and thumbnail serving. These scale independently — a creator upload surge does not affect viewer browsing latency.
Additional improvements include GPU-accelerated transcoding (3-5x faster than CPU-only, reducing time-to-playability from 30 minutes to 3-10 minutes), dedicated ThumbnailWorker (extracts key frames and generates thumbnails at multiple sizes independently of video transcoding), 6-second HLS segments (enabling faster quality switching compared to the V1's 10-second segments), and Redis-cached thumbnail metadata for sub-5ms browse responses.
At full scale, the system handles 100M+ concurrent viewers, 500 hours of video uploaded per minute, 100+ Tbps of CDN egress, and stores petabytes of HLS segments across multiple regions. The 12-component architecture is significantly more complex than V1's 10 components, requiring dedicated teams for CDN configuration, Cassandra operations, GPU fleet management, and cross-region replication monitoring.
Interviewers expect candidates to justify multi-region architecture over single-region, explain Cassandra's trade-offs versus PostgreSQL for global metadata, discuss CDN origin failover mechanics, and reason about the cost-benefit of GPU transcoding versus CPU-only.
The adaptive multi-region architecture uses twelve components organized into six layers: traffic ingestion (ViewerClient, UploaderClient, ApiGateway), upload path (UploadLB, UploadService), playback path (PlaybackLB, ManifestService), data stores (VideoMetadataDB/Cassandra, ThumbnailCache/Redis, ObjectStorage/S3), async pipeline (TranscodeStream/Kafka, TranscodeWorker, ThumbnailWorker), and edge delivery (CDN/CloudFront with multi-origin failover).
The upload path is optimized for creator throughput. UploaderClient sends POST /api/v1/videos/upload-init to UploadService via ApiGateway and UploadLB. UploadService creates a video record in VideoMetadataDB (Cassandra with LOCAL_QUORUM write) and generates presigned S3 URLs for multipart upload. The client uploads 100MB chunks directly to S3. On upload completion, UploadService publishes a transcode-job event to TranscodeStream (Kafka, 16 partitions). The service-side steps (upload-init and the completion publish) finish in seconds; the byte transfer itself bypasses UploadService and is bounded by the creator's uplink.
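The chunking arithmetic for the multipart upload can be sketched as follows. This is a minimal illustration, assuming a hypothetical `plan_chunks` helper and decimal megabytes (100 MB = 100,000,000 bytes); it does not call S3 and only computes the part boundaries for which presigned URLs would be issued.

```python
from typing import List, Tuple

CHUNK_SIZE = 100 * 1000 * 1000  # 100 MB per part (decimal MB is an assumption)


def plan_chunks(file_size: int, chunk_size: int = CHUNK_SIZE) -> List[Tuple[int, int, int]]:
    """Return (part_number, start_byte, end_byte_exclusive) for each part."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Integer ceiling division avoids float rounding on very large files.
    count = (file_size + chunk_size - 1) // chunk_size
    return [
        (i + 1, i * chunk_size, min((i + 1) * chunk_size, file_size))
        for i in range(count)
    ]


parts = plan_chunks(10 * 1000 ** 3)  # a 10 GB upload
print(len(parts))  # 100 parts, so 100 presigned URLs
```

Part numbers start at 1 because S3 multipart uploads number their parts from 1.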
The transcoding pipeline uses GPU-accelerated workers. TranscodeWorker (16 pods with GPU) consumes transcode-job events and runs FFmpeg with NVENC hardware encoding. Each video is transcoded to 5 resolutions in parallel (one resolution per GPU stream): 240p (0.5 Mbps), 480p (1.5 Mbps), 720p (3 Mbps), 1080p (5 Mbps), 4K (15 Mbps). Each resolution is split into 6-second HLS segments. Per-variant .m3u8 playlists and a master manifest are generated. All outputs are uploaded to S3 in us-east-1, and S3 cross-region replication copies them to eu-west-1 for origin failover. ThumbnailWorker (8 CPU pods) consumes the same Kafka events, extracts key frames, and generates thumbnails at 3 sizes. Thumbnails complete in 30-60 seconds — much faster than video transcoding.
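To make the bitrate ladder concrete, the sketch below builds an HLS master playlist and estimates segment sizes. The bitrates come from the text above; the pixel dimensions, the `playlist.m3u8` paths, and the function names are illustrative assumptions, not the pipeline's actual output layout.

```python
# Bitrate ladder from the text; pixel dimensions and playlist paths are
# illustrative assumptions.
LADDER = [
    ("240p", 500_000, "426x240"),
    ("480p", 1_500_000, "854x480"),
    ("720p", 3_000_000, "1280x720"),
    ("1080p", 5_000_000, "1920x1080"),
    ("4K", 15_000_000, "3840x2160"),
]


def master_playlist(ladder=LADDER) -> str:
    """Build a master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    for name, bandwidth, resolution in ladder:
        lines.append(f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},RESOLUTION={resolution}")
        lines.append(f"{name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


def segment_bytes(bitrate_bps: int, seconds: int = 6) -> int:
    """Approximate media bytes in one segment, ignoring container overhead."""
    return bitrate_bps * seconds // 8


print(master_playlist())
print(segment_bytes(5_000_000))  # 3750000 bytes for a 6-second 1080p segment
```

The player picks a variant from the master playlist based on measured bandwidth, which is why each entry advertises its `BANDWIDTH`.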
The playback path is optimized for viewer latency. ViewerClient fetches HLS manifests and segments from CDN (CloudFront, 96% cache hit rate). For catalog browsing, requests go through ApiGateway and PlaybackLB to ManifestService (20 pods). ManifestService checks ThumbnailCache (Redis, 92% hit rate) for metadata and thumbnail URLs. On cache miss, it queries VideoMetadataDB (Cassandra LOCAL_QUORUM, 5ms). CDN serves all video content from 400+ edge locations. Multi-origin configuration routes cache misses to the primary origin (us-east-1 S3) with automatic failover to the secondary origin (eu-west-1 S3) on 5xx errors.
Cassandra provides globally replicated metadata with single-digit millisecond reads. The videos table is partitioned by video_id with LOCAL_QUORUM consistency. Multi-datacenter replication (us-east-1 + eu-west-1) ensures metadata is available in both regions. Replication lag is typically under 1 second. Counter columns track view counts with eventual consistency — acceptable for a non-critical metric.
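The replica arithmetic behind these consistency levels is small enough to sketch. The helpers below are hypothetical names (not part of any Cassandra driver) and assume the usual quorum rule of floor(RF / 2) + 1 replicas, with RF=3 per datacenter as an example.

```python
from typing import Dict


def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a quorum: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def local_quorum_ok(rf_local: int, live_local: int) -> bool:
    """LOCAL_QUORUM only counts replicas in the coordinator's datacenter."""
    return live_local >= quorum(rf_local)


def each_quorum_ok(rf_by_dc: Dict[str, int], live_by_dc: Dict[str, int]) -> bool:
    """EACH_QUORUM needs a quorum of replicas in every datacenter."""
    return all(live_by_dc.get(dc, 0) >= quorum(rf) for dc, rf in rf_by_dc.items())


rf = {"us-east-1": 3, "eu-west-1": 3}
print(quorum(3))                                            # 2
print(local_quorum_ok(3, 2))                                # True
print(each_quorum_ok(rf, {"us-east-1": 3, "eu-west-1": 1}))  # False
```

With RF=3, a region keeps serving LOCAL_QUORUM requests after losing one local replica, regardless of the state of the remote datacenter.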
Redis ThumbnailCache (6 nodes, 26GB each) caches video metadata and thumbnail URLs with a 600-second TTL and 92% hit rate. The higher TTL (600s vs V1's 300s) is justified by the larger working set (10M+ videos) and the lower sensitivity of thumbnail URLs to staleness. Cache invalidation triggers on transcode completion (status change from processing to ready).
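The read path described here is a cache-aside pattern with a TTL and explicit invalidation. The sketch below uses an in-memory dictionary as a stand-in for Redis and a callable as a stand-in for the Cassandra query; `TtlCache`, `read_metadata`, and `load_from_db` are hypothetical names, and only the 600-second TTL comes from the text.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class TtlCache:
    """In-memory stand-in for the Redis cache (illustration only)."""

    def __init__(self, ttl_seconds: float = 600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._items: Dict[str, Tuple[float, dict]] = {}

    def get(self, key: str) -> Optional[dict]:
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: dict) -> None:
        self._items[key] = (self._clock() + self._ttl, value)

    def invalidate(self, key: str) -> None:
        """Called when a video's status changes from processing to ready."""
        self._items.pop(key, None)


def read_metadata(video_id: str, cache: TtlCache,
                  load_from_db: Callable[[str], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the database, then fill."""
    cached = cache.get(video_id)
    if cached is not None:
        return cached
    row = load_from_db(video_id)
    cache.set(video_id, row)
    return row
```

Passing the clock in as a parameter keeps expiry testable without sleeping.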
This diagram shows the four primary flows: presigned URL upload, GPU-accelerated async transcoding with thumbnail generation, CDN-served HLS adaptive playback, and CDN origin failover on regional outage. The key advancements over V1 are the separation of upload and playback paths, GPU transcoding, and multi-origin CDN failover.
Choice
CloudFront origin group with primary (us-east-1) and failover (eu-west-1) S3 origins
Rationale
A single-origin CDN has a single point of failure. In the V1 architecture, an S3 outage in us-east-1 makes all cache misses fail globally — 5% of total traffic. The V2 architecture replicates transcoded segments to eu-west-1 via S3 cross-region replication. CloudFront's origin group automatically routes to the failover origin on 5xx or timeout from the primary. Since HLS segments are immutable, cross-region replication is reliable and the failover is transparent to viewers. The cost is doubled storage and cross-region data transfer ($0.02/GB), but the reliability improvement from 99.9% to 99.99% justifies this at YouTube/Netflix scale.
Choice
Amazon Keyspaces (Cassandra-compatible) instead of PostgreSQL for video metadata
Rationale
At 100M+ viewers across multiple regions, metadata reads must be served locally. PostgreSQL cross-region replication introduces 50-200ms lag and complex conflict resolution. Cassandra provides native multi-datacenter replication with LOCAL_QUORUM reads serving from the nearest datacenter in 2-5ms. The trade-off is eventual consistency: metadata may be stale for 1-3 seconds after an update in a remote region. For video metadata (title, view count, manifest URL), this is acceptable. For financial or access control data, it would not be.
Choice
Dedicated UploadLB/UploadService and PlaybackLB/ManifestService with independent scaling
Rationale
Upload traffic (presigned URL generation, multipart coordination, Kafka publishing) has different resource requirements than playback traffic (cache lookups, metadata reads, manifest resolution). A creator event can spike uploads 10x without affecting viewer browsing. Independent scaling: 8 upload pods vs 20 playback pods. The V1 architecture shared a single LB and had VideoService and CatalogService behind it — better than a monolith but still coupled at the LB layer.
Choice
NVENC hardware encoding on GPU-capable instances instead of CPU-only FFmpeg
Rationale
GPU transcoding is 3-5x faster than CPU-only for H.264/H.265 encoding. A 1-hour 4K video takes 60 minutes on CPU versus 15 minutes on GPU. Faster transcoding means faster time-to-playability for creators. GPU instances cost more per hour (about 2x for a g5.2xlarge at $1.212/hour versus a c7g.4xlarge at $0.576/hour) but finish in roughly a quarter of the time, so the cost per video is comparable or lower. The real benefit is throughput: 16 GPU workers handle the same volume as 60-80 CPU workers, dramatically simplifying fleet management.
Choice
Separate CPU workers for thumbnail generation instead of bundling with video transcoding
Rationale
Thumbnail generation (frame extraction, resize, JPEG compression) is CPU-bound but lightweight — 30-60 seconds for a 1-hour video versus 15-60 minutes for video transcoding. Running thumbnails on GPU workers wastes expensive GPU resources. Separate ThumbnailWorker pods (2 vCPU, standard instances) are 5x cheaper per thumbnail. Thumbnails are available for the browse experience within 1 minute of upload, even while video transcoding continues for another 15+ minutes.
Choice
6-second segments instead of the V1's 10-second segments
Rationale
Shorter segments enable faster adaptive bitrate switching. The player can react to bandwidth changes every 6 seconds instead of 10, reducing buffering duration during network degradation. Netflix uses 4-6 second segments for this reason. The trade-off is about 67% more segment files per video (a 1-hour video yields 600 segments per rendition instead of 360, so more S3 objects and more CDN cache entries) and slightly higher overhead from segment container headers. At CDN scale, the additional cache entries are negligible.
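The segment-count trade-off is easy to check numerically. The helper name below is hypothetical; the durations are the 6-second and 10-second values from the text, applied to a 1-hour video as an example.

```python
import math


def segment_count(duration_seconds: int, segment_seconds: int) -> int:
    """Number of segments per rendition; the last segment may be shorter."""
    return math.ceil(duration_seconds / segment_seconds)


one_hour = 3600
v2_segments = segment_count(one_hour, 6)   # 600
v1_segments = segment_count(one_hour, 10)  # 360
increase = v2_segments / v1_segments - 1
print(v2_segments, v1_segments, f"{increase:.0%}")  # 600 360 67%
```

Across the 5 renditions this is 3,000 objects per hour of video instead of 1,800.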
Target RPS
200K peak (150K playback via CDN, 30K catalog, 10K uploads, 10K thumbnails)
Latency (p99)
<800ms playback start, 3-20 min GPU transcode, <10ms metadata reads
Storage
~150 TB/day (5 resolutions + thumbnails, multi-region replicated)
Availability
99.99% (multi-region, CDN origin failover, Cassandra replication)
| Operation | Time | Space | Notes |
|---|---|---|---|
| Upload initiation (presigned URL generation) | O(1) per URL, O(C) per upload — S3 SDK HMAC-SHA256 signing | O(C) — C = number of chunks (file_size / 100MB) | 2ms per URL. 10GB file = 100 chunks = 100 URLs generated in ~200ms. Returned as a JSON array. |
| GPU transcoding (5 HLS tiers, parallel per resolution) | O(N x M / G) — N = duration, M = total pixels, G = GPU parallelism | O(N) — intermediate frames buffered per resolution stream | 1-hour 4K video: ~15 minutes on GPU (vs 60 minutes CPU). 5 resolutions processed in parallel on GPU streams. 6-second segment output. |
| Metadata read (Redis cache hit) | O(1) — Redis GET by key | O(1) — ~3KB per cached entry (includes thumbnail URLs map) | 1ms latency, 92% hit rate. Covers both video metadata and thumbnail URLs in a single cache entry. |
| Metadata read (Cassandra cache miss) | O(1) — partition key lookup with LOCAL_QUORUM | O(1) — single row read | 5ms latency from nearest datacenter. No cross-region hop. 8% of total metadata reads. |
| CDN segment delivery with origin failover | O(1) edge cache lookup + O(1) origin fetch on miss | O(S) — segment size (~3.75MB for 6-second 1080p at 5 Mbps) | 4ms from edge (96% hit). Origin fetch: 8ms primary, 88ms failover (80ms cross-region). Failover is automatic and transparent. |
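The latency figures in the last row can be combined into a hit-rate-weighted average. This is a rough sketch that assumes a cache miss costs the quoted origin fetch time in place of the edge time; the function name is hypothetical.

```python
def expected_latency_ms(hit_rate: float, edge_ms: float, origin_ms: float) -> float:
    """Hit-rate-weighted mean latency across edge hits and origin fetches."""
    return hit_rate * edge_ms + (1.0 - hit_rate) * origin_ms


normal = expected_latency_ms(0.96, 4.0, 8.0)     # primary origin healthy
failover = expected_latency_ms(0.96, 4.0, 88.0)  # misses served from eu-west-1
print(f"{normal:.2f} ms vs {failover:.2f} ms")   # 4.16 ms vs 7.36 ms
```

Under these assumptions a full primary-origin outage raises the mean segment fetch latency by only about 3 ms, because 96% of requests never leave the edge.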
Video metadata partitioned by video_id with multi-region replication. Written on upload init (status=uploading), updated on transcode completion (manifest_url, thumbnail_urls, status=ready). Read on every catalog query and manifest resolution. LOCAL_QUORUM consistency for strong-enough reads within each region.
Partition: video_id
Indexes: Partition key on video_id, Secondary index on status (for admin queries)
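Cassandra counter columns handle view_count increments without read-modify-write. Note that Cassandra requires counter columns to live in a dedicated counter table (one whose non-key columns are all counters), so view_count would sit in a companion table keyed by video_id rather than alongside thumbnail_urls. Multi-datacenter replication ensures metadata is available in both us-east-1 and eu-west-1 with eventual consistency (typically under 1 second). thumbnail_urls is a Cassandra MAP storing {small: url, medium: url, large: url}.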
Channel (creator) metadata. Low write volume. Read for channel pages and subscriber counts. Partitioned by channel_id with multi-region replication.
Partition: channel_id
Indexes: Partition key on channel_id
Counter columns for subscriber_count and video_count. Small table — fully cached in Redis for browse experience.
Published by UploadService after upload completion. Consumed by TranscodeWorker (GPU transcoding to 5 HLS tiers) and ThumbnailWorker (frame extraction + resize). 16 partitions for worker parallelism. 7-day retention for replay.
Key Schema
video_id (string)
Value Schema
{ video_id: string, s3_key: string, file_size: integer, output_prefix: string, resolutions: string[] }
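The key and value schema above maps naturally onto a small typed record. The sketch below is illustrative only: the `TranscodeJob` class and `illustrative_partition` function are hypothetical, and the MD5-based partition choice merely demonstrates that a stable hash of the key pins every event for a video to one of the 16 partitions. Real Kafka clients use their own partitioner (murmur2 in the Java client).

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class TranscodeJob:
    video_id: str
    s3_key: str
    file_size: int
    output_prefix: str
    resolutions: List[str]

    def key(self) -> bytes:
        """Message key: the video_id, so one video always maps to one partition."""
        return self.video_id.encode("utf-8")

    def value(self) -> bytes:
        """Message value: the JSON document described by the value schema."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")


def illustrative_partition(key: bytes, partitions: int = 16) -> int:
    """Stable hash of the key modulo the partition count (not Kafka's own hash)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % partitions


job = TranscodeJob(
    video_id="vid-123",                     # hypothetical identifiers
    s3_key="uploads/vid-123/source.mp4",
    file_size=10_000_000_000,
    output_prefix="hls/vid-123",
    resolutions=["240p", "480p", "720p", "1080p", "4K"],
)
print(job.value().decode("utf-8"))
print(illustrative_partition(job.key()))
```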
Primary origin (us-east-1 S3) outage — multi-region failover activates
Impact
CDN continues serving 96% of traffic from edge cache. The 4% of requests that need the origin fail over to eu-west-1 S3 with 80ms of additional latency. Total viewer impact: 4% of segment requests see ~88ms instead of ~8ms origin fetch time. New uploads to us-east-1 fail until the outage resolves.
Mitigation
CloudFront origin group handles failover automatically. Cross-region replication ensures eu-west-1 has all segments (with eventual consistency lag of seconds to minutes for very recent content). Upload path can be redirected to eu-west-1 via DNS failover for the UploadService.
Cassandra replication lag spike (metadata stale for 10+ seconds across regions)
Impact
Viewers in eu-west-1 see recently uploaded videos as 'processing' for 10+ seconds after they became 'ready' in us-east-1. The viewer retries the page load and sees the correct status. No data loss — only staleness. Playback of existing videos is unaffected (CDN-served, no metadata dependency during segment fetch).
Mitigation
Monitor Cassandra cross-datacenter replication lag. Alert at 5 seconds. If persistent, increase Cassandra write consistency to EACH_QUORUM (writes acknowledged by a quorum of replicas in every datacenter) at the cost of higher write latency (80ms+ instead of 15ms) and failed writes whenever any datacenter loses quorum.
GPU instance shortage (insufficient capacity for TranscodeWorker auto-scaling)
Impact
Transcoding backlog grows in Kafka. New videos take hours instead of minutes to become playable. Creators see extended 'processing' times. Viewer playback of existing videos is completely unaffected.
Mitigation
Multi-instance-type TranscodeWorker configuration: prefer g5.2xlarge, fall back to g4dn.xlarge, then to CPU-only c7g.4xlarge (4x slower but always available). Priority queues in Kafka: short videos (under 5 min) get the fast lane. Spot instances for non-urgent transcoding with on-demand instances for the priority queue.
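The fallback order and the fast lane can be expressed as two small decision functions. This is a sketch under stated assumptions: the instance types and the 5-minute threshold come from the text, while the function names and the `"fast"` and `"standard"` queue labels are hypothetical.

```python
from typing import Iterable, Optional

# Preference order from the text: newest GPU first, CPU-only as the last resort.
INSTANCE_PREFERENCE = ("g5.2xlarge", "g4dn.xlarge", "c7g.4xlarge")
FAST_LANE_MAX_SECONDS = 5 * 60  # videos under 5 minutes get the fast lane


def pick_instance_type(available: Iterable[str]) -> Optional[str]:
    """Return the most preferred instance type that currently has capacity."""
    pool = set(available)
    for instance_type in INSTANCE_PREFERENCE:
        if instance_type in pool:
            return instance_type
    return None


def queue_for(duration_seconds: float) -> str:
    """Route short videos to the priority queue, everything else to standard."""
    return "fast" if duration_seconds < FAST_LANE_MAX_SECONDS else "standard"


print(pick_instance_type({"g4dn.xlarge", "c7g.4xlarge"}))  # g4dn.xlarge
print(queue_for(120), queue_for(3600))                     # fast standard
```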
ThumbnailCache (Redis) failure — all metadata reads hit Cassandra
Impact
Metadata read latency increases from 1ms (Redis) to 5ms (Cassandra) for the 92% of requests that previously hit the cache. Cassandra read volume rises about 12.5x (from 8% to 100% of metadata reads), which the design assumes the cluster can absorb because Cassandra is built for high read throughput. Browse latency increases by ~4ms, which is measurable but not user-visible. Playback via CDN is unaffected.
Mitigation
Redis Cluster with 6 nodes across 3 AZs for HA. On total failure, ManifestService has a local in-memory L1 cache (100MB, 30-second TTL) that absorbs the hottest keys. Cassandra handles the remaining load without degradation at this volume.
| Component | Failure | Impact | Mitigation |
|---|---|---|---|
| CloudFront CDN (Edge Location) | Individual edge location degradation | Viewers in the affected region experience 100-200ms additional latency as traffic reroutes to the next nearest edge. CloudFront handles this automatically. No viewer-visible error — just slightly increased latency. | CloudFront has built-in redundancy. No operator action needed. Multi-region origin failover protects against origin-level failures. |
| VideoMetadataDB (Cassandra) | Single datacenter failure (us-east-1 Cassandra nodes) | Reads in us-east-1 fail if LOCAL_QUORUM cannot be satisfied (need 2 of 3 replicas in the local DC). eu-west-1 reads continue normally. Uploads writing to us-east-1 fail. | Cassandra with RF=3 per datacenter tolerates 1 node failure. On multi-node failure, downgrade read consistency to LOCAL_ONE (single node, risk of stale reads) while repairing. Redirect uploads to eu-west-1 via DNS failover. |
| TranscodeWorker (GPU Pool) | Worker crash during transcoding | Partial segments in S3 from the crashed job. Kafka offset not committed — job redelivered to another worker. New worker starts from scratch (no checkpointing), wasting partial work. | Idempotent transcoding: check S3 for existing segments before re-processing each resolution. Resume from the last incomplete resolution. Dead letter queue for jobs failing 3+ times. |
| TranscodeStream (Kafka) | Kafka cluster failure (all brokers) | New uploads succeed (video bytes in S3) but transcode-job events cannot be published. Videos remain in 'uploaded' status indefinitely. No new transcoding starts. | UploadService outbox pattern: write events to a Cassandra outbox table in the same write as the video record. A recovery sweeper publishes outbox events when Kafka recovers. 7-day retention ensures no event loss on Kafka broker recovery. |
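The idempotent-resume mitigation for TranscodeWorker crashes can be sketched as a check against what already exists in object storage. The sketch assumes a hypothetical key layout in which each rendition's variant playlist is written last and therefore serves as a completion marker; the function names are also hypothetical, and the S3 listing is represented by a plain set of keys.

```python
from typing import Iterable, List, Set


def variant_marker(output_prefix: str, resolution: str) -> str:
    """Key of the per-variant playlist, assumed to be uploaded after all segments."""
    return f"{output_prefix}/{resolution}/playlist.m3u8"


def remaining_resolutions(resolutions: Iterable[str], existing_keys: Set[str],
                          output_prefix: str) -> List[str]:
    """Renditions that still need transcoding after a redelivered job."""
    return [
        r for r in resolutions
        if variant_marker(output_prefix, r) not in existing_keys
    ]


def should_dead_letter(attempts: int, max_attempts: int = 3) -> bool:
    """Jobs that have failed 3 or more times go to the dead letter queue."""
    return attempts >= max_attempts


existing = {"hls/vid-123/240p/playlist.m3u8", "hls/vid-123/480p/playlist.m3u8"}
todo = remaining_resolutions(["240p", "480p", "720p", "1080p", "4K"], existing, "hls/vid-123")
print(todo)  # ['720p', '1080p', '4K']
```

Because the check is keyed on the completion marker, a rendition with only partial segments is redone in full, which keeps the logic simple at the cost of some repeated work.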
Per-component auto-scaling: ManifestService scales on CPU (target 50%, scale at 65%). UploadService scales on request rate (target 5K/sec per pod). TranscodeWorker scales on Kafka consumer lag (add GPU worker per 100 pending jobs). Redis: vertical scaling within r7g family. Cassandra: add nodes to the ring with zero-downtime rebalancing. CDN: automatic, no configuration needed. Cross-region: add a third region (ap-southeast-1) by adding Cassandra datacenter, S3 bucket with replication, and CloudFront origin group member. The architecture supports 500M+ concurrent viewers with proportional infrastructure scaling.
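The lag-based rule for TranscodeWorker reduces to a short calculation. In the sketch below, the 100-jobs-per-worker step and the 16-pod baseline come from the text; the function name and the 64-worker ceiling are assumptions added for illustration.

```python
def desired_gpu_workers(pending_jobs: int, baseline: int = 16,
                        jobs_per_extra_worker: int = 100,
                        max_workers: int = 64) -> int:
    """Add one GPU worker per 100 pending jobs, capped at max_workers."""
    if pending_jobs < 0:
        raise ValueError("pending_jobs must be non-negative")
    extra = pending_jobs // jobs_per_extra_worker
    return min(max_workers, baseline + extra)


print(desired_gpu_workers(0))       # 16
print(desired_gpu_workers(250))     # 18
print(desired_gpu_workers(10_000))  # 64 (capped)
```

A ceiling matters here because GPU capacity is the scarce resource; without one, a large backlog would request more instances than the region can supply.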
Key metrics: (1) CDN cache hit rate — target 96%+. Alert below 92%. A drop indicates either cache eviction pressure (working set exceeds edge cache capacity) or a shift in traffic pattern toward long-tail content. (2) CDN origin failover rate — should be 0% normally. Any non-zero rate indicates primary origin issues. Alert at >0.1%. (3) Cassandra cross-DC replication lag — target under 1 second. Alert at 5 seconds. (4) GPU transcoding queue depth (Kafka consumer lag) — alert at 500+ pending jobs (~2 hours of backlog). (5) ThumbnailCache hit rate — target 92%+. Alert below 85%. (6) Upload initiation p99 latency — target under 200ms. Alert at 500ms. (7) ManifestService p99 response time — target under 50ms. Alert at 100ms. Dashboard: Grafana with panels for CDN egress by edge location, Kafka consumer lag per topic, Cassandra latency per datacenter, Redis hit rate, and GPU utilization per worker.
At 100M concurrent viewers: CloudFront CDN egress (~$60,000/month with reserved capacity), GPU TranscodeWorkers 16 pods (~$8,000/month), UploadService 8 pods ($800/month), ManifestService 20 pods ($2,000/month), Cassandra (Keyspaces) on-demand (~$3,000/month), Redis 6 nodes ($1,500/month), Kafka 3 brokers ($900/month), S3 storage 150TB/day x 30 = 4.5PB ($100,000/month), S3 cross-region replication ($90,000/month). Total: ~$266,200/month. Compare: V1 at 10M viewers costs $22K/month; V2 at 100M viewers costs $266K/month, a 10x viewer increase for a 12x cost increase. Cost per viewer therefore rises about 21% (from roughly $0.0022 to $0.0027 per viewer per month); the premium pays for multi-region storage and replication, which CDN caching efficiency (96% vs 95%) and GPU transcoding amortization only partly offset.
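The cost roll-up can be reproduced directly from the line items. The dictionary keys below are hypothetical labels; every dollar figure is taken from the paragraph above.

```python
MONTHLY_COST_USD = {
    "cdn_egress": 60_000,
    "gpu_transcode_workers": 8_000,
    "upload_service": 800,
    "manifest_service": 2_000,
    "cassandra_keyspaces": 3_000,
    "redis": 1_500,
    "kafka": 900,
    "s3_storage": 100_000,
    "s3_cross_region_replication": 90_000,
}

total = sum(MONTHLY_COST_USD.values())
v1_per_viewer = 22_000 / 10_000_000   # V1: $22K/month at 10M viewers
v2_per_viewer = total / 100_000_000   # V2 at 100M viewers

print(total)                                     # 266200
print(f"{total / 22_000:.1f}x cost")             # 12.1x cost
print(f"{v1_per_viewer:.4f} {v2_per_viewer:.4f}")  # 0.0022 0.0027
```

Storage and replication together account for $190,000 of the $266,200, so the multi-region choice dominates the bill.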
Presigned URL security: URLs signed with SigV4, scoped to specific S3 keys, expire after 1 hour. CDN signed cookies for authenticated playback (premium/subscriber content). Cassandra encryption at rest (AWS managed keys) and in transit (TLS). Cross-region replication encrypted in transit. Content moderation pipeline (post-transcode, pre-publish) scans thumbnails and sample segments for policy violations. DRM integration via Widevine/FairPlay for premium content (encrypted HLS segments). Rate limiting at API Gateway: 10 uploads/min per user, 300K total RPS. Geo-blocking via CloudFront geo-restriction for content licensing compliance.
Blue/green deployment for UploadService and ManifestService via ALB target group switching. TranscodeWorker: rolling replacement with Kafka consumer rebalancing. Cassandra schema changes via CQL with additive-only policy (never drop columns in production). CDN: CloudFront configuration propagation to all edge locations in 5-15 minutes. Cross-region: deploy to us-east-1 first, validate, then deploy to eu-west-1. Canary deployment: route 5% of API traffic to new version, monitor error rates for 30 minutes, then promote to 100%.
| Variant | Tier | Latency | Throughput | Cost | Complexity | Reliability |
|---|---|---|---|---|---|---|
| V0: Naive (Single Bitrate, No CDN) | T1 | 15-45 min upload, 50-200ms playback redirect | ~10K RPS (limited by 40 threads) | $146K/month at 1K viewers (S3 egress dominated) | Low | 99% (single DB, no CDN) |
| V1: CDN + Async Transcode (HLS + CloudFront) | T2 | ~2s upload init, 5-30 min async transcode, <1s playback | 100K RPS peak | $22K/month at 10M viewers (CDN egress) | Medium | 99.9% (multi-AZ, CDN) |
| V2: Adaptive Multi-Region (HLS + Edge + Cassandra) | T3 | <1s upload init, 3-20 min GPU transcode, <800ms playback | 200K RPS peak | $266K/month at 100M viewers (multi-region CDN + GPU) | Very High | 99.99% (multi-region, origin failover) |
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
PostgreSQL provides strong consistency but struggles with multi-region replication. Logical replication across regions introduces 50-200ms lag and requires complex conflict resolution for concurrent writes. Cassandra provides native multi-datacenter replication with tunable consistency. LOCAL_QUORUM reads serve from the nearest datacenter in 2-5ms — no cross-region hop. For video metadata (title, view count, manifest URL), eventual consistency with 1-3 second staleness is acceptable. Cassandra also handles the write volume better: counter columns for view counts avoid the read-modify-write pattern that PostgreSQL requires for atomic increments at 100K+ writes/sec.
CloudFront origin groups define a primary and failover origin. When a viewer requests a segment that is not in the edge cache, CloudFront routes the request to the primary origin (us-east-1 S3). If the primary returns a 5xx error or times out (configurable threshold, typically 3 seconds), CloudFront retries the request against the failover origin (eu-west-1 S3). Since HLS segments are immutable and cross-region replicated, the failover origin serves the identical content. The failover adds 80ms of cross-region latency for the affected requests but prevents a total outage. At 96% cache hit rate, only 4% of traffic ever reaches the origin, so the failover latency impact is minimal.
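The failover decision described above can be modeled in a few lines. This is a simplified sketch of the behavior, not CloudFront's implementation: origins are represented as plain callables returning a status code and body, `fetch_with_failover` is a hypothetical name, and a Python `TimeoutError` stands in for an origin timeout.

```python
from typing import Callable, Tuple

Origin = Callable[[str], Tuple[int, bytes]]


def fetch_with_failover(path: str, primary: Origin,
                        failover: Origin) -> Tuple[int, bytes, str]:
    """Try the primary origin; on a 5xx or timeout, retry against the failover."""
    try:
        status, body = primary(path)
        if status < 500:
            return status, body, "primary"
    except TimeoutError:
        pass  # treat a timeout like a 5xx and fall through to the failover
    status, body = failover(path)
    return status, body, "failover"


def healthy(path: str) -> Tuple[int, bytes]:
    return 200, b"segment-bytes"


def degraded(path: str) -> Tuple[int, bytes]:
    return 503, b""


print(fetch_with_failover("/hls/vid-123/1080p/seg0.ts", degraded, healthy)[2])  # failover
```

Because the segments are immutable and replicated, either origin returns the same bytes for a given path, which is what makes the retry safe.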
Upload and playback traffic have fundamentally different profiles. Uploads are bursty (creator events, product launches), CPU-intensive (presigned URL signing, multipart coordination), and write-heavy (Cassandra writes, Kafka publishes). Playback is steady (diurnal pattern), read-intensive (cache lookups, metadata reads), and latency-sensitive (viewers expect sub-second responses). Sharing infrastructure means an upload surge degrades viewer experience. Separate paths with independent load balancers and service pools allow each to scale according to its own traffic pattern. UploadService runs 8 pods sized for write operations; ManifestService runs 20 pods optimized for read throughput.
S3 cross-region replication transfers every object written to us-east-1 to eu-west-1. At 150TB/day of new transcoded content: cross-region transfer cost = 150,000 GB x $0.02/GB = $3,000/day = ~$90,000/month. Storage cost doubles (storing segments in two regions). Total additional cost for multi-region: approximately $180,000/month. This is justified by the reliability improvement: a regional S3 outage without failover would cause 5% of total traffic to fail for the outage duration (typically 1-4 hours). At 100M viewers, 5% = 5M viewers experiencing failures. The revenue impact of a 4-hour outage for 5M viewers far exceeds $180K/month.
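The transfer-cost arithmetic above can be checked with a few lines. The constant names are hypothetical; the volume, the per-GB rate, and the 30-day month are the figures used in the text.

```python
DAILY_NEW_CONTENT_GB = 150_000   # 150 TB/day of new transcoded content
TRANSFER_USD_PER_GB = 0.02       # cross-region transfer rate quoted in the text
DAYS_PER_MONTH = 30

daily_transfer_usd = DAILY_NEW_CONTENT_GB * TRANSFER_USD_PER_GB
monthly_transfer_usd = daily_transfer_usd * DAYS_PER_MONTH

print(f"${daily_transfer_usd:,.0f}/day, ${monthly_transfer_usd:,.0f}/month")
```

This covers the transfer charge only; the second copy of the stored segments is billed separately as storage in eu-west-1.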
CPU transcoding (c7g.4xlarge, 16 vCPU): 1-hour 4K video takes approximately 60 minutes. Cost: $0.576/hour x 1 hour = $0.576 per video. GPU transcoding (g5.2xlarge, NVIDIA A10G): same video takes approximately 15 minutes. Cost: $1.212/hour x 0.25 hours = $0.303 per video. GPU is 47% cheaper per video and 4x faster. The speed advantage compounds: 16 GPU workers deliver the same throughput as 64 CPU workers, a 4x smaller fleet to manage. The trade-off is that GPU instances have less scheduling flexibility (fewer instance types, fewer AZs with GPU capacity) and require CUDA/NVENC configuration.
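The per-video comparison works out as follows. The hourly prices and durations are the ones quoted above; the function name is hypothetical.

```python
def cost_per_video(hourly_usd: float, hours: float) -> float:
    """Instance cost attributable to one video at the given runtime."""
    return hourly_usd * hours


cpu_cost = cost_per_video(0.576, 1.0)    # c7g.4xlarge, 60 minutes
gpu_cost = cost_per_video(1.212, 0.25)   # g5.2xlarge, 15 minutes
savings = 1 - gpu_cost / cpu_cost

print(f"CPU ${cpu_cost:.3f}, GPU ${gpu_cost:.3f}, GPU saves {savings:.0%}")
# CPU $0.576, GPU $0.303, GPU saves 47%
```

The GPU instance costs about 2.1x more per hour, so it stays cheaper per video as long as it is more than 2.1x faster.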