Build a video delivery platform with adaptive bitrate streaming, CDN edge caching, transcoding pipelines, and content recommendation.
Video streaming is one of the most resource-intensive system design problems because it combines large-scale data processing (transcoding), global content distribution (CDN), and real-time adaptive delivery (ABR streaming). Building a platform like Netflix or YouTube requires designing a system that can ingest, process, store, and deliver video content to millions of concurrent viewers with minimal buffering and latency across diverse network conditions and device capabilities.
At Netflix's scale, the platform serves over 200 million subscribers streaming an average of 2 hours per day and, by some industry estimates, accounts for roughly 15% of global downstream internet traffic. Each piece of content is transcoded into hundreds of renditions: multiple resolutions (480p, 720p, 1080p, 4K), multiple bitrates per resolution, and multiple codec formats (H.264, H.265, VP9, AV1). A single 2-hour movie might require 1TB of storage across all its renditions, and the transcoding pipeline must process new content within hours of ingestion.
The delivery challenge is equally formidable. Viewers expect instant playback start (under 2 seconds), zero buffering during playback, and seamless quality adaptation as network conditions change. This requires a global CDN with edge servers positioned close to viewers, intelligent client-side ABR algorithms that predict bandwidth and switch quality levels proactively, and origin servers that can handle cache misses without impacting the viewing experience.
This template models the end-to-end video platform: content ingestion service, transcoding pipeline with parallel processing, content delivery network with edge caching, ABR streaming server, recommendation engine, and analytics pipeline. The simulation shows how CDN cache hit rates affect origin server load, how transcoding parallelism reduces processing time, and how ABR algorithms respond to simulated network degradation.
The video streaming architecture is divided into two major subsystems: the content preparation pipeline (offline) and the content delivery system (real-time). On the preparation side, the Content Ingestion Service receives raw video files (often in ProRes or MXF format at 50+ Mbps), validates them, and submits transcoding jobs to the Transcoding Pipeline. This pipeline is the most compute-intensive component: it splits each video into segments (typically 2-6 seconds), distributes segments across a fleet of GPU-equipped workers for parallel transcoding, and produces the full rendition ladder (multiple resolution/bitrate/codec combinations).
Transcoded segments are stored in Object Storage (S3) organized by content ID, rendition, and segment number. A manifest file (HLS .m3u8 or DASH .mpd) is generated for each piece of content, listing all available renditions and their segment URLs. This manifest is the entry point for client playback.
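As a rough illustration of the manifest's role, the sketch below builds a minimal HLS master playlist that lists one variant per rendition. The `Rendition` class, the bitrates, and the `<rendition>/playlist.m3u8` path layout are assumptions made for this example; only the `#EXTM3U`, `#EXT-X-VERSION`, and `#EXT-X-STREAM-INF` tags come from the HLS format itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str        # e.g. "1080p"; reused here as the storage path component
    bandwidth: int   # peak bitrate in bits per second (illustrative values)
    width: int
    height: int


def build_master_playlist(renditions: list[Rendition]) -> str:
    """Return a master playlist with one variant stream entry per rendition."""
    lines = ["#EXTM3U", "#EXT-X-VERSION:3"]
    # Sort by bitrate so the output is deterministic (lowest bitrate first).
    for r in sorted(renditions, key=lambda r: r.bandwidth):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth},RESOLUTION={r.width}x{r.height}"
        )
        lines.append(f"{r.name}/playlist.m3u8")  # hypothetical relative path
    return "\n".join(lines) + "\n"


ladder = [
    Rendition("720p", 3_000_000, 1280, 720),
    Rendition("1080p", 6_000_000, 1920, 1080),
    Rendition("360p", 800_000, 640, 360),
]
playlist = build_master_playlist(ladder)
print(playlist)
```

Each variant entry points at a per-rendition media playlist, which in turn would list the segment URLs. That second level is omitted here.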
On the delivery side, the CDN (Content Delivery Network) is the critical component. Edge servers distributed across hundreds of points of presence (PoPs) worldwide cache video segments close to viewers. When a viewer starts playback, their player fetches the manifest from the origin, then requests segments from the nearest edge server. If the segment is cached (cache hit), it is served directly with single-digit millisecond latency. On a cache miss, the edge server fetches from a regional mid-tier cache or the origin, caches the segment locally, and serves it to the viewer.
The ABR (Adaptive Bitrate) algorithm runs on the client (player). It monitors the download speed of each segment and the playback buffer level, then selects the highest quality rendition that can be downloaded faster than real-time playback. When network conditions degrade, the player steps down to a lower bitrate to avoid buffering. When conditions improve, it steps up for better quality. Modern ABR algorithms use buffer-based approaches (BBA) combined with bandwidth estimation for more stable quality switching.
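A simplified sketch of the buffer-based idea combined with a bandwidth cap might look like the following. The reservoir, cushion, and safety-factor values are assumptions chosen for illustration, and the linear mapping is a reduced form of the buffer-based approach rather than any particular player's implementation.

```python
def select_bitrate(ladder_bps, buffer_s, est_bandwidth_bps,
                   reservoir_s=5.0, cushion_s=20.0, safety=0.8):
    """Pick a bitrate from the ladder using buffer level, capped by bandwidth.

    Below the reservoir the lowest bitrate is chosen; above reservoir + cushion
    the highest is allowed; in between the target rises linearly with buffer.
    """
    rates = sorted(ladder_bps)
    lowest, highest = rates[0], rates[-1]

    if buffer_s <= reservoir_s:
        buffer_target = lowest
    elif buffer_s >= reservoir_s + cushion_s:
        buffer_target = highest
    else:
        fraction = (buffer_s - reservoir_s) / cushion_s
        buffer_target = lowest + fraction * (highest - lowest)

    # Never exceed a safety fraction of the estimated bandwidth.
    cap = min(buffer_target, est_bandwidth_bps * safety)
    candidates = [r for r in rates if r <= cap]
    # If even the lowest rendition exceeds the cap, fall back to it anyway.
    return candidates[-1] if candidates else lowest


ladder = [800_000, 3_000_000, 6_000_000]
print(select_bitrate(ladder, buffer_s=2, est_bandwidth_bps=10_000_000))   # nearly empty buffer
print(select_bitrate(ladder, buffer_s=30, est_bandwidth_bps=10_000_000))  # full buffer, fast link
print(select_bitrate(ladder, buffer_s=30, est_bandwidth_bps=4_000_000))   # full buffer, slower link
```

Driving the choice from the buffer makes switching less sensitive to noisy per-segment throughput samples, while the bandwidth cap keeps the player from choosing a rendition the link clearly cannot sustain.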
The Recommendation Engine analyzes viewing history, content metadata, and collaborative filtering signals to personalize the content catalog for each user. Recommendations are pre-computed in batch (offline, every few hours) and served from a fast key-value store, with real-time adjustments based on the current session's viewing behavior.
The video streaming system operates in two distinct phases: an offline ingestion/transcoding pipeline that prepares content, and a real-time playback path that delivers video segments to viewers. The CDN is the hero of the read path — at scale, 95%+ of segment requests are served from edge caches with single-digit millisecond latency, meaning the origin servers handle only the long tail of unpopular content.
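To make the effect of the edge hit rate concrete, here is a back-of-the-envelope calculation of origin load. It assumes each of 1M concurrent streams requests one 4-second segment every 4 seconds. The stream count and segment length are illustrative inputs, and `origin_rps` is a hypothetical helper.

```python
def origin_rps(segment_rps, edge_hit_rate, upstream_hit_rate=0.0):
    """Requests per second reaching the origin after the edge and an optional upstream cache."""
    for rate in (edge_hit_rate, upstream_hit_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    edge_misses = segment_rps * (1.0 - edge_hit_rate)
    return edge_misses * (1.0 - upstream_hit_rate)


CONCURRENT_STREAMS = 1_000_000  # assumed load
SEGMENT_SECONDS = 4             # assumed segment duration
segment_rps = CONCURRENT_STREAMS / SEGMENT_SECONDS  # 250,000 segment requests/s

for hit_rate in (0.90, 0.95, 0.99):
    load = origin_rps(segment_rps, hit_rate)
    print(f"edge hit rate {hit_rate:.0%}: ~{load:,.0f} origin req/s")
```

Under these assumptions, moving from a 95% to a 99% edge hit rate cuts origin traffic by a factor of five (roughly 12,500 to 2,500 requests per second), which is why small hit-rate gains matter so much at this scale.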
The transcoding pipeline is the most compute-intensive component. A single 4K source produces a full rendition ladder (2160p, 1440p, 1080p, 720p, 480p, 360p), each at multiple bitrates. The video is split into 2-6 second segments (the fundamental unit of adaptive streaming), and each segment is encoded independently across a fleet of GPU workers. With 4-second segments, a 2-hour movie yields 1,800 segments per rendition, or 10,800 encoded segments across 6 renditions.
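The number of encoded segments follows directly from the runtime, the segment duration, and the number of renditions. The small helper below (a hypothetical function, not part of any library) shows how the total changes across the 2-6 second range for a 2-hour title with 6 renditions.

```python
import math


def segment_count(duration_s: int, segment_s: int, renditions: int) -> tuple[int, int]:
    """Return (segments per rendition, total encoded segments)."""
    per_rendition = math.ceil(duration_s / segment_s)
    return per_rendition, per_rendition * renditions


MOVIE_SECONDS = 2 * 60 * 60  # 2-hour movie

for seg in (2, 4, 6):
    per_rendition, total = segment_count(MOVIE_SECONDS, seg, renditions=6)
    print(f"{seg}s segments: {per_rendition:,} per rendition, {total:,} total")
```

Shorter segments mean more encode jobs and more objects to store and cache, in exchange for faster startup and finer-grained quality switching.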
Adaptive Bitrate (ABR) streaming is what makes the viewing experience smooth. The client-side ABR algorithm continuously monitors download speed and buffer level. If the network degrades, the player seamlessly steps down to a lower rendition (e.g., 1080p → 720p) without interrupting playback. If conditions improve, it steps back up. The manifest file (.m3u8 for HLS, .mpd for DASH) tells the player where to find each segment at each quality level.
Pseudocode
// Transcoding Pipeline: parallel segment encoding
// Helpers such as ffmpegSplit, gpuWorkerPool, encodeSegment, bitrateMap, s3,
// and generateHLSManifest are assumed to exist elsewhere.
async function transcodeVideo(contentId, rawVideoPath):
    // 1. Split into 4-second segments
    segments = await ffmpegSplit(rawVideoPath, segmentDuration: 4)
    // e.g., 2-hour movie → 1,800 segments

    // 2. Encode each segment at every rendition (parallel)
    renditions = ["2160p", "1080p", "720p", "480p", "360p", "240p"]
    jobs = []
    for segment in segments:
        for rendition in renditions:
            jobs.push(gpuWorkerPool.submit(
                encodeSegment, segment, rendition, bitrateMap[rendition]
            ))
    // Each job resolves to { rendition, index, output }
    results = await Promise.all(jobs)  // 1,800 × 6 = 10,800 encode jobs

    // 3. Store in S3 (iterate the resolved results, not the pending jobs)
    for result in results:
        await s3.putObject(
            `${contentId}/${result.rendition}/segment_${result.index}.ts`,
            result.output
        )

    // 4. Generate manifest
    manifest = generateHLSManifest(contentId, renditions, segments.length)
    await s3.putObject(`${contentId}/master.m3u8`, manifest)

// Client-side ABR: adaptive quality selection
function selectRendition(manifest, bandwidth, bufferLevel):
    // Sort renditions by bitrate descending
    renditions = manifest.renditions.sortBy(r => -r.bandwidth)
    // Select highest rendition that fits within 80% of available bandwidth
    for rendition in renditions:
        if rendition.bandwidth < bandwidth * 0.8:
            // Buffer-based safety: if buffer < 10s, step down one more
            if bufferLevel < 10 && rendition !== renditions.last:
                return renditions[renditions.indexOf(rendition) + 1]
            return rendition
    return renditions.last  // Lowest quality as fallback
Choice
Segment-parallel transcoding with per-title encoding
Rationale
Splitting video into segments and transcoding them in parallel across a GPU fleet reduces wall-clock processing time from hours to minutes. Per-title encoding optimizes the bitrate ladder for each piece of content — an animated movie compresses more efficiently than a live-action sports event at the same perceptual quality — reducing storage costs by 20-50% while maintaining visual quality.
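The "hours to minutes" claim can be checked with a rough model: if every encode job takes about the same time, wall-clock time is the number of scheduling waves multiplied by the per-job time. The 20-second job time and the 500-worker fleet below are assumptions for illustration, and the model ignores splitting, upload, and scheduling overhead.

```python
import math


def wall_clock_seconds(num_jobs: int, workers: int, seconds_per_job: float) -> float:
    """Estimate total time when equal-length jobs run in waves across a worker pool."""
    waves = math.ceil(num_jobs / workers)
    return waves * seconds_per_job


JOBS = 1_800 * 6        # segments × renditions for a 2-hour movie at 4s segments
SECONDS_PER_JOB = 20.0  # assumed average encode time per segment

serial = wall_clock_seconds(JOBS, workers=1, seconds_per_job=SECONDS_PER_JOB)
parallel = wall_clock_seconds(JOBS, workers=500, seconds_per_job=SECONDS_PER_JOB)

print(f"1 worker:    {serial / 3600:.1f} hours")
print(f"500 workers: {parallel / 60:.1f} minutes")
```

With these inputs a single worker needs about 60 hours, while 500 workers finish in under 8 minutes. Real pipelines see less than ideal speedup because job times vary and the slowest segment gates completion.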
Choice
Three-tier caching: edge PoP, regional mid-tier, origin shield
Rationale
A single cache tier results in frequent origin hits for less popular content (the long tail). Three tiers progressively aggregate demand: edge PoPs serve the most popular segments, regional mid-tiers handle moderate demand, and an origin shield protects the origin from thundering herd effects. This architecture achieves 95%+ cache hit rates at the edge for popular content while limiting origin load to a few percent of total traffic.
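The tiered lookup can be sketched with small LRU caches standing in for each layer. Everything here (the `LRUCache` class, `fetch_segment`, the tier names, the capacities) is a toy model written for this example, not a real CDN API. It only shows how a miss walks outward and how upstream tiers absorb repeat requests from different edges.

```python
from collections import OrderedDict


class LRUCache:
    """Minimal least-recently-used cache."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def fetch_segment(key, tiers, origin_fetch):
    """Look up key in tiers (nearest first); fill every tier that missed.

    Returns (data, name of the layer that served the request).
    """
    missed = []
    for name, cache in tiers:
        data = cache.get(key)
        if data is not None:
            served_by = name
            break
        missed.append(cache)
    else:  # no tier had the segment
        data = origin_fetch(key)
        served_by = "origin"
    for cache in missed:
        cache.put(key, data)
    return data, served_by


origin_calls = []


def origin_fetch(key):
    origin_calls.append(key)
    return f"bytes-of-{key}".encode()


mid_tier, shield = LRUCache(100), LRUCache(1000)
tokyo = [("edge", LRUCache(2)), ("mid-tier", mid_tier), ("origin-shield", shield)]
osaka = [("edge", LRUCache(2)), ("mid-tier", mid_tier), ("origin-shield", shield)]

key = "movie/1080p/segment_0"
first = fetch_segment(key, tokyo, origin_fetch)[1]   # cold: goes to origin
second = fetch_segment(key, tokyo, origin_fetch)[1]  # same edge: edge hit
third = fetch_segment(key, osaka, origin_fetch)[1]   # other edge: mid-tier hit
print(first, second, third, len(origin_calls))
```

The second edge never reaches the origin because the shared mid-tier already holds the segment, which is the demand aggregation the rationale describes.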
Choice
HLS with CMAF segments for broad device compatibility
Rationale
HLS (HTTP Live Streaming) is supported by virtually all devices and browsers. CMAF (Common Media Application Format) segments are compatible with both HLS and DASH manifests, enabling a single set of encoded segments to serve all clients. HTTP-based delivery leverages existing CDN infrastructure without requiring specialized streaming servers.
Choice
Hybrid collaborative filtering + content-based with batch pre-computation
Rationale
Collaborative filtering captures taste patterns across users (people who watched X also watched Y). Content-based filtering matches based on genre, cast, and metadata. The hybrid approach covers both scenarios: collaborative filtering works well for users with viewing history, while content-based handles the cold-start problem for new users. Batch pre-computation avoids expensive model inference on every page load.
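One simple way to combine the two signals is a weighted blend whose weight grows with the amount of viewing history. That way new users are ranked purely by content similarity. The linear weighting and the `full_trust_after` threshold below are assumptions for illustration, since the source does not specify how the signals are merged.

```python
def hybrid_ranking(collab_scores, content_scores, history_size, full_trust_after=20):
    """Rank titles by a blend of collaborative and content-based scores.

    Both inputs map title -> score in [0, 1]. With no history the ranking is
    purely content-based; after `full_trust_after` views it is purely collaborative.
    """
    weight = min(history_size / full_trust_after, 1.0)
    titles = set(collab_scores) | set(content_scores)
    blended = {
        t: weight * collab_scores.get(t, 0.0) + (1.0 - weight) * content_scores.get(t, 0.0)
        for t in titles
    }
    # Highest score first; break ties alphabetically for determinism.
    return sorted(blended, key=lambda t: (-blended[t], t))


collab = {"A": 0.9, "B": 0.1}
content = {"A": 0.2, "B": 0.8, "C": 0.5}

print(hybrid_ranking(collab, content, history_size=0))   # cold start
print(hybrid_ranking(collab, content, history_size=40))  # established user
```

In a batch setup, a ranking like this would be computed offline per user and written to the key-value store, so page loads only perform a lookup.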
Target scale: 1M concurrent streams
Latency (p99): <2s (playback start)
Storage: ~100 PB (all renditions)
Availability: 99.99%
This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
Adaptive bitrate (ABR) streaming encodes each video in multiple quality levels (renditions). The client player monitors download speed and buffer levels, then dynamically selects the highest quality rendition that can be downloaded faster than real-time playback. When bandwidth drops (e.g., switching from Wi-Fi to cellular), the player requests lower-quality segments to avoid buffering. When bandwidth recovers, it requests higher-quality segments for a better viewing experience.
Netflix uses per-title encoding where the bitrate ladder is optimized for each piece of content. An animated movie compresses much more efficiently than a live-action sports event. Additionally, different codecs (H.264, H.265, VP9, AV1) have different device compatibility and compression efficiency. Multiplying content-specific bitrate ladders by supported codecs produces hundreds of renditions per title, totaling roughly 1TB of storage per 2-hour movie.
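Storage per title is roughly the sum of rendition bitrates multiplied by the runtime. The sketch below applies that to a small, made-up six-step ladder, then works backward from the 1TB figure to the summed bitrate it implies. The ladder values are assumptions, and TB here means 10^12 bytes.

```python
def storage_bytes(bitrates_bps, duration_s):
    """Approximate bytes needed to store one title at every listed bitrate."""
    return sum(bitrates_bps) * duration_s // 8


MOVIE_SECONDS = 2 * 60 * 60
# Illustrative single-codec ladder (bits per second), not a real encoding profile.
ladder = [16_000_000, 8_000_000, 5_000_000, 3_000_000, 1_500_000, 800_000]

one_codec = storage_bytes(ladder, MOVIE_SECONDS)
print(f"6 renditions, one codec: {one_codec / 1e9:.1f} GB")

# Summed bitrate across all renditions that would add up to 1 TB for this runtime.
implied_total_bps = 1e12 * 8 / MOVIE_SECONDS
print(f"1 TB implies ~{implied_total_bps / 1e9:.2f} Gbps of summed bitrate")
```

A six-rendition ladder lands around 31 GB, so reaching about 1TB takes on the order of a couple of hundred renditions averaging several Mbps each. That is consistent with multiplying per-title ladders across several codecs.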
A CDN caches video segments on edge servers distributed across hundreds of geographic locations (Points of Presence). When a viewer in Tokyo requests a video, the segment is served from a nearby Tokyo PoP rather than from the origin server in the US. This reduces round-trip time from hundreds of milliseconds to single-digit milliseconds. The CDN also absorbs traffic spikes (e.g., a new popular show release) that would overwhelm the origin.
Netflix minimizes startup latency through several techniques: (1) Starting playback at a low bitrate to fill the initial buffer quickly, then ramping up quality. (2) Pre-fetching the first few segments of likely-to-be-watched content while the user is browsing. (3) Using short segment durations (2 seconds) so the player can start after downloading just one segment. (4) Edge caching of popular content ensures the first segments are served from nearby servers with minimal latency.
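A rough model shows why starting at a low bitrate with short segments helps. Time to first frame is approximated as a couple of network round trips plus the time to download one segment. The two-round-trip assumption (manifest, then first segment), the 50 ms RTT, and the 5 Mbps link are illustrative inputs. Connection setup and decode time are ignored.

```python
def startup_seconds(bitrate_bps, segment_s, bandwidth_bps, rtt_s, round_trips=2):
    """Estimate time until the first segment is fully downloaded."""
    download_s = bitrate_bps * segment_s / bandwidth_bps
    return round_trips * rtt_s + download_s


BANDWIDTH = 5_000_000  # assumed viewer link: 5 Mbps
RTT = 0.05             # assumed round-trip time to the edge: 50 ms

low_start = startup_seconds(800_000, 2, BANDWIDTH, RTT)     # low bitrate, 2s segment
high_start = startup_seconds(6_000_000, 6, BANDWIDTH, RTT)  # high bitrate, 6s segment

print(f"800 kbps, 2s segment: {low_start:.2f} s")
print(f"6 Mbps, 6s segment:   {high_start:.2f} s")
```

Under these assumptions the low-bitrate, short-segment start comes in well under the 2-second target, while beginning at 6 Mbps with 6-second segments would take over 7 seconds on the same link.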
HLS (HTTP Live Streaming) was developed by Apple and uses .m3u8 playlists with MPEG-TS (.ts) or fragmented MP4 (fMP4) segments. DASH (Dynamic Adaptive Streaming over HTTP) is an open standard that uses .mpd manifests, typically with fragmented MP4 segments. Both protocols split video into chunks and support adaptive bitrate switching. HLS has broader device support (especially on Apple devices), while DASH offers more flexibility in segment formats. CMAF (Common Media Application Format) unifies segment formats so that a single set of encoded segments can serve both protocols.