Hard · 7 components · Interview: Medium

Video Streaming (Netflix/YouTube)

Q: How does adaptive bitrate streaming work?

Adaptive bitrate (ABR) streaming encodes each video in multiple quality levels (renditions). The client player monitors download speed and buffer levels, then dynamically selects the highest quality rendition that can be downloaded faster than real-time playback. When bandwidth drops (e.g., switching from Wi-Fi to cellular), the player requests lower-quality segments to avoid buffering. When bandwidth recovers, it requests higher-quality segments for a better viewing experience.

Q: Why does Netflix encode each video in hundreds of renditions?

Netflix uses per-title encoding where the bitrate ladder is optimized for each piece of content. An animated movie compresses much more efficiently than a live-action sports event. Additionally, different codecs (H.264, H.265, VP9, AV1) have different device compatibility and compression efficiency. Multiplying content-specific bitrate ladders by supported codecs produces hundreds of renditions per title, totaling roughly 1TB of storage per 2-hour movie.

Q: How does a CDN reduce video streaming latency?

A CDN caches video segments on edge servers distributed across hundreds of geographic locations (Points of Presence). When a viewer in Tokyo requests a video, the segment is served from a nearby Tokyo PoP rather than from the origin server in the US. This reduces round-trip time from hundreds of milliseconds to single-digit milliseconds. The CDN also absorbs traffic spikes (e.g., a new popular show release) that would overwhelm the origin.

Q: How does Netflix handle the initial buffering delay?

Netflix minimizes startup latency through several techniques: (1) Starting playback at a low bitrate to fill the initial buffer quickly, then ramping up quality. (2) Pre-fetching the first few segments of likely-to-be-watched content while the user is browsing. (3) Using short segment durations (2 seconds) so the player can start after downloading just one segment. (4) Edge caching of popular content ensures the first segments are served from nearby servers with minimal latency.
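
To see why a low starting bitrate and short segments shorten startup, the time to fetch the first segment can be estimated roughly as below. This is a minimal sketch: the function name, the single 50 ms round trip, and the 10 Mbps connection are illustrative assumptions, not Netflix's actual model.

```python
def startup_delay_s(segment_s: float, bitrate_mbps: float,
                    bandwidth_mbps: float, rtt_s: float = 0.05) -> float:
    """Rough time to fetch the first segment: one round trip plus transfer time.

    Hypothetical model: ignores manifest fetch, TLS setup, and decoder startup.
    """
    segment_megabits = segment_s * bitrate_mbps
    return rtt_s + segment_megabits / bandwidth_mbps


# Assumed 10 Mbps connection, comparing a low and a high starting rendition
# for a 2-second first segment.
low = startup_delay_s(segment_s=2, bitrate_mbps=0.5, bandwidth_mbps=10)
high = startup_delay_s(segment_s=2, bitrate_mbps=15, bandwidth_mbps=10)
print(f"0.5 Mbps start: {low:.2f}s, 15 Mbps start: {high:.2f}s")
```

Under these assumptions the low-bitrate start stays well under the 2-second target, while starting at the top rendition does not.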

Q: What is the difference between HLS and DASH streaming protocols?

HLS (HTTP Live Streaming) was developed by Apple and uses .m3u8 playlists (manifests) with MPEG-TS (.ts) or fragmented MP4 (fMP4) segments. DASH (Dynamic Adaptive Streaming over HTTP) is an open ISO standard that uses .mpd manifests, usually with fragmented MP4 segments (commonly .m4s or .mp4). Both protocols segment video into chunks and support adaptive bitrate switching. HLS has broader device support (especially Apple devices), while DASH offers more flexibility in segment formats. CMAF (Common Media Application Format) unifies segment formats so that a single set of encoded segments can serve both protocols.

Build a video delivery platform with adaptive bitrate streaming, CDN edge caching, transcoding pipelines, and content recommendation.

CDN · Streaming · Processing

Problem Statement

Video streaming is one of the most resource-intensive system design problems because it combines large-scale data processing (transcoding), global content distribution (CDN), and real-time adaptive delivery (ABR streaming). Building a platform like Netflix or YouTube requires designing a system that can ingest, process, store, and deliver video content to millions of concurrent viewers with minimal buffering and latency across diverse network conditions and device capabilities.

At Netflix's scale, the platform serves over 200 million subscribers streaming an average of 2 hours per day, consuming over 15% of global internet bandwidth during peak hours. Each piece of content is transcoded into hundreds of renditions: multiple resolutions (480p, 720p, 1080p, 4K), multiple bitrates per resolution, and multiple codec formats (H.264, H.265, VP9, AV1). A single 2-hour movie might require 1TB of storage across all its renditions, and the transcoding pipeline must process new content within hours of ingestion.

The delivery challenge is equally formidable. Viewers expect instant playback start (under 2 seconds), zero buffering during playback, and seamless quality adaptation as network conditions change. This requires a global CDN with edge servers positioned close to viewers, intelligent client-side ABR algorithms that predict bandwidth and switch quality levels proactively, and origin servers that can handle cache misses without impacting the viewing experience.

This template models the end-to-end video platform: content ingestion service, transcoding pipeline with parallel processing, content delivery network with edge caching, ABR streaming server, recommendation engine, and analytics pipeline. The simulation shows how CDN cache hit rates affect origin server load, how transcoding parallelism reduces processing time, and how ABR algorithms respond to simulated network degradation.

Architecture Overview

The video streaming architecture is divided into two major subsystems: the content preparation pipeline (offline) and the content delivery system (real-time). On the preparation side, the Content Ingestion Service receives raw video files (often in ProRes or MXF format at 50+ Mbps), validates them, and submits transcoding jobs to the Transcoding Pipeline. This pipeline is the most compute-intensive component: it splits each video into segments (typically 2-6 seconds), distributes segments across a fleet of GPU-equipped workers for parallel transcoding, and produces the full rendition ladder (multiple resolution/bitrate/codec combinations).

Transcoded segments are stored in Object Storage (S3) organized by content ID, rendition, and segment number. A manifest file (HLS .m3u8 or DASH .mpd) is generated for each piece of content, listing all available renditions and their segment URLs. This manifest is the entry point for client playback.
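
The storage layout and the manifest described above can be sketched as follows. The `#EXTM3U` and `#EXT-X-STREAM-INF` tags with `BANDWIDTH` and `RESOLUTION` attributes are standard HLS; the class, the function names, the `playlist.m3u8` file name, and the three example renditions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str           # e.g. "1080p"
    width: int
    height: int
    bandwidth_bps: int  # peak bitrate advertised to the player


def segment_key(content_id: str, rendition: str, index: int) -> str:
    """Deterministic object-storage key: content ID / rendition / segment number."""
    return f"{content_id}/{rendition}/segment_{index}.ts"


def master_playlist(renditions: list[Rendition]) -> str:
    """Build an HLS master playlist pointing at per-rendition playlists."""
    lines = ["#EXTM3U"]
    for r in sorted(renditions, key=lambda r: r.bandwidth_bps):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth_bps},"
            f"RESOLUTION={r.width}x{r.height}"
        )
        # Hypothetical naming: one media playlist per rendition directory.
        lines.append(f"{r.name}/playlist.m3u8")
    return "\n".join(lines) + "\n"


ladder = [
    Rendition("1080p", 1920, 1080, 5_000_000),
    Rendition("720p", 1280, 720, 2_500_000),
    Rendition("480p", 854, 480, 1_000_000),
]
playlist = master_playlist(ladder)
print(playlist)
print(segment_key("movie-123", "720p", 3))
```

Because the key is derived only from content ID, rendition, and segment number, the same string can serve as both the object-storage path and the CDN cache key.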

On the delivery side, the CDN is the critical component. Edge servers distributed across hundreds of points of presence (PoPs) worldwide cache video segments close to viewers. When a viewer starts playback, the player fetches the manifest through the CDN, then requests segments from the nearest edge server. If the segment is cached (cache hit), it is served directly with single-digit millisecond latency. On a cache miss, the edge server fetches from a regional mid-tier cache or the origin, caches the segment locally, and serves it to the viewer.

The ABR (Adaptive Bitrate) algorithm runs on the client (player). It monitors the download speed of each segment and the playback buffer level, then selects the highest quality rendition that can be downloaded faster than real-time playback. When network conditions degrade, the player steps down to a lower bitrate to avoid buffering. When conditions improve, it steps up for better quality. Modern ABR algorithms use buffer-based approaches (BBA) combined with bandwidth estimation for more stable quality switching.
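
A buffer-based selection rule can be sketched as a mapping from buffer level to bitrate. This is a simplified illustration of the buffer-based idea, not a faithful reproduction of any published BBA variant: the linear mapping, the 10-second reservoir, and the 20-second cushion are assumptions, and the ladder reuses the six bitrates listed in this template's walkthrough.

```python
# Bitrates in Mbps, ascending (the six-rung ladder used in this template).
LADDER_MBPS = (0.3, 0.5, 1.0, 2.5, 5.0, 15.0)


def bba_select(buffer_s: float, reservoir_s: float = 10.0,
               cushion_s: float = 20.0,
               ladder: tuple[float, ...] = LADDER_MBPS) -> float:
    """Pick a bitrate from the buffer level alone.

    Below the reservoir: lowest rung. Above reservoir + cushion: highest rung.
    In between: interpolate a target rate and take the highest rung under it.
    """
    lowest, highest = ladder[0], ladder[-1]
    if buffer_s <= reservoir_s:
        return lowest
    if buffer_s >= reservoir_s + cushion_s:
        return highest
    fraction = (buffer_s - reservoir_s) / cushion_s
    target = lowest + fraction * (highest - lowest)
    return max(rate for rate in ladder if rate <= target)


for level in (5, 12, 20, 30):
    print(f"buffer {level:>2}s -> {bba_select(level)} Mbps")
```

In practice a player would combine this with a bandwidth estimate so that a full buffer on a slow link does not trigger a jump to a rendition the connection cannot sustain.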

The Recommendation Engine analyzes viewing history, content metadata, and collaborative filtering signals to personalize the content catalog for each user. Recommendations are pre-computed in batch (offline, every few hours) and served from a fast key-value store, with real-time adjustments based on the current session's viewing behavior.

Request Flow — Video Playback via CDN + ABR

The video streaming system operates in two distinct phases: an offline ingestion/transcoding pipeline that prepares content, and a real-time playback path that delivers video segments to viewers. The CDN is the hero of the read path — at scale, 95%+ of segment requests are served from edge caches with single-digit millisecond latency, meaning the origin servers handle only the long tail of unpopular content.
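
The effect of the edge hit rate on upstream load can be checked with simple arithmetic. The sketch below assumes the template's own figures (1M concurrent streams, 4-second segments, a 95% edge hit rate) and one segment request per stream per segment duration; the function name is hypothetical.

```python
def request_rates(concurrent_streams: int, segment_s: float,
                  edge_hit_rate: float) -> tuple[float, float]:
    """Return (total segment requests/s, requests/s that miss the edge)."""
    total_rps = concurrent_streams / segment_s
    beyond_edge_rps = total_rps * (1.0 - edge_hit_rate)
    return total_rps, beyond_edge_rps


total, beyond_edge = request_rates(1_000_000, 4.0, 0.95)
print(f"total: {total:,.0f} req/s, past the edge: {beyond_edge:,.0f} req/s")
```

Each additional percentage point of edge hit rate removes 2,500 requests per second from the mid-tier and origin in this scenario.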

The transcoding pipeline is the most compute-intensive component. A single 4K master produces a full rendition ladder: 2160p, 1080p, 720p, 480p, 360p, and 240p in the simplified six-rung ladder used in this template (production ladders add multiple bitrates per resolution). The video is split into 2-6 second segments (the fundamental unit of adaptive streaming), and each segment is encoded independently across a fleet of GPU workers. With 4-second segments, a 2-hour movie has 1,800 segments, so 6 renditions produce ~10,800 segment files.
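
The number of segment files depends directly on the segment duration. A quick calculation for a 2-hour title and a six-rung ladder, across the 2-6 second range quoted above (the function name is illustrative):

```python
import math


def segment_file_count(duration_s: int, segment_s: int, renditions: int) -> int:
    """Segments per rendition (rounded up) times the number of renditions."""
    return math.ceil(duration_s / segment_s) * renditions


two_hours_s = 2 * 60 * 60
for segment_s in (2, 4, 6):
    count = segment_file_count(two_hours_s, segment_s, renditions=6)
    print(f"{segment_s}s segments -> {count:,} segment files")
```

Shorter segments allow faster quality switches and quicker startup, at the cost of more objects to encode, store, and cache.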

Adaptive Bitrate (ABR) streaming is what makes the viewing experience smooth. The client-side ABR algorithm continuously monitors download speed and buffer level. If the network degrades, the player seamlessly steps down to a lower rendition (e.g., 1080p → 720p) without interrupting playback. If conditions improve, it steps back up. The manifest file (.m3u8 for HLS, .mpd for DASH) tells the player where to find each segment at each quality level.

Step-by-Step Walkthrough

1. Content creators upload raw video (typically a 4K master) to the Content Ingestion Service. The service validates the format, extracts metadata (duration, codec, resolution), and submits a transcoding job to the pipeline.
2. The Transcoding Pipeline splits the video into fixed-duration segments (2-6 seconds each, typically 4s). Each segment is the atomic unit of adaptive streaming: the player can switch quality between segments without visible artifacts.
3. GPU workers encode each segment into the full rendition ladder: 2160p@15Mbps, 1080p@5Mbps, 720p@2.5Mbps, 480p@1Mbps, 360p@0.5Mbps, 240p@0.3Mbps. Segments are processed in parallel across the GPU fleet for speed (~30-60 min for a 2-hour movie).
4. Transcoded segments are stored in Object Storage (S3) with a deterministic path structure: /{contentId}/{rendition}/segment_{N}.ts. This structure enables both CDN cache keying and direct origin fetches.
5. The Manifest Generator creates an HLS master playlist (.m3u8) listing all available renditions with their bandwidth and resolution metadata, plus per-rendition playlists mapping segment numbers to URLs. This manifest is the entry point for playback.
6. When a viewer presses play, the client fetches the manifest from the CDN and parses the available renditions. The ABR algorithm selects an initial quality based on the client's estimated bandwidth (measured from the manifest download itself).
7. Every 4 seconds, the client requests the next segment at the ABR-selected rendition. The CDN edge server serves 95% of requests from its local cache (~3ms). On a cache miss, the request falls through to a regional mid-tier cache (~20ms) or all the way to the S3 origin (~80ms).
8. The ABR algorithm continuously adapts: if download speed drops (measured per segment), it steps down to a lower rendition for the next segment. If the buffer level exceeds 30 seconds and bandwidth is stable, it steps up. Quality switches are seamless because each segment is independently decodable.
9. The Recommendation Engine runs separately. It pre-computes personalized content rankings using collaborative filtering, viewing history, and content metadata, served from a key-value cache (~15ms). Recommendations are refreshed in batch (hourly) and augmented with real-time signals (trending, new releases).
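
The latency figures in step 7 can be combined into an expected per-segment fetch latency. The walkthrough gives the edge hit rate and the three tier latencies; the share of edge misses absorbed by the mid-tier (80% here) is not stated and is purely an assumption of this sketch.

```python
def expected_latency_ms(edge_hit: float, mid_hit_given_miss: float,
                        edge_ms: float = 3.0, mid_ms: float = 20.0,
                        origin_ms: float = 80.0) -> float:
    """Weighted average latency across edge, mid-tier, and origin."""
    miss = 1.0 - edge_hit
    return (edge_hit * edge_ms
            + miss * mid_hit_given_miss * mid_ms
            + miss * (1.0 - mid_hit_given_miss) * origin_ms)


# 95% edge hits (from the walkthrough); 80% of misses served by the mid-tier (assumed).
average = expected_latency_ms(edge_hit=0.95, mid_hit_given_miss=0.8)
print(f"expected segment latency: {average:.2f} ms")
```

The average stays close to the edge latency because misses are rare; the tail latency seen by viewers of unpopular content is what the mid-tier and origin numbers govern.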

Pseudocode

// Transcoding Pipeline — parallel segment encoding
async function transcodeVideo(contentId, rawVideoPath):
    // 1. Split into segments
    segments = await ffmpegSplit(rawVideoPath, segmentDuration: 4)
    // e.g., 2-hour movie → 1,800 segments

    // 2. Encode each segment at every rendition (parallel)
    renditions = ["2160p", "1080p", "720p", "480p", "360p", "240p"]
    jobs = []
    for segment in segments:
        for rendition in renditions:
            jobs.push(gpuWorkerPool.submit(
                encodeSegment, segment, rendition, bitrateMap[rendition]
            ))
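    // Each resolved result carries its rendition, segment index, and encoded bytes
    results = await Promise.all(jobs)   // 1,800 × 6 = 10,800 encode jobs

    // 3. Store in S3
    for result in results:
        await s3.putObject(
            `${contentId}/${result.rendition}/segment_${result.index}.ts`,
            result.output
        )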

    // 4. Generate manifest
    manifest = generateHLSManifest(contentId, renditions, segments.length)
    await s3.putObject(`${contentId}/master.m3u8`, manifest)

// Client-side ABR — adaptive quality selection
function selectRendition(manifest, bandwidth, bufferLevel):
    // Sort renditions by bitrate descending
    renditions = manifest.renditions.sortBy(r => -r.bandwidth)

    // Select highest rendition that fits within 80% of available bandwidth
    for rendition in renditions:
        if rendition.bandwidth < bandwidth * 0.8:
            // Buffer-based safety: if buffer < 10s, step down one more
            if bufferLevel < 10 && rendition !== renditions.last:
                return renditions[renditions.indexOf(rendition) + 1]
            return rendition

    return renditions.last   // Lowest quality as fallback
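
The `bandwidth` argument above has to come from somewhere. One common choice is a harmonic mean over the last few segment downloads, which is pulled down by slow samples and therefore errs on the conservative side. The sketch below is an assumption about how a player might do this (the class name and the 5-sample window are hypothetical); it is not part of the pseudocode above.

```python
from collections import deque
from statistics import harmonic_mean


class ThroughputEstimator:
    """Sliding-window bandwidth estimate from per-segment download measurements."""

    def __init__(self, window: int = 5) -> None:
        # Only the most recent `window` samples are kept.
        self._samples_mbps = deque(maxlen=window)

    def record(self, segment_bytes: int, download_s: float) -> None:
        """Store the throughput of one finished segment download, in Mbps."""
        self._samples_mbps.append(segment_bytes * 8 / 1_000_000 / download_s)

    def estimate_mbps(self) -> float:
        """Harmonic mean of recent samples; 0.0 before the first download."""
        if not self._samples_mbps:
            return 0.0
        return harmonic_mean(list(self._samples_mbps))


estimator = ThroughputEstimator()
estimator.record(segment_bytes=1_250_000, download_s=1.0)  # 10 Mbps sample
estimator.record(segment_bytes=1_250_000, download_s=4.0)  # 2.5 Mbps sample
print(f"estimate: {estimator.estimate_mbps():.1f} Mbps")
```

With samples of 10 Mbps and 2.5 Mbps the estimate is 4 Mbps rather than the arithmetic mean of 6.25 Mbps, which makes an over-optimistic step up less likely.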

Key Design Decisions

Transcoding Architecture

Choice: Segment-parallel transcoding with per-title encoding

Rationale: Splitting video into segments and transcoding them in parallel across a GPU fleet reduces wall-clock processing time from hours to minutes. Per-title encoding optimizes the bitrate ladder for each piece of content (an animated movie compresses more efficiently than a live-action sports event at the same perceptual quality), reducing storage costs by 20-50% while maintaining visual quality.
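
The fan-out of segment and rendition pairs across a worker pool can be sketched with the standard library. The encoder here is a stand-in that returns placeholder bytes; a real pipeline would invoke an actual encoder on GPU workers, and the thread pool, function names, and worker count are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

RENDITIONS = ("2160p", "1080p", "720p", "480p", "360p", "240p")


def encode_segment(index: int, rendition: str) -> tuple[str, bytes]:
    """Stand-in encoder (hypothetical): returns a storage key and placeholder bytes."""
    return f"{rendition}/segment_{index}.ts", f"{rendition}:{index}".encode()


def transcode(segment_count: int, workers: int = 8) -> dict[str, bytes]:
    """Encode every (segment, rendition) pair in parallel and collect the outputs."""
    jobs = list(product(range(segment_count), RENDITIONS))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda job: encode_segment(*job), jobs)
        return dict(results)


outputs = transcode(segment_count=10)
print(len(outputs))  # 10 segments x 6 renditions = 60 encode jobs
```

Because every job is independent, wall-clock time scales roughly with the number of jobs divided by the number of workers, which is what makes segment-level parallelism attractive.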

CDN Cache Strategy

Choice: Three-tier caching: edge PoP, regional mid-tier, origin shield

Rationale: A single cache tier results in frequent origin hits for less popular content (the long tail). Three tiers progressively aggregate demand: edge PoPs serve the most popular segments, regional mid-tiers handle moderate demand, and an origin shield protects the origin from thundering herd effects. This architecture achieves 95%+ cache hit rates at the edge for popular content while limiting origin load to a few percent of total traffic.
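
A tiered lookup with fill-on-miss can be sketched with small LRU caches. This is a toy in-memory model under stated assumptions: real CDNs use far more elaborate admission and eviction policies, and the class, function, and tier names below are hypothetical.

```python
from collections import OrderedDict

TIER_NAMES = ("edge", "mid-tier", "origin-shield")


class LRUCache:
    """Minimal least-recently-used cache."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key: str):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # evict least recently used


def fetch(key: str, tiers: list[LRUCache], origin: dict[str, bytes]) -> tuple[bytes, str]:
    """Walk edge -> mid-tier -> shield; fill every tier that missed on the way back."""
    for depth, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            for missed in tiers[:depth]:
                missed.put(key, value)
            return value, TIER_NAMES[depth]
    value = origin[key]
    for tier in tiers:
        tier.put(key, value)
    return value, "origin"


origin = {"a": b"A", "b": b"B"}
tiers = [LRUCache(1), LRUCache(4), LRUCache(8)]
served_from = [fetch(k, tiers, origin)[1] for k in ("a", "a", "b", "a")]
print(served_from)
```

In this run the tiny edge cache evicts segment "a" when "b" arrives, so the final request for "a" is absorbed by the mid-tier instead of reaching the origin.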

Streaming Protocol

Choice: HLS with CMAF segments for broad device compatibility

Rationale: HLS (HTTP Live Streaming) plays on virtually all devices and browsers: natively on Apple platforms, and through JavaScript players built on Media Source Extensions elsewhere. CMAF (Common Media Application Format) segments are compatible with both HLS and DASH manifests, enabling a single set of encoded segments to serve all clients. HTTP-based delivery leverages existing CDN infrastructure without requiring specialized streaming servers.

Recommendation Algorithm

Choice: Hybrid collaborative filtering + content-based with batch pre-computation

Rationale: Collaborative filtering captures taste patterns across users (people who watched X also watched Y). Content-based filtering matches based on genre, cast, and metadata. The hybrid approach covers both scenarios: collaborative filtering works well for users with viewing history, while content-based handles the cold-start problem for new users. Batch pre-computation avoids expensive model inference on every page load.
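
One simple way to blend the two signals is to weight collaborative filtering by how much history a user has. The linear ramp and the 20-title threshold below are assumptions chosen for illustration; the template does not specify a blending rule.

```python
def hybrid_score(cf_score: float, content_score: float,
                 watch_count: int, ramp: int = 20) -> float:
    """Blend collaborative-filtering and content-based scores.

    The collaborative weight grows linearly with viewing history, so a new
    user (cold start) is ranked purely on content metadata.
    """
    cf_weight = min(watch_count, ramp) / ramp
    return cf_weight * cf_score + (1.0 - cf_weight) * content_score


for watched in (0, 10, 20):
    print(watched, round(hybrid_score(0.9, 0.4, watched), 2))
```

Because both input scores are pre-computed in batch, this blend is cheap enough to evaluate at request time.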

Scale & Performance

Target load: 1M concurrent streams
Latency (p99): <2s (playback start)
Storage: ~100 PB (all renditions)
Availability: 99.99%
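
These targets can be sanity-checked with back-of-the-envelope arithmetic. The 5 Mbps average bitrate (the 1080p rung) is an assumption; the ~1 TB per title figure is the one quoted earlier on this page, and decimal units (1 PB = 1,000 TB) are used.

```python
def peak_egress_tbps(concurrent_streams: int, avg_bitrate_mbps: float) -> float:
    """Aggregate delivery bandwidth in terabits per second."""
    return concurrent_streams * avg_bitrate_mbps / 1_000_000


def titles_storable(total_pb: float, tb_per_title: float) -> int:
    """How many titles fit in the storage budget (1 PB = 1,000 TB)."""
    return int(total_pb * 1000 / tb_per_title)


egress = peak_egress_tbps(1_000_000, avg_bitrate_mbps=5.0)  # assumed average bitrate
titles = titles_storable(total_pb=100, tb_per_title=1.0)
print(f"peak egress: {egress:.1f} Tbps, catalog capacity: {titles:,} titles")
```

Egress on this order is far beyond what a single origin region would normally be expected to serve, which is consistent with the emphasis on edge caching throughout this design.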

This template is for educational and illustration purposes only. It may not represent the optimal production design for this problem. Real-world systems involve additional considerations (compliance, specific cloud provider constraints, organizational requirements) not captured here. Use this as a starting point for discussion, not as a production blueprint.
