Build Netflix's Content Delivery Network


caching distributed-systems performance

System Design Deep Dive

Netflix Content Delivery Network

Predicting which petabytes to place at which edge before users ask - without poisoning the cache with content nobody watches

⏱ 14 min read📐 Advanced🏗️ Edge

Think of Netflix’s content library as a warehouse full of goods that 280 million customers want delivered instantly. Now imagine you must pre-position those goods across 200+ local depots worldwide - before you know exactly what each customer will order. You have limited shelf space at each depot, a catalog of 15,000+ titles (each weighing up to a terabyte in all its bitrate variants), and a hard rule: the goods must arrive within 50ms of the customer pressing “play.” A centralized warehouse would fail immediately - the speed of light from a datacenter in Virginia to a user in Mumbai cannot be compressed by any software trick. Proximity to the user is the only physics-compliant answer.

Netflix serves approximately 15 petabytes of video data per day, with peak traffic exceeding 600 gigabits per second in North America alone. That scale forces a counter-intuitive architectural choice: Netflix does not rely primarily on commercial CDNs like Akamai or Cloudflare. Instead, it built its own CDN called Open Connect, placing purpose-built appliances (OCAs - Open Connect Appliances) directly inside Internet Service Provider networks. An OCA sitting inside a Comcast datacenter needs zero hops over the public internet to reach a Comcast subscriber - the video bytes travel through the ISP’s own internal routing. Netflix pays the ISP with bandwidth savings rather than cash; the ISP saves on transit costs; the subscriber gets sub-millisecond latency to the serving node.

The hard problem is not the serving layer itself - flash storage is fast and HTTP/2 byte-range streaming is mature. The hard problem is the cache population decision: which of the 15,000+ titles do you preload into each PoP tonight, given that you only have 100TB of flash storage and the catalog is 1 exabyte? Get this wrong and users see buffering. Get it too conservative and cache hit rates drop below 95%, flooding origin servers. The decision must account for time-of-day viewing patterns, local content preferences (Korean dramas dominate Seoul’s PoP; Telenovelas dominate Mexico City’s), upcoming release windows, and content that is trending on social media before it spikes in requests.

We need to solve for three things simultaneously: cache placement (what goes where, decided offline each night by the Cache Advisor), cache eviction (what gets bumped when new content must enter a full cache), and cache miss handling (how to serve a request for content not yet at the local PoP without slamming origin). The routing and consistency model that glues these together is consistent hashing, ensuring that the same title maps to the same PoP nodes regardless of the number of appliances in a cluster - so a cache miss from one appliance doesn’t trigger a redundant fetch from another.

Requirements and Constraints

Functional Requirements

  • Serve video chunks (typically 2-10 second MPEG-TS or CMAF segments) with under 50ms time-to-first-byte at the edge
  • Prefill PoP caches nightly based on predicted popularity per geographic region
  • Handle cache miss by routing to a regional cluster (origin shield), not directly to S3 origin
  • Evict cache entries based on a combined recency and frequency score, not pure LRU or pure LFU
  • Invalidate stale cached content (e.g., after a content re-encode or takedown request) within 60 seconds across all PoPs
  • Support at least 200 concurrent open connections per appliance per title being streamed
  • Serve all bitrate variants of a title (240p through 4K HDR), totaling 10-60 GB per episode

Non-Functional Requirements

  • Cache hit rate: greater than 95% for top-1000 titles at any active PoP
  • Bandwidth: sustain 600 Gbps peak throughput across North America PoPs
  • Latency: p99 TTFB under 50ms from PoP to user device
  • Storage: each Edge PoP appliance holds 100TB (flash NVMe) for hot content; Regional Clusters hold 5PB (mixed HDD+SSD) for the warm tier
  • Availability: 99.99% video start success rate (VSS); a single PoP failure must not interrupt any active streams
  • Prefetch completion: nightly cache fill must complete before peak viewing hours (8pm-midnight local time)
  • Invalidation: cache purge propagation under 60 seconds to all affected PoPs globally

Constraints and Assumptions

  • We focus on long-form video delivery (episodes, movies); live streaming uses a separate pipeline
  • All video is pre-encoded; adaptive bitrate streaming (ABR) selection happens on the client device
  • Content catalog changes slowly relative to the prefetch cycle (daily additions, not real-time)
  • ISP-hosted PoPs have dedicated 40Gbps-100Gbps fiber uplinks to the ISP backbone
  • We assume DNS-based steering resolves the correct PoP for each user

High-Level Architecture

The system has five major components: the Steering Service (routes user requests to the optimal PoP), the Edge PoP Appliances (serve cached content to users), the Regional Clusters (origin shield tier, absorbs misses), the Cache Advisor (computes nightly fill decisions), and the S3 Origin (authoritative source of all encoded video).

Netflix Open Connect CDN architecture showing user devices, edge PoPs, regional clusters, and origin

When a user presses play, their Netflix app requests a manifest file from the API tier. The manifest contains segment URLs pointing at the CDN. The client’s first segment request goes to the Steering Service - a DNS-based redirector that knows the health, load, and geographic proximity of every active PoP. The Steering Service returns a redirect to the optimal PoP. If that PoP has the segment cached, it serves the bytes directly from NVMe flash. If not (a cache miss), the PoP checks its sibling appliances in the same cluster via a consistent hash lookup. If the segment is not in any appliance in the cluster, the request escalates to the Regional Cluster (origin shield), which has a broader 5PB cache. Only if the Regional Cluster also misses does a request reach S3 origin - an event that should happen fewer than 5% of the time.

The Cache Advisor runs as an offline batch job each afternoon. It ingests the previous 7 days of view logs, applies a popularity-forecasting model per PoP region, weights the predictions by content release schedule (a new season dropping tomorrow gets mandatory placement regardless of current popularity score), and outputs a ranked fill list for each active PoP. The fill list is pushed to the appliances as a download queue. Appliances pull their fill content from the Regional Cluster’s warm tier during off-peak hours (typically 2am-6am local time), avoiding contention with live user traffic.

The S3 Origin holds the master encoded files for every title in every bitrate variant. These are immutable once created; any re-encode creates a new object with a new URL. Content takedowns work by removing the file from origin and pushing an invalidation directive to all PoPs - the cached copy becomes unreachable because the Steering Service stops routing to its URL before the cached bytes are physically deleted.

Key Insight

The defining architectural decision in Open Connect is co-location with ISPs rather than independent PoP placement. By sitting inside the ISP’s network, Netflix eliminates internet transit costs entirely for the ISP, which creates an economic incentive for ISPs to host the appliances. The CDN’s “cost” to Netflix is hardware, not bandwidth - a one-time capital expense rather than a recurring per-GB transit fee.

The Steering Service

The Steering Service is the traffic cop for every video request. Its job is deceptively simple - map a user’s IP address and content request to the best available PoP URL - but the failure modes are severe: route to a sick PoP and thousands of users get buffering simultaneously.

Netflix’s steering uses a combination of BGP route data (which PoP is closest at the IP routing layer), real-time health scores (each PoP reports its current CPU utilization, disk throughput, and active connection count every 10 seconds), and EUFA (End User Functioning Area) groupings (geographic clusters of ISP customers who should share the same PoP). The steering logic prefers the closest healthy PoP, falls back to a secondary PoP in the same metro if the primary is overloaded (above 80% disk throughput utilization), and falls back to a regional cluster if both PoPs are unhealthy.

DNS-based steering has a well-known problem: DNS TTLs create stickiness. If a PoP becomes overloaded, new clients won’t be redirected away until their DNS cache expires (typically 30-60 seconds). Netflix addresses this by having clients send a small “steering check” HTTP request to the Steering Service before each new streaming session - not just on DNS resolution. The steering response is a redirect URL, not a DNS record, so it can change with zero TTL.

# Steering service - select optimal PoP for a user request
from dataclasses import dataclass
from typing import Optional
import hashlib

@dataclass
class PoPHealth:
    pop_id: str
    region: str
    isp_asn: int
    disk_throughput_pct: float  # 0-100, current utilization
    cpu_pct: float
    active_connections: int
    max_connections: int
    is_healthy: bool

def select_pop(
    user_asn: int,
    user_lat: float,
    user_lon: float,
    title_id: str,
    pops: list[PoPHealth],
) -> Optional[str]:
    """
    Returns the URL of the best PoP for this user + title combination.
    Prefers same-ASN PoPs (zero-transit path), then geographic proximity.
    """
    # Phase 1: filter unhealthy and overloaded PoPs
    eligible = [
        p for p in pops
        if p.is_healthy
        and p.disk_throughput_pct < 80.0
        and p.active_connections < p.max_connections * 0.9
    ]

    if not eligible:
        return None  # fallback to regional cluster

    # Phase 2: strongly prefer same-ASN PoP (avoids public internet transit)
    same_asn = [p for p in eligible if p.isp_asn == user_asn]
    if same_asn:
        # Among same-ASN PoPs, pick lowest current load
        return min(same_asn, key=lambda p: p.disk_throughput_pct).pop_id

    # Phase 3: fall back to closest PoP by geographic distance
    def haversine_approx(lat1, lon1, lat2, lon2):
        import math
        dlat = math.radians(lat2 - lat1)
        dlon = math.radians(lon2 - lon1)
        a = math.sin(dlat/2)**2 + math.cos(math.radians(lat1)) * \
            math.cos(math.radians(lat2)) * math.sin(dlon/2)**2
        return 2 * math.asin(math.sqrt(a)) * 6371

    pop_coords = {
        "pop-nyc": (40.71, -74.00),
        "pop-lax": (34.05, -118.24),
        "pop-lhr": (51.51, -0.12),
    }

    for pop in eligible:
        coords = pop_coords.get(pop.pop_id, (0, 0))
        pop.distance_km = haversine_approx(user_lat, user_lon, coords[0], coords[1])

    return min(eligible, key=lambda p: getattr(p, 'distance_km', 9999)).pop_id
Real World

Netflix’s Steering Service is built on top of AWS Route 53 geo-routing combined with a proprietary health-check system. When they first deployed Open Connect in 2012, they discovered that naive geographic proximity routing sent too much traffic through overloaded Tier-1 ISP peering points. The ASN-based routing preference - routing within the ISP’s own network whenever possible - reduced average TTFB by 40ms and eliminated a class of buffering that was caused by internet backbone congestion, not PoP capacity.

Cache Population and Prefetching

Every PoP has a fixed storage budget - 100TB of NVMe flash. The Netflix catalog in all its bitrate variants totals far more than any single PoP can hold. The Cache Advisor must answer: “given 100TB, which 100TB of content should live at PoP-NYC tonight?”

The naive approach is to cache by global popularity: just put the globally top-100 titles everywhere. This fails for two reasons. First, content preferences vary enormously by region - an anime series that is number 1 in Japan may rank 3,000th in Brazil. A PoP serving Japanese ISPs that caches Brazilian telenovelas wastes capacity. Second, global popularity lags viral trends: a series that starts trending on Twitter at 9pm will see a 10x traffic spike by 10pm, long before the next nightly fill cycle. Pure historical popularity misses the window entirely.

The Cache Advisor uses a multi-signal popularity forecast per PoP region:

  • 7-day view count for that PoP’s geographic region, weighted by recency (yesterday’s views count 2x more than last week’s)
  • Search and browse signals: titles users are browsing but not yet watching are early demand signals
  • Release calendar: all episodes of a new season dropping tomorrow get forced placement regardless of current score
  • Social trending score: a third-party social media API tracks title mentions in the PoP’s primary language; a 100% spike in mentions triggers a pre-emptive boost
  • Content size: large titles (a 4K HDR blockbuster might be 60GB per episode) get penalized in score relative to their storage cost, since 60GB displaces 15 smaller titles from the cache
# Cache Advisor - compute fill priority score for a PoP region
from dataclasses import dataclass
from typing import Dict
import math

@dataclass
class TitleSignals:
    title_id: str
    region_views_7d: int          # views from this PoP's region in last 7 days
    region_views_1d: int          # views from this PoP's region in last 24h
    global_views_7d: int          # global views for trending context
    social_mention_delta_pct: float  # % change in social mentions last 24h
    release_date_delta_days: int  # days until new release (negative = past)
    total_size_gb: float          # storage cost for all bitrates

def compute_fill_score(signals: TitleSignals, pop_capacity_gb: float) -> float:
    """
    Returns a priority score for prefilling this title to a PoP.
    Higher score = should be cached sooner and evicted later.
    """
    # Base score: regional popularity weighted toward recent
    recency_weighted_views = (
        signals.region_views_1d * 2.0 + signals.region_views_7d * 0.5
    )
    base_score = math.log1p(recency_weighted_views)

    # Boost for upcoming release (window: -3 to +7 days)
    release_boost = 0.0
    if -3 <= signals.release_date_delta_days <= 7:
        # Stronger boost the closer to release day
        days_abs = abs(signals.release_date_delta_days)
        release_boost = max(0.0, 5.0 - days_abs * 0.5)

    # Boost for social trending (viral content)
    social_boost = 0.0
    if signals.social_mention_delta_pct > 50.0:
        social_boost = math.log1p(signals.social_mention_delta_pct / 50.0) * 2.0

    # Penalty for large content (storage efficiency factor)
    # Normalize by median episode size (4 GB)
    size_penalty = math.log1p(signals.total_size_gb / 4.0) * 0.3

    final_score = base_score + release_boost + social_boost - size_penalty
    return max(0.0, final_score)

def build_fill_list(
    all_titles: list[TitleSignals],
    pop_capacity_gb: float,
) -> list[str]:
    """
    Returns ordered list of title_ids to fill, stopping when capacity is full.
    Forced (upcoming release within 3 days) titles always go first.
    """
    forced = [t for t in all_titles if -3 <= t.release_date_delta_days <= 3]
    optional = [t for t in all_titles if t not in forced]

    forced_sorted = sorted(forced, key=lambda t: compute_fill_score(t, pop_capacity_gb), reverse=True)
    optional_sorted = sorted(optional, key=lambda t: compute_fill_score(t, pop_capacity_gb), reverse=True)

    fill_list = []
    used_gb = 0.0
    for title in forced_sorted + optional_sorted:
        if used_gb + title.total_size_gb > pop_capacity_gb:
            break
        fill_list.append(title.title_id)
        used_gb += title.total_size_gb

    return fill_list
Watch Out

The biggest prefetch failure mode is over-filling a cache with content that never gets watched. A 4K HDR encode of a title that ranks 8,000th in a region occupies 100GB that could hold 25 episodes of a locally popular series. The storage cost penalty in the fill score is critical - without it, the fill algorithm always prefers large 4K blockbusters over an efficient mix of smaller popular titles, and hit rates collapse for everything outside the top-20.

Consistent Hashing for PoP Clusters

Each PoP location typically hosts between 4 and 30 physical OCA appliances. When a request arrives at a cluster, how does the load balancer decide which appliance to check? The naive answer - random or round-robin - fails because it spreads misses across all appliances. If title “Stranger Things S5” is not in appliance A, a miss-and-fetch from the Regional Cluster populates only appliance A. The next request might land on appliance B, which also misses, triggering another origin fetch. Under random routing, you’d need N fetches for an N-appliance cluster to warm up for a given title.

Consistent hashing solves this by mapping each title to a deterministic set of appliances. Think of a circular ring numbered 0 to 2^32. Each appliance maps to 150 virtual positions on the ring (virtual nodes improve load balance when appliances are added or removed). A title’s hash maps to a single position on the ring; the request routes to the appliance whose position is nearest clockwise. Every request for the same title goes to the same appliance (as long as that appliance is healthy), so the cache warms with a single fetch and subsequent requests hit reliably.

Consistent hash ring for PoP cluster appliance assignment showing token ranges per appliance

When an appliance is added to a cluster, only the titles that map to the range of positions adjacent to the new appliance need to migrate. This is approximately total_titles / (current_appliances + 1) titles - a small fraction. Without consistent hashing, adding a single appliance would require rehashing all titles, triggering a full cache miss storm as every title simultaneously re-routes to a different appliance.

# Consistent hash ring for routing title requests to appliances within a PoP cluster
import hashlib
import bisect
from typing import Optional

class ConsistentHashRing:
    def __init__(self, virtual_nodes_per_server: int = 150):
        self.virtual_nodes = virtual_nodes_per_server
        self.ring: dict[int, str] = {}  # hash_point -> appliance_id
        self.sorted_keys: list[int] = []
        self.live_appliances: set[str] = set()

    def _hash(self, key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2**32)

    def add_appliance(self, appliance_id: str):
        """Add a new appliance with virtual_nodes positions on the ring."""
        self.live_appliances.add(appliance_id)
        for i in range(self.virtual_nodes):
            virtual_key = f"{appliance_id}:vnode:{i}"
            h = self._hash(virtual_key)
            self.ring[h] = appliance_id
            bisect.insort(self.sorted_keys, h)

    def remove_appliance(self, appliance_id: str):
        """Remove an appliance (e.g., failed health check)."""
        self.live_appliances.discard(appliance_id)
        for i in range(self.virtual_nodes):
            virtual_key = f"{appliance_id}:vnode:{i}"
            h = self._hash(virtual_key)
            if h in self.ring:
                del self.ring[h]
                idx = bisect.bisect_left(self.sorted_keys, h)
                if idx < len(self.sorted_keys) and self.sorted_keys[idx] == h:
                    self.sorted_keys.pop(idx)

    def get_appliance(self, title_segment_key: str) -> Optional[str]:
        """
        Returns the appliance_id responsible for this title+segment key.
        Returns None if the ring is empty.
        """
        if not self.sorted_keys:
            return None
        h = self._hash(title_segment_key)
        idx = bisect.bisect_right(self.sorted_keys, h)
        if idx == len(self.sorted_keys):
            idx = 0  # wrap around the ring
        return self.ring[self.sorted_keys[idx]]

    def get_replica_appliances(self, title_segment_key: str, replicas: int = 2) -> list[str]:
        """
        Returns N distinct appliances for the same key (for redundancy).
        Each serves as a failover if the primary is unavailable.
        """
        if not self.sorted_keys:
            return []
        h = self._hash(title_segment_key)
        idx = bisect.bisect_right(self.sorted_keys, h)
        result = []
        seen = set()
        for _ in range(len(self.sorted_keys)):
            idx = idx % len(self.sorted_keys)
            app_id = self.ring[self.sorted_keys[idx]]
            if app_id not in seen:
                result.append(app_id)
                seen.add(app_id)
                if len(result) == replicas:
                    break
            idx += 1
        return result
Key Insight

Consistent hashing only works well for cache routing if the ring is balanced. Without virtual nodes, adding a 4th appliance to a 3-appliance cluster might leave one appliance handling 60% of titles (if it won the coin flip on ring positions). With 150 virtual nodes per appliance, the expected load imbalance drops to under 5% - the same statistical smoothing that makes bloom filters effective.

Cache Eviction Strategy

A full 100TB cache must evict something whenever new content arrives. The choice of eviction policy has a measurable impact on cache hit rate - the difference between a 95% hit rate and a 90% hit rate at scale translates to a 5x increase in origin traffic.

LRU (Least Recently Used) evicts whatever was accessed longest ago. This works for general-purpose caches but fails for Netflix’s viewing patterns in two ways. First, Netflix viewers have binge cycles: they watch 8 episodes in a row on Saturday and then not again for 3 weeks. Under LRU, a title that had 10,000 views on Saturday would be evicted by Monday, right before the next binge wave. Second, LRU is vulnerable to scan attacks - a one-time full-catalog sweep (like a quality control job that accesses every segment once) evicts all hot content in favor of content that was only accessed during the sweep.

LFU (Least Frequently Used) evicts whatever has the lowest total access count. This solves the binge cycle problem (highly accessed content stays) but creates a novelty trap: new releases with zero history are always evicted first, even if they are trending explosively. A new season of a popular show would be purged within hours on a full cache because its hit count starts at zero.

Netflix uses a popularity-weighted score that combines both frequency and recency:

# Popularity-weighted eviction score for CDN cache entries
import math
import time

def eviction_score(
    hit_count: int,
    first_accessed_epoch: float,
    last_accessed_epoch: float,
    content_size_gb: float,
    half_life_days: float = 3.0,
) -> float:
    """
    Compute a keep-score for a cache entry.
    Higher score = should stay in cache longer.
    Entries with the LOWEST score are evicted first.

    Formula: (frequency_score * recency_weight) / size_cost
    """
    now = time.time()

    # Frequency component: log-compressed hit count
    frequency_score = math.log1p(hit_count)

    # Recency component: exponential decay from last access
    age_days = (now - last_accessed_epoch) / 86400.0
    decay_constant = math.log(2) / half_life_days
    recency_weight = math.exp(-decay_constant * age_days)

    # Novelty bonus: content accessed for the first time in the last 24 hours
    # prevents new viral content from being immediately evicted
    is_new = (now - first_accessed_epoch) < 86400.0
    novelty_bonus = 2.0 if is_new else 1.0

    # Size cost: penalize large content relative to its popularity per GB
    # A 60GB movie with 100 hits contributes less value per GB than a 4GB episode with 100 hits
    size_cost = math.log1p(content_size_gb)

    score = (frequency_score * recency_weight * novelty_bonus) / size_cost
    return score
LRU vs LFU vs popularity-weighted eviction policy comparison with example cache state
Real World

Netflix published a 2017 tech blog post describing their cache eviction approach as based on “LRU-K” - a variant of LRU that tracks the K most recent access times rather than just the most recent one. The K=2 version is essentially an approximation of what we described: it keeps content that has been accessed multiple times recently, distinguishing between a true hot item and a one-time scan. The novelty bonus we added is Netflix’s adaptation for handling new release windows.

The Origin Shield Pattern

The worst failure mode for any CDN is a cache stampede at origin: all PoPs simultaneously miss on a newly released title and all send requests to S3 origin. For a title releasing globally at midnight UTC, with 2,000+ active PoPs, that could mean 2,000 simultaneous 60GB downloads from S3 - a 120TB burst that would saturate even a large S3 bucket’s egress capacity.

The Regional Cluster acts as an origin shield to prevent this. There are approximately 10-15 regional clusters globally, each covering a geographic group of PoPs. The hierarchy works like this: Edge PoP misses go to the Regional Cluster; Regional Cluster misses go to S3 origin. With 15 clusters and 2,000 PoPs, a miss storm produces at most 15 simultaneous S3 requests rather than 2,000. The shield absorbs the amplification.

Regional Clusters use a different caching strategy than Edge PoPs: they cache the full catalog at a lower bitrate set (everything up to 1080p), with 4K content only cached for the most popular titles. The goal is not latency optimization (Regional Clusters are not as close to users as Edge PoPs) but miss protection: as long as the Regional Cluster has the content, no request ever touches S3.

# Open Connect Appliance configuration example (simplified)
# Deployed on each OCA (Open Connect Appliance) at edge PoPs

oca_config:
  appliance_id: "oca-nyc-comcast-42"
  region: "us-east"
  isp_asn: 7922  # Comcast ASN
  storage:
    flash_capacity_gb: 102400  # 100 TB NVMe flash
    reserved_headroom_pct: 5   # keep 5% free for burst writes
  cache:
    eviction_policy: "popularity_weighted"
    eviction_half_life_days: 3
    min_hit_count_before_evict_eligible: 2  # new content gets a grace period
    max_segment_size_gb: 4                  # max segment size for a single cache entry
  prefetch:
    fill_window_start: "02:00"  # local PoP time
    fill_window_end: "06:00"
    fill_source: "regional_cluster"  # always pull from shield, never from S3
    max_concurrent_fill_streams: 20
  miss_handling:
    primary_upstream: "regional-cluster-us-east.openconnect.netflix.internal"
    fallback_upstream: "regional-cluster-us-central.openconnect.netflix.internal"
    connect_timeout_ms: 500
    read_timeout_ms: 5000
  health_reporting:
    interval_seconds: 10
    report_endpoint: "steering.openconnect.netflix.com"
    metrics: ["disk_throughput_pct", "active_connections", "cache_hit_rate_1m"]
Key Insight

The origin shield is not just a cache - it is a concurrency limiter. For a title releasing globally at midnight, the Regional Cluster can receive 500 concurrent PoP miss requests but will deduplicate them into a single fetch from S3 using a “request coalescing” pattern: the first miss triggers the S3 fetch; subsequent misses for the same content block and wait for the first fetch to complete before being served from the cluster cache. This is the same technique nginx uses with its proxy_cache_lock directive.

Cache Invalidation

Netflix’s CDN must support two types of invalidation: content removal (a title is taken down due to licensing expiry) and content replacement (an episode segment is re-encoded with better compression). Both require pushing a “stop serving this URL” directive to all 200+ PoPs within 60 seconds.

Netflix uses a combination of URL versioning and explicit purge propagation. All segment URLs embed a content hash: /content/{title_id}/{episode_id}/{bitrate}/{segment_seq}/{content_hash}.ts. When an episode is re-encoded, the content hash changes, so the new URL is different. Old PoPs continue serving the old bytes until they expire naturally (via the eviction score dropping to zero). For urgent removal (copyright violation, content recall), an explicit purge directive is published to a Kafka topic that all OCA appliances subscribe to.

-- Cache invalidation log: tracks active purge directives per content URL
CREATE TABLE cache_purge_directive (
    directive_id    BIGINT          GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title_id        VARCHAR(64)     NOT NULL,
    url_pattern     VARCHAR(512)    NOT NULL,  -- prefix or exact match
    reason          VARCHAR(32)     NOT NULL CHECK (reason IN ('takedown', 'reencode', 'abuse')),
    issued_at       TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    expires_at      TIMESTAMPTZ     NOT NULL,  -- how long PoPs must block this URL
    issued_by       VARCHAR(128)    NOT NULL,  -- service account
    acknowledgments JSONB           NOT NULL DEFAULT '{}',  -- pop_id -> ack_time
    status          VARCHAR(16)     NOT NULL DEFAULT 'active'
        CHECK (status IN ('active', 'expired', 'revoked'))
);

CREATE INDEX idx_purge_active ON cache_purge_directive (status, expires_at)
    WHERE status = 'active';

CREATE INDEX idx_purge_title ON cache_purge_directive (title_id, status);
Watch Out

URL-versioning-based invalidation has a subtle race condition: a client that received a manifest with old segment URLs will keep requesting old URLs for the duration of their stream even after the content is taken down. Without an explicit in-flight blocking check at the OCA (not just removing the cached file), the OCA will serve the old bytes from cache until they are evicted. The correct fix is for OCAs to check the purge directive list on every request - not just on cache writes - and return HTTP 410 Gone for purged URLs even if the bytes are still physically present on disk.

Data Model

CDN request data flow showing cache hit path, miss escalation, and origin shield pattern
-- Core data model for Open Connect cache management

-- Catalog of all cacheable content segments
CREATE TABLE cdn_segment (
    segment_id          BIGINT          GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    title_id            VARCHAR(64)     NOT NULL,
    episode_id          VARCHAR(64),    -- NULL for movies
    bitrate_kbps        INTEGER         NOT NULL,
    resolution          VARCHAR(16)     NOT NULL,  -- 240p, 480p, 720p, 1080p, 2160p
    segment_sequence    INTEGER         NOT NULL,  -- 0-indexed segment within episode
    duration_ms         INTEGER         NOT NULL,
    size_bytes          BIGINT          NOT NULL,
    content_hash        CHAR(32)        NOT NULL,  -- MD5 of segment bytes
    s3_uri              VARCHAR(512)    NOT NULL,  -- authoritative source
    codec               VARCHAR(16)     NOT NULL DEFAULT 'h264',
    created_at          TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    UNIQUE (title_id, episode_id, bitrate_kbps, segment_sequence, content_hash)
);
CREATE INDEX idx_segment_title ON cdn_segment (title_id, bitrate_kbps);

-- Per-PoP cache state (managed by OCAs, reported to central inventory)
CREATE TABLE pop_cache_entry (
    pop_id              VARCHAR(64)     NOT NULL,
    segment_id          BIGINT          NOT NULL REFERENCES cdn_segment(segment_id),
    cached_at           TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    last_accessed_at    TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    hit_count           BIGINT          NOT NULL DEFAULT 0,
    eviction_score      FLOAT,          -- recomputed hourly by OCA
    PRIMARY KEY (pop_id, segment_id)
);
CREATE INDEX idx_cache_eviction ON pop_cache_entry (pop_id, eviction_score ASC)
    WHERE eviction_score IS NOT NULL;

-- Fill queue: segments scheduled for prefetch to a given PoP
CREATE TABLE pop_fill_queue (
    queue_id            BIGINT          GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    pop_id              VARCHAR(64)     NOT NULL,
    segment_id          BIGINT          NOT NULL REFERENCES cdn_segment(segment_id),
    fill_priority       FLOAT           NOT NULL,  -- from Cache Advisor score
    scheduled_at        TIMESTAMPTZ     NOT NULL DEFAULT NOW(),
    started_at          TIMESTAMPTZ,
    completed_at        TIMESTAMPTZ,
    status              VARCHAR(16)     NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'in_progress', 'completed', 'failed'))
);
CREATE INDEX idx_fill_queue_pending ON pop_fill_queue (pop_id, fill_priority DESC)
    WHERE status = 'pending';

Key Algorithms and Protocols

Popularity-Based Prefetch Forecast

The Cache Advisor forecasts demand per PoP region using a simple but effective two-component model. The base forecast uses exponentially weighted moving average (EWMA) of view counts per title per region, with a short-term window (1 day) blended with a long-term window (7 days). The release calendar injects step-function demand increases for content releasing within 72 hours.

# EWMA popularity forecast for prefetch decisions
from typing import List
import math

def ewma_forecast(
    daily_views: List[float],  # most recent day first, up to 30 days
    alpha_short: float = 0.3,  # weight for short-term trend (recent days)
    alpha_long: float = 0.05,  # weight for long-term baseline
) -> float:
    """
    Combines short-term and long-term EWMA to forecast next-day views.
    Short-term EWMA is reactive to viral spikes; long-term provides stability.
    Returns predicted views for tomorrow.
    """
    if not daily_views:
        return 0.0

    # Short-term EWMA (last 7 days)
    short_ewma = daily_views[0]
    for v in daily_views[1:7]:
        short_ewma = alpha_short * v + (1 - alpha_short) * short_ewma

    # Long-term EWMA (all available days)
    long_ewma = daily_views[0]
    for v in daily_views[1:]:
        long_ewma = alpha_long * v + (1 - alpha_long) * long_ewma

    # Blend: 70% short-term responsiveness, 30% long-term stability
    forecast = 0.7 * short_ewma + 0.3 * long_ewma
    return max(0.0, forecast)

def apply_release_calendar_boost(
    base_forecast: float,
    days_until_release: int,
    title_historical_peak: float,
) -> float:
    """
    Applies a multiplicative boost for upcoming releases.
    Uses historical peak views as the ceiling for the boost.
    """
    if days_until_release > 7 or days_until_release < -1:
        return base_forecast

    # Release boost follows a gamma-like curve peaking at release day
    release_day_factor = {
        7: 1.2, 6: 1.3, 5: 1.5, 4: 1.8, 3: 2.5,
        2: 4.0, 1: 6.0, 0: 8.0, -1: 5.0
    }
    factor = release_day_factor.get(days_until_release, 1.0)
    boosted = base_forecast * factor

    # Cap at 1.2x historical peak to prevent over-allocation
    return min(boosted, title_historical_peak * 1.2)
Key Insight

The release calendar boost is the single highest-ROI intervention in the prefetch pipeline. Without it, a new season of a popular show would start its release day with a cold cache, because the EWMA forecast (trained on data when the title was not releasing) would not predict the demand spike. The 8x boost on release day is aggressive but empirically correct - popular new releases typically see 6-10x their pre-release traffic on day one.

Scaling and Performance

Capacity Estimation - Netflix CDN

Given:
  - 280 million subscribers globally
  - Peak concurrent streams: ~40 million (14% of subscribers during prime time)
  - Average bitrate: 8 Mbps (mix of HD and 4K streams)
  - Average segment size: 2 seconds at 8 Mbps = 2 MB per segment request
  - Segment request rate: 40M concurrent / 2s per segment = 20M segment requests/second

Bandwidth at edge:
  - 40M concurrent streams * 8 Mbps = 320 Tbps aggregate edge bandwidth
  - Per PoP (200 PoPs): average 1.6 Tbps; peak PoPs (NYC, LA, London) ~10 Tbps

Storage per edge PoP:
  - 100 TB NVMe per appliance, 4-10 appliances per PoP
  - 400 TB to 1 PB per PoP total cache
  - Netflix catalog: ~15,000 titles * 20 GB avg (all bitrates): ~300 TB
  - Top-500 titles cover ~90% of demand; top-500 * 20 GB = 10 TB - fits easily in 100 TB

Regional Cluster (Origin Shield):
  - 5 PB per cluster, 12 clusters globally = 60 PB warm tier
  - Netflix catalog in all bitrates ~1 EB (with 4K variants)
  - Clusters cache 1080p and below for all titles = ~100 TB catalog
  - 4K variants cached for top-5000 titles: 5000 * 60 GB = 300 TB
  - Total per cluster: ~400 TB well within 5 PB

Origin (S3) request rate:
  - Target: under 5% miss rate at edge PoPs
  - 5% of 20M req/s = 1M req/s miss rate to Regional Clusters
  - Regional Cluster hit rate target: 99% (only 1% reach S3)
  - S3 requests: 10,000 req/s - manageable with S3 request rate limits

Prefetch daily data volume:
  - 200 PoPs * average 10 TB nightly fill = 2 PB transferred nightly
  - Spread over 4-hour window at 2am: 2 PB / 4h = 139 GB/s aggregate fill bandwidth

The dominant scaling lever is the tiered caching hierarchy. Edge PoPs handle the vast majority of traffic using small, hot caches. Regional Clusters handle the long tail of content that doesn’t fit in Edge PoPs. This two-tier approach reduces S3 origin traffic by approximately 99.9%, meaning that Netflix’s compute bill for serving video bytes is dominated by the cost of running OCAs (which Netflix pays for via hardware procurement) rather than S3 egress (which would be enormous at this scale - 15 PB/day at $0.09/GB would cost $1.35M per day).

Real World

Netflix’s Open Connect program, launched publicly in 2012, resulted in measurable improvements in streaming quality that were reported in their ISP Speed Index. ISPs that hosted OCAs saw their Netflix ranking jump from the 20th percentile to the 95th percentile of streaming quality. By 2023, Open Connect served over 99% of Netflix traffic globally. The program effectively turned ISPs from Netflix’s bandwidth vendor into Netflix’s CDN partners, aligning economic incentives: the ISP saves transit costs, Netflix gets lower latency, users get better quality.

Failure Modes and Recovery

FailureDetectionImpactRecovery
OCA appliance disk failureSMART alerts + health check latency spike within 30sRequests re-route to sibling appliances in cluster; consistent hash ring updated to exclude failed nodeReplacement appliance auto-provisioned; cache refills from Regional Cluster during next fill window
PoP network partition (ISP peering down)Steering health checks fail; appliance stops responding within 10sAll users on that ISP routed to next-closest PoP; increased latency for affected usersISP peering restored by ISP NOC; Steering Service automatically re-routes once health checks pass
Cache fill storm overloads Regional ClusterCluster egress bandwidth exceeds 90% for 60sFill queue backs up; PoPs don’t receive scheduled prefetch contentFill queue implements exponential backoff; fills deprioritized to off-peak hours; forced release fills get queue priority
Content purge propagation failurePurge acknowledgment check finds PoP did not acknowledge within 60sPurged content continues to be served from non-acknowledging PoPSteering Service removes non-acknowledging PoP from rotation until purge is confirmed; PoP re-syncs purge log on reconnect
Cache Advisor produces corrupt fill listCanary PoP reports anomalous hit rate drop after applying new listTop-N fill predictions are wrong; cache hit rate falls below 92% thresholdFill list rollback to previous day’s list; alert to Cache Advisor team; manual review before next fill cycle
Consistent hash ring misconfigured (appliance added without ring update)Requests for some title hash ranges return 404 missCache miss storm for titles in orphaned hash rangeRing update applied and replicated to all OCAs within 30s; affected titles re-routed to correct appliance
Watch Out

The most dangerous operational mistake in CDN management is cache invalidation without verifying downstream PoP acknowledgment. An urgent content takedown that “succeeds” at the Steering Service level (the URL is removed from the manifest generator) but fails at 50 PoPs leaves those PoPs serving the content directly to any client that still has the old URL - including cached manifests. Production-grade invalidation requires a two-phase commit: issue the purge, then verify that every PoP in scope has acknowledged before declaring success.

Comparison of Approaches

ApproachCache Hit RateEviction QualityHandling Viral SpikesComplexityBest Fit
Pure LRU88-92%Poor (scan-sensitive)Fails (viral evicted by scans)LowGeneral-purpose web caches
Pure LFU90-93%Medium (recency blind)Fails (new content evicted)MediumStable access patterns
ARC (Adaptive Replacement)93-96%Good (auto-tunes)Medium (ghost lists help)HighMixed workloads without release cycles
Popularity-Weighted (Netflix)95-98%ExcellentExcellent (novelty bonus)HighVideo streaming with known release calendar
Prefetch-only (no eviction logic)Up to 99% if predict correctN/A (fill-driven)Good if predictedVery HighHighly predictable demand (live sports)

Netflix’s popularity-weighted approach wins for video streaming because it correctly handles both the high-frequency long-tail (classic series with steady viewership) and the viral spike case (new release explodes in demand overnight). The ARC algorithm is a strong general-purpose alternative, but it lacks the domain-specific release calendar boost that accounts for the largest demand spikes. For live sports or scheduled events where demand is almost perfectly predictable, pure prefetch with no eviction (hold the content for the event window and then flush) achieves near-100% hit rates but requires knowing the demand schedule upfront.

The key trade-off the comparison table doesn’t capture is storage efficiency per hit. LFU optimizes for maximizing hits on existing cached content; LRU optimizes for recency; popularity-weighted optimizes for expected future hits per GB of cache consumed. That last metric - future hits per GB - is the right one for a CDN where storage is the binding constraint.

Key Takeaways

  • Open Connect’s ISP co-location eliminates internet transit entirely for the last mile, reducing latency by 40-60ms compared to commercial CDN deployments that traverse public internet exchange points.
  • Consistent hashing in PoP clusters ensures that the same title always routes to the same appliance, so a single cache miss warms the cache for all subsequent requests rather than every request triggering a separate miss.
  • Cache Advisor nightly fill decouples the prediction problem (what will be popular tomorrow?) from the serving problem (is this content available now?), allowing expensive ML-based forecasting to run offline without adding latency to user requests.
  • Popularity-weighted eviction outperforms both LRU and LFU for video streaming because it captures both recency (active viewing cycle) and frequency (catalog depth) while penalizing large content that consumes disproportionate storage.
  • Origin shield (Regional Cluster) absorbs cache miss amplification from 200+ Edge PoPs into at most 15 simultaneous S3 requests, preventing new release traffic spikes from overwhelming origin.
  • Request coalescing at the Regional Cluster deduplicates concurrent misses for the same content into a single origin fetch, preventing a thundering herd at midnight-UTC global releases.
  • Cache invalidation requires two-phase acknowledgment: issuing a purge directive is not the same as verifying it was applied - production CDNs must track per-PoP acknowledgments and escalate non-responding PoPs.
  • URL versioning handles re-encodes gracefully without invalidation overhead: new content hash means new URL, so old and new coexist in cache without conflict until the old version’s eviction score drops to zero.

The deepest counter-intuitive lesson from this system: the hard problem in CDN design is not serving content fast - flash storage and HTTP/2 byte-range serving are commodity technology. The hard problem is the prediction engine that fills the cache with the right content before demand arrives. A CDN with perfect prefetch and naive serving code will outperform a CDN with perfect serving code and naive prefetch every time, because no serving optimization can compensate for a cache miss that sends the user’s request halfway around the world.

Frequently Asked Questions

Q: Why not use a commercial CDN like Akamai or Cloudflare instead of building Open Connect? A: Commercial CDNs charge per-GB bandwidth fees. At Netflix’s scale (15 PB/day), even at deeply negotiated rates of $0.01/GB, the daily cost would exceed $150,000 - $55M/year - for bandwidth alone. Open Connect’s cost is hardware procurement and co-location power, which amortizes over 3-5 years of appliance lifetime. Beyond cost, Netflix gets control over the prefetch pipeline and cache policy, which commercial CDNs do not expose. The ISP partnership model also creates a quality alignment that commercial CDNs cannot match: the ISP’s network performance directly affects Netflix’s streaming quality, giving the ISP an incentive to maintain peering health.

Q: Why is consistent hashing used within a PoP cluster rather than just replicating all content to all appliances? A: Full replication would require every miss-fetched segment to be written to all N appliances - N times the origin bandwidth and N times the fill write traffic. With 10 appliances in a cluster and 40Gbps fill bandwidth each, replication would consume 400Gbps of Regional Cluster egress just for the fill write. Consistent hashing reduces this to 1x the fill bandwidth. The trade-off is that a failed appliance creates a miss storm for all titles that mapped to it - but this is handled by consistent hashing’s graceful ring rebalancing, which remaps only the affected title range to the next healthy appliance.

Q: How does Netflix handle the cold start problem for a completely new title with no viewing history? A: New titles are categorized by type (movie vs. series), genre, and country of origin. The Cache Advisor looks up historical demand curves for titles with similar attributes (genre + production country + release type) and uses that as a proxy baseline for the first 48 hours. The release calendar boost ensures that even a brand-new title with no history gets prefilled to relevant PoPs if it has a scheduled release date. After 24 hours of real view data, the EWMA forecast takes over from the proxy baseline.

Q: Why not cache at the user’s device (aggressive browser/app caching) and reduce CDN load? A: Device caching works for small repeated assets (thumbnails, UI assets) but not for video segments. A 2-second video segment at 1080p is 2-4 MB. Prefetching even one minute of video requires 60-120 MB of device storage and 60-120 MB of network bandwidth - and the prefetched segment may be the wrong bitrate (if network conditions change) or a segment the user never watches (if they skip forward). Netflix’s adaptive bitrate algorithm already buffers 30-60 seconds of video in the player, which is the practical limit of useful prefetch without prediction of future user behavior.

Q: How does the system maintain cache consistency when an episode is re-encoded with better compression? A: Re-encoded episodes get a new content_hash in their URL. Old URLs served from cache continue working until their eviction score drops to zero. New requests are served the new URL from the updated manifest. The transition is gradual - some users stream from old cache, some from new cache, but both are valid encodings of the same content. For urgent re-encodes (quality bug in a released segment), Netflix can accelerate the transition by artificially setting old segment eviction scores to zero, triggering immediate eviction and forcing refill with the new encoding.

Q: What happens when the Cache Advisor predicts wrong and a surprise viral title is not cached? A: Cache misses escalate to the Regional Cluster, which has a broader catalog. The Regional Cluster absorbs the initial miss storm. Simultaneously, the real-time monitoring pipeline detects the anomalous hit rate on the Regional Cluster for that title and triggers an emergency prefetch push to Edge PoPs outside the normal nightly fill window. This “hot fill” mechanism pushes the viral title to the top of every relevant PoP’s fill queue, completing cache warm-up within 30-60 minutes of the demand spike being detected - fast enough to serve the second wave of demand from cache.

Interview Questions

Q: How would you design the consistent hashing ring to minimize disruption when adding a new appliance to a busy PoP cluster? Expected depth: Discuss virtual node count (150 virtual nodes per appliance means new appliance only displaces 1/N of all titles), the rebalancing protocol (read from old appliance during transition, write to new appliance to warm it), and why you need a brief dual-serving window where both old and new appliances respond to the rebalancing range to avoid a miss storm. Cover the trade-off between rebalancing speed (fast = less inconsistency window) and write amplification (slow = less origin stress).

Q: Walk me through how you would handle a global content takedown (legal order) that must complete in under 5 minutes across 200+ PoPs. Expected depth: Discuss the two-phase approach: first, update the Steering Service to stop routing new requests to the purged URL (takes effect immediately for new streams); second, push an explicit purge directive to all OCAs via a Kafka topic with per-PoP acknowledgment tracking. Cover what happens to in-flight streams (they continue until the segment buffer is exhausted, then the next segment request gets a 410 response). Discuss the verification gap: you declare success only when 100% of PoPs have acknowledged, with escalation for non-acknowledging PoPs (remove from rotation, force purge on reconnect).

Q: How would you redesign the Cache Advisor to reduce the prefetch miss rate for viral content? Expected depth: Discuss adding a real-time event detection path (social media API polling every 5 minutes rather than daily batch) that can trigger emergency mid-day fills. Cover the trade-off between fill frequency (more fills = fresher cache = higher hit rate) and fill cost (every mid-day fill consumes OCAs’ disk write bandwidth that could be serving user traffic). Discuss using Twitter/Reddit mention velocity as a leading indicator of demand spikes, typically 30-90 minutes ahead of the viewing spike.

Q: Design the request coalescing mechanism at the Regional Cluster to prevent thundering herd on new releases. Expected depth: Discuss the “cache lock” pattern: the first miss for a segment acquires a per-segment lock, initiates the S3 fetch, and queues all subsequent misses as waiters on that lock. When the fetch completes, all waiters are served from the now-filled cache. Cover the failure mode where the S3 fetch fails (the lock must time out and release all waiters with an error, not hold them forever). Discuss in-memory vs. Redis-based lock storage, and why in-memory is preferred for a single-cluster coalescing layer (no distributed lock coordination overhead, and the lock only needs to survive within the cluster’s request handling).

Q: How would you monitor cache health across 200+ PoPs and detect a failing Cache Advisor before users experience buffering? Expected depth: Discuss the key metric hierarchy: per-PoP cache hit rate (target 95%), time-to-first-byte (target p99 under 50ms), and Regional Cluster request rate (should be stable; a spike indicates widespread Edge PoP misses). Cover leading indicators vs. lagging indicators: hit rate drop is a lagging indicator (users are already missing); Regional Cluster request rate spike is a leading indicator (misses are happening before users see buffering). Discuss anomaly detection - using EWMA on the Regional Cluster request rate to detect spikes that are 2 standard deviations above baseline, triggering an emergency hot fill rather than waiting for user complaints.

Premium Content

Unlock the full article along with everything else in the archive — all in one place.

In-depth analysis Expert insights Full archive access
Unlock Full Article