Tiered caching uses multiple cache layers with different latency, capacity, and cost characteristics. L1 (in-process) provides microsecond access, L2 (distributed) provides millisecond access at scale, and L3 (CDN) provides global reach. Requests flow through tiers until data is found or the origin is reached.
Tiered caching, also called multi-layer or hierarchical caching, arranges multiple cache layers in a hierarchy where each tier offers different trade-offs between latency, capacity, and scope. The closest tier (L1, in-process) provides the fastest access but has the smallest capacity and is local to a single application instance. The next tier (L2, distributed) is shared across all instances and offers much larger capacity, at the cost of a network round trip (milliseconds rather than microseconds). The outermost tier (L3, CDN) provides global reach with massive capacity but is limited to content that can be served without application logic.
L1 caches run in the application process itself -- libraries like Caffeine (JVM), Guava Cache (JVM), or simple in-memory maps. Access latency is measured in microseconds (0.1-1 µs) because there is no network round trip. However, L1 caches are limited by the application's heap size (typically 100MB-2GB) and are not shared across instances. Each application server has its own L1 cache, which means the same data may be cached N times across N servers, and cache coherence is harder to maintain.
L2 caches are distributed caching systems like Redis or Memcached that sit on dedicated servers accessible over the network. Access latency is 0.5-2ms (network round trip), but capacity can be terabytes across a cluster. L2 caches are shared across all application instances, providing a single source of cached truth.
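As a rough illustration of what an L1 tier involves, here is a minimal in-process cache sketch in Python with a size bound (least-recently-used eviction) and a per-entry TTL. The class name `L1Cache`, its defaults, and the injectable `clock` parameter are assumptions made for this example; production libraries such as Caffeine are considerably more sophisticated, and this sketch is not thread-safe.

```python
import time
from collections import OrderedDict


class L1Cache:
    """Illustrative in-process cache: LRU eviction plus a per-entry TTL."""

    def __init__(self, max_entries=10_000, ttl_seconds=30.0, clock=time.monotonic):
        self._data = OrderedDict()  # key -> (expires_at, value), oldest use first
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time

    def get(self, key):
        """Return the cached value, or None on a miss or an expired entry."""
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._data[key]  # lazily drop the expired entry
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        """Store a value and evict least-recently-used entries over the bound."""
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max_entries:
            self._data.popitem(last=False)

    def evict(self, key):
        """Remove a key if present (used for explicit invalidation)."""
        self._data.pop(key, None)
```

Because `get` returns `None` on a miss, this sketch cannot distinguish a cached `None` from an absent key; a real implementation would use a sentinel or a separate "found" flag.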
The request flow in a tiered architecture is straightforward: check L1 first, then L2, then L3 (if applicable), and finally the origin (database or API). When data is found at any tier, it is returned to the caller and optionally promoted into the faster tiers in front of it. For example, an L2 hit might populate the L1 cache so subsequent requests from the same instance get microsecond access.
Cache coherence across tiers is the primary challenge. When data changes, every tier must be invalidated or updated. Common approaches include pub/sub invalidation (Redis pub/sub or Kafka events notify all L1 caches to evict a key), TTL cascading (L1 uses shorter TTLs than L2, which uses shorter TTLs than L3, so the copies that are hardest to invalidate expire soonest), and write-through at each tier.
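The lookup order, promotion on a hit, and TTL cascading described above can be sketched as follows. This is a simplified model, not a reference implementation: `TtlStore` is a hypothetical in-memory stand-in for any tier (a real L2 would be a Redis or Memcached client), and `TieredCache`, the tier names, and the TTL values are assumptions chosen for illustration.

```python
import time


class TtlStore:
    """Hypothetical stand-in for one cache tier: a dict with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or self._clock() >= entry[0]:
            self._data.pop(key, None)
            return None
        return entry[1]

    def put(self, key, value):
        self._data[key] = (self._clock() + self._ttl, value)

    def evict(self, key):
        self._data.pop(key, None)


class TieredCache:
    """Checks tiers fastest-first and promotes hits into the faster tiers."""

    def __init__(self, tiers, load_from_origin):
        self._tiers = tiers  # list of (name, store), fastest tier first
        self._load_from_origin = load_from_origin

    def get(self, key):
        """Return (value, source), where source names the tier that answered."""
        for index, (name, store) in enumerate(self._tiers):
            value = store.get(key)
            if value is not None:
                # Promote into every faster tier so the next read is cheaper.
                for _, faster in self._tiers[:index]:
                    faster.put(key, value)
                return value, name
        value = self._load_from_origin(key)
        for _, store in self._tiers:
            store.put(key, value)
        return value, "origin"

    def invalidate(self, key):
        """Evict the key from every tier after the underlying data changes."""
        for _, store in self._tiers:
            store.evict(key)
```

A usage sketch under the same assumptions: `TieredCache([("L1", TtlStore(30)), ("L2", TtlStore(300))], load_from_origin=fetch_row)`, where `fetch_row` is whatever function reads from the database. Note that a promoted entry receives the faster tier's full TTL, so a value promoted just before its L2 copy expires can outlive that copy by up to the L1 TTL.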
Not every system needs tiered caching. A single Redis cache (L2 only) is sufficient for most applications. Add L1 when network round trips to Redis dominate your latency budget and you can tolerate brief inconsistency between instances. Add CDN (L3) when you serve content globally and the origin cannot handle the aggregate request volume from all edge locations. Each tier adds operational complexity: L1 requires coherence management, L2 requires cluster operations, and CDN requires cache purge workflows. Start with one tier and add layers only when measured performance data justifies the complexity.
The Office Supply Closet
Think of caching tiers like office supplies. Your desk drawer (L1) has a few pens and sticky notes -- instant access but limited space. The supply closet down the hall (L2) has more of everything -- a quick walk to get what you need. The warehouse across town (L3/CDN) has every supply the company stocks -- takes a delivery truck to get things, but capacity is enormous. The factory (origin/database) can make anything but takes the longest. You check your desk first, then the closet, then order from the warehouse, and only call the factory as a last resort. Most days, your desk and the closet have everything you need.
Netflix (EVCache)
Netflix uses EVCache, a tiered caching system built on top of Memcached. L1 is an in-process cache within each microservice instance, holding the hottest data (subscriber profiles, active viewing sessions) with a 30-second TTL. L2 is a distributed Memcached cluster (EVCache) shared across all instances in a region, with 5-minute TTLs. Netflix replicates EVCache data across AWS regions for disaster recovery. The L1 tier handles approximately 80% of cache requests without a network round trip.
Facebook (TAO)
Facebook's TAO (The Associations and Objects) cache is a tier-aware caching system for the social graph. The leader cache tier (close to the database) handles writes and serves as the source of truth for cache data. Follower cache tiers (distributed globally) serve reads from cached data, with invalidation propagated from leaders to followers via a dedicated invalidation pipeline. This tiered approach allows Facebook to serve trillions of social graph queries per day from cache, with only a small fraction reaching the MySQL databases.
Cloudflare (Tiered Caching)
Cloudflare's tiered caching adds a regional 'upper tier' between its edge data centers (200+ cities) and the customer's origin server. Instead of each edge independently fetching from the origin on a cache miss, misses are routed to a regional tier first. If the regional tier has the content, the edge is served without touching the origin. This reduces origin requests by 90%+ for popular content. Cloudflare also supports Argo Tiered Caching, which adds additional tiers for customers needing maximum origin offload.
| Aspect | Description |
|---|---|
| Latency Reduction vs Coherence Complexity | Each additional cache tier reduces average latency (L1 hits avoid network entirely), but maintaining coherence across tiers requires invalidation mechanisms (pub/sub, event streams) or TTL cascading. A bug in coherence logic can cause persistent stale data that is difficult to diagnose because it appears correct in the database and L2 but wrong in specific L1 instances. |
| Memory Efficiency vs Duplication | L1 caches are per-instance, so the same data is duplicated across all application instances. With 50 instances, the top 1,000 entries are stored 50 times in L1 caches plus once in L2. This duplication is the cost of microsecond L1 access. Limiting L1 to only the hottest data (top 1-5% of keys by request rate) minimizes waste. |
| Operational Complexity vs Performance Benefit | Each tier requires separate monitoring, alerting, capacity planning, and failure handling. L1 failures are invisible (handled by L2 fallback), but L2 cluster operations (scaling, failover, upgrades) affect all application instances. CDN cache purge workflows add deployment complexity. The performance benefit must justify the operational burden for each tier. |
| Cold Start Impact vs Steady-State Performance | Multi-tier caching amplifies cold start effects. After a deployment or restart, L1 caches are empty across all instances, pushing all requests to L2. If L2 was also cleared (cluster upgrade), everything hits the origin. Cache warming at each tier mitigates this but adds startup time and complexity. Steady-state performance is excellent, but transitions are vulnerable. |
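The duplication cost in the "Memory Efficiency vs Duplication" row can be estimated with simple arithmetic. The sketch below uses the 50 instances and 1,000 entries from the table; the 1 KiB average entry size is purely an assumption for illustration, since the source gives no entry size.

```python
def l1_footprint_bytes(instances, entries_per_instance, avg_entry_bytes):
    """Total memory spent on L1 copies across the whole fleet."""
    return instances * entries_per_instance * avg_entry_bytes


# 50 instances, 1,000 hot entries each, assumed 1 KiB per entry.
total = l1_footprint_bytes(50, 1_000, 1_024)
single_copy = l1_footprint_bytes(1, 1_000, 1_024)

print(f"fleet-wide L1 footprint: {total / 2**20:.1f} MiB")
print(f"one copy (as held in L2): {single_copy / 2**20:.2f} MiB")
```

Under these assumptions the fleet holds 50 times the memory of a single copy, which is why restricting L1 to the hottest keys keeps the overhead modest.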
Netflix EVCache: Two-Tier Caching for 200+ Million Subscribers
Scenario
Netflix serves over 200 million subscribers globally, with each user session requiring dozens of cache lookups per second for personalization, recommendations, and content metadata. A single distributed cache tier (Memcached) handled the load but added 0.5-2ms of network latency per lookup. For a page load requiring 50 cache lookups, the aggregate cache latency was 25-100ms when the lookups ran one after another -- a significant fraction of the total response time.
Solution
Netflix added an L1 in-process cache within each microservice instance, creating a two-tier architecture. The L1 cache (backed by a simple ConcurrentHashMap with size-based eviction) stored the hottest 10,000 entries per instance with a 30-second TTL. L1 hits (microsecond latency) eliminated the network round trip to EVCache (L2). Cache invalidation used a pub/sub mechanism: when data changed in the backend, an invalidation event was published to all L1 caches via a lightweight event bus. The short L1 TTL ensured that even if an invalidation was missed, stale data would expire within 30 seconds.
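The combination of pub/sub invalidation with a short TTL as a safety net can be modeled in a few lines. This is a sketch of the general pattern only, not Netflix's code: `InvalidationBus` is a hypothetical in-memory stand-in for a networked event bus, and `InstanceL1` represents the L1 cache inside one service instance.

```python
import time


class InvalidationBus:
    """Hypothetical in-memory event bus; a real one would cross the network."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish_invalidation(self, key):
        # Deliver the eviction notice to every subscribed L1 cache.
        for callback in self._subscribers:
            callback(key)


class InstanceL1:
    """L1 cache for one service instance; evicts on bus events or TTL expiry."""

    def __init__(self, bus, ttl_seconds=30.0, clock=time.monotonic):
        self._entries = {}  # key -> (expires_at, value)
        self._ttl = ttl_seconds
        self._clock = clock
        bus.subscribe(self.evict)

    def put(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # TTL safety net for missed invalidations
            return None
        return value

    def evict(self, key):
        self._entries.pop(key, None)
```

An instance that never receives the event (for example, one attached to a different bus in a test) keeps serving the old value, but only until the TTL elapses. That bounded staleness is the reason for pairing the two mechanisms.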
Outcome
The L1 tier absorbed approximately 80% of cache requests, reducing EVCache (L2) traffic by 5x. Latency for lookups served from L1 dropped from about 1ms (network round trip) to about 0.01ms (in-process lookup); because the remaining 20% of lookups still paid the L2 round trip, the blended average works out to roughly 0.2ms. Page load times improved by 15-25ms on average. The L2 cluster could be scaled down by 40% because it no longer needed to handle the full request volume. The 30-second L1 TTL was short enough that stale data had minimal user impact -- personalization data that is 30 seconds old is effectively identical to current data for recommendation quality.
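The relationship between the hit ratio, the traffic reduction, and the per-lookup latency can be checked with a short calculation. Note that 0.01ms is the latency of an L1 hit; the average across all lookups also includes the share that falls through to L2. The sketch assumes every L1 miss is served by L2 at about 1ms and ignores the negligible cost of the failed L1 probe; the function names are made up for this example.

```python
def blended_latency_ms(l1_hit_ratio, l1_latency_ms, l2_latency_ms):
    """Expected latency per lookup when L1 misses fall through to L2."""
    return l1_hit_ratio * l1_latency_ms + (1.0 - l1_hit_ratio) * l2_latency_ms


def l2_traffic_reduction(l1_hit_ratio):
    """Factor by which L2 request volume shrinks once L1 absorbs its share."""
    return 1.0 / (1.0 - l1_hit_ratio)


# Figures from the case study: 80% L1 hit ratio, 0.01 ms L1, 1 ms L2.
print(round(blended_latency_ms(0.8, 0.01, 1.0), 3))  # approximately 0.208
print(round(l2_traffic_reduction(0.8), 2))           # approximately 5.0
```

The 5x traffic reduction follows directly from the 80% hit ratio, while the per-lookup average remains dominated by the 20% of requests that still cross the network.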
1. What is the primary advantage of L1 (in-process) caching over L2 (distributed) caching?
2. Why should L1 caches use shorter TTLs than L2 caches in a tiered architecture?
3. In a tiered caching system, what happens when data is found in L2 but not in L1?