What is important about Tiered Caching (L1/L2/CDN) regarding "L1 (in-process): Caffeine, Guava, or in-memory maps. Microse..."?

L1 (in-process): Caffeine, Guava, or in-memory maps. Microsecond latency (~0.1-1 us). Small capacity (100MB-2GB). Local to each application instance. Best for extremely hot data accessed thousands of times per second per instance.

What is important about Tiered Caching (L1/L2/CDN) regarding "L2 (distributed): Redis, Memcached. Millisecond latency (~0...."?

L2 (distributed): Redis, Memcached. Millisecond latency (~0.5-2ms). Large capacity (terabytes across a cluster). Shared across all instances. The workhorse tier for most caching needs.

What is important about Tiered Caching (L1/L2/CDN) regarding "L3 (CDN): Cloudflare, Akamai, CloudFront. Global reach. Mass..."?

L3 (CDN): Cloudflare, Akamai, CloudFront. Global reach. Massive capacity. Latency depends on geographic distance (1-50ms to nearest edge). Best for static assets, public API responses, and content that does not require authentication.

What is important about Tiered Caching (L1/L2/CDN) regarding "Request flow cascades: L1 -> L2 -> L3 -> origin. Data found ..."?

Request flow cascades: L1 -> L2 -> L3 -> origin. Data found at any tier is returned and optionally promoted to higher tiers. A miss at all tiers triggers an origin fetch and populates all tiers.

What is important about Tiered Caching (L1/L2/CDN) regarding "Cache coherence across tiers uses TTL cascading (L1: 30s, L2..."?

Cache coherence across tiers uses TTL cascading (L1: 30s, L2: 300s, L3: 3600s) and/or event-driven invalidation (pub/sub). Shorter TTLs at lower tiers ensure L1 data is refreshed more frequently, bounding inconsistency.

What is important about Tiered Caching (L1/L2/CDN) regarding "Add tiers incrementally based on measured need. Most systems..."?

Add tiers incrementally based on measured need. Most systems need only L2 (Redis). Add L1 when Redis round trips are a bottleneck. Add CDN when global distribution is required. Each tier adds complexity that must be justified by performance data.

Vetora

🏗️Caching

Tiered Caching (L1/L2/CDN)

Tiered caching uses multiple cache layers with different latency, capacity, and cost characteristics. L1 (in-process) provides microsecond access, L2 (distributed) provides millisecond access at scale, and L3 (CDN) provides global reach. Requests flow through tiers until data is found or the origin is reached.

Overview

Tiered caching, also called multi-layer or hierarchical caching, arranges multiple cache layers in a hierarchy where each tier offers different trade-offs between latency, capacity, and scope. The closest tier (L1, in-process) provides the fastest access but has the smallest capacity and is local to a single application instance. The next tier (L2, distributed) is shared across all instances and offers much larger capacity at slightly higher latency. The outermost tier (L3, CDN) provides global reach with massive capacity but is limited to content that can be served without application logic.

L1 caches run in the application process itself -- libraries like Caffeine (JVM), Guava Cache (JVM), or simple in-memory maps. Access latency is measured in microseconds (0.1-1 us) because there is no network round trip. However, L1 caches are limited by the application's heap size (typically 100MB-2GB) and are not shared across instances. Each application server has its own L1 cache, which means the same data may be cached N times across N servers, and cache coherence is harder to maintain. L2 caches are distributed caching systems like Redis or Memcached that sit on dedicated servers accessible over the network. Access latency is 0.5-2ms (network round trip), but capacity can be terabytes across a cluster. L2 caches are shared across all application instances, providing a single source of cached truth.

The request flow in a tiered architecture is straightforward: check L1 first, then L2, then L3 (if applicable), and finally the origin (database or API). When data is found at any tier, it is returned to the caller and optionally promoted to higher tiers. For example, an L2 hit might populate the L1 cache so subsequent requests from the same instance get microsecond access. Cache coherence across tiers is the primary challenge. When data changes, all tiers must be invalidated or updated. Common approaches include pub/sub invalidation (Redis pub/sub or Kafka events notify all L1 caches to evict a key), TTL cascading (L1 uses shorter TTLs than L2, which uses shorter TTLs than L3), and write-through at each tier.

Not every system needs tiered caching. A single Redis cache (L2 only) is sufficient for most applications. Add L1 when network round trips to Redis dominate your latency budget and you can tolerate brief inconsistency between instances. Add CDN (L3) when you serve content globally and the origin cannot handle the aggregate request volume from all edge locations. Each tier adds operational complexity: L1 requires coherence management, L2 requires cluster operations, and CDN requires cache purge workflows. Start with one tier and add layers only when measured performance data justifies the complexity.

Key Points

1L1 (in-process): Caffeine, Guava, or in-memory maps. Microsecond latency (~0.1-1 us). Small capacity (100MB-2GB). Local to each application instance. Best for extremely hot data accessed thousands of times per second per instance.
2L2 (distributed): Redis, Memcached. Millisecond latency (~0.5-2ms). Large capacity (terabytes across a cluster). Shared across all instances. The workhorse tier for most caching needs.
3L3 (CDN): Cloudflare, Akamai, CloudFront. Global reach. Massive capacity. Latency depends on geographic distance (1-50ms to nearest edge). Best for static assets, public API responses, and content that does not require authentication.
4Request flow cascades: L1 -> L2 -> L3 -> origin. Data found at any tier is returned and optionally promoted to higher tiers. A miss at all tiers triggers an origin fetch and populates all tiers.
5Cache coherence across tiers uses TTL cascading (L1: 30s, L2: 300s, L3: 3600s) and/or event-driven invalidation (pub/sub). Shorter TTLs at lower tiers ensure L1 data is refreshed more frequently, bounding inconsistency.
6Add tiers incrementally based on measured need. Most systems need only L2 (Redis). Add L1 when Redis round trips are a bottleneck. Add CDN when global distribution is required. Each tier adds complexity that must be justified by performance data.

Simple Example

The Office Supply Closet

Think of caching tiers like office supplies. Your desk drawer (L1) has a few pens and sticky notes -- instant access but limited space. The supply closet down the hall (L2) has more of everything -- a quick walk to get what you need. The warehouse across town (L3/CDN) has every supply the company stocks -- takes a delivery truck to get things, but capacity is enormous. The factory (origin/database) can make anything but takes the longest. You check your desk first, then the closet, then order from the warehouse, and only call the factory as a last resort. Most days, your desk and the closet have everything you need.

Real-World Examples

Netflix (EVCache)

Netflix uses EVCache, a tiered caching system built on top of Memcached. L1 is an in-process cache within each microservice instance, holding the hottest data (subscriber profiles, active viewing sessions) with a 30-second TTL. L2 is a distributed Memcached cluster (EVCache) shared across all instances in a region, with 5-minute TTLs. Netflix replicates EVCache data across AWS regions for disaster recovery. The L1 tier handles approximately 80% of cache requests without a network round trip.

Facebook (TAO)

Facebook's TAO (The Associations and Objects) cache is a tier-aware caching system for the social graph. The leader cache tier (close to the database) handles writes and serves as the source of truth for cache data. Follower cache tiers (distributed globally) serve reads from cached data, with invalidation propagated from leaders to followers via a dedicated invalidation pipeline. This tiered approach allows Facebook to serve trillions of social graph queries per day from cache, with only a small fraction reaching the MySQL databases.

Cloudflare (Tiered Caching)

Cloudflare's tiered caching adds a regional 'upper tier' between its edge data centers (200+ cities) and the customer's origin server. Instead of each edge independently fetching from the origin on a cache miss, misses are routed to a regional tier first. If the regional tier has the content, the edge is served without touching the origin. This reduces origin requests by 90%+ for popular content. Cloudflare also supports Argo Tiered Caching, which adds additional tiers for customers needing maximum origin offload.

Trade-Offs

Aspect	Description
Latency Reduction vs Coherence Complexity	Each additional cache tier reduces average latency (L1 hits avoid network entirely), but maintaining coherence across tiers requires invalidation mechanisms (pub/sub, event streams) or TTL cascading. A bug in coherence logic can cause persistent stale data that is difficult to diagnose because it appears correct in the database and L2 but wrong in specific L1 instances.
Memory Efficiency vs Duplication	L1 caches are per-instance, so the same data is duplicated across all application instances. With 50 instances, the top 1,000 entries are stored 50 times in L1 caches plus once in L2. This duplication is the cost of microsecond L1 access. Limiting L1 to only the hottest data (top 1-5% of keys by request rate) minimizes waste.
Operational Complexity vs Performance Benefit	Each tier requires separate monitoring, alerting, capacity planning, and failure handling. L1 failures are invisible (handled by L2 fallback), but L2 cluster operations (scaling, failover, upgrades) affect all application instances. CDN cache purge workflows add deployment complexity. The performance benefit must justify the operational burden for each tier.
Cold Start Impact vs Steady-State Performance	Multi-tier caching amplifies cold start effects. After a deployment or restart, L1 caches are empty across all instances, pushing all requests to L2. If L2 was also cleared (cluster upgrade), everything hits the origin. Cache warming at each tier mitigates this but adds startup time and complexity. Steady-state performance is excellent, but transitions are vulnerable.

Case Study

Netflix EVCache: Two-Tier Caching for 200+ Million Subscribers

Scenario

Netflix serves over 200 million subscribers globally, with each user session requiring dozens of cache lookups per second for personalization, recommendations, and content metadata. A single distributed cache tier (Memcached) handled the load but added 0.5-2ms of network latency per lookup. For a page load requiring 50 cache lookups, the aggregate cache latency was 25-100ms -- a significant fraction of the total response time.

Solution

Netflix added an L1 in-process cache within each microservice instance, creating a two-tier architecture. The L1 cache (backed by a simple ConcurrentHashMap with size-based eviction) stored the hottest 10,000 entries per instance with a 30-second TTL. L1 hits (microsecond latency) eliminated the network round trip to EVCache (L2). Cache invalidation used a pub/sub mechanism: when data changed in the backend, an invalidation event was published to all L1 caches via a lightweight event bus. The short L1 TTL ensured that even if an invalidation was missed, stale data would expire within 30 seconds.

Outcome

The L1 tier absorbed approximately 80% of cache requests, reducing EVCache (L2) traffic by 5x. The average cache access latency dropped from 1ms (network round trip) to 0.01ms (in-process lookup). Page load times improved by 15-25ms on average. The L2 cluster could be scaled down by 40% because it no longer needed to handle the full request volume. The 30-second L1 TTL was short enough that stale data had minimal user impact -- personalization data that is 30 seconds old is effectively identical to current data for recommendation quality.

Common Mistakes

⚠Adding tiers prematurely. A single Redis cluster handles millions of requests per second. Adding L1 caching before Redis becomes a bottleneck introduces coherence complexity without meaningful benefit. Profile first: if Redis round trips are less than 10% of your request latency, L1 is not needed yet.
⚠Using the same TTL across all tiers. L1 should have shorter TTLs (10-60 seconds) than L2 (1-15 minutes) to bound staleness in per-instance caches. If L1 and L2 have the same 5-minute TTL, L1 can serve data that is 5 minutes stale even though L2 was refreshed 4 minutes ago.
⚠Not implementing invalidation for L1 caches. Relying solely on TTL for L1 coherence means data can be stale for the full TTL duration. For user-facing mutations (profile updates, order status changes), add event-driven L1 invalidation to provide immediate consistency.
⚠Caching too much at the CDN layer. CDN caching is powerful but difficult to invalidate (cache purge delays, geographic propagation). Only cache content at the CDN that changes infrequently (static assets, public pages) or that can tolerate staleness (product listings, blog posts). User-specific or frequently changing data should not reach the CDN layer.

Related Concepts

Cache-Aside (Lazy Loading)Eviction Policies TTL Strategies Where to Cache Latency Numbers Every Engineer Should Know

See Tiered Caching (L1/L2/CDN) in action

Explore system design templates that use tiered caching (l1/l2/cdn) and run traffic simulations to see how these concepts perform under real load.

Browse Templates

Simulate L1/L2 tiered caching hit ratios

Metrics to watch

l1_hit_ratiol2_hit_ratioorigin_load_pctp99_latency_ms

Run Simulation

Test Your Understanding

1What is the primary advantage of L1 (in-process) caching over L2 (distributed) caching?

2Why should L1 caches use shorter TTLs than L2 caches in a tiered architecture?

3In a tiered caching system, what happens when data is found in L2 but not in L1?

Deeper Reading