Latency Numbers Every Engineer Should Know

A 2026-updated reference of the latency numbers every engineer should internalize -- from L1 cache hits to cross-continent round trips. These numbers form the foundation of back-of-envelope capacity estimation and inform every level of system design.

Overview

Jeff Dean's 'Latency Numbers Every Programmer Should Know' has been a cornerstone reference since he presented the list in talks around 2009. The specific numbers have changed as hardware has evolved, but the relative orders of magnitude remain remarkably stable. Understanding these numbers is essential for system design because they determine where bottlenecks occur, how much caching helps, and whether a proposed architecture can meet its latency budget.

At the hardware level (2026 numbers): an L1 cache reference takes approximately 1 nanosecond (ns), an L2 reference about 4ns, and an L3 reference about 12ns. Main memory (DDR5) access takes roughly 80-100ns. NVMe SSD random reads take 10-20 microseconds (us), while SATA SSD random reads are 50-100us. HDD seek time remains around 4 milliseconds (ms), roughly 200-400x slower than NVMe for random access. CXL-attached memory (emerging in 2025-2026) offers a new tier at roughly 200-400ns, bridging the gap between local DRAM and remote memory. RDMA (Remote Direct Memory Access) in data centers provides memory-to-memory transfers at 1-5us, bypassing the kernel network stack entirely.
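The hardware tiers above can be kept as a small lookup table for quick ratio checks. This is a minimal sketch: where the text gives a range, a single representative value is picked (an assumption, chosen to match the round figures used in the Key Points), and the `ratio` helper is a hypothetical name.

```python
# Representative latencies in nanoseconds. Single values are picked from the
# ranges in the text (an assumption made for easy arithmetic).
LATENCY_NS = {
    "L1 cache reference": 1,
    "L2 cache reference": 4,
    "L3 cache reference": 12,
    "Main memory (DDR5)": 100,
    "CXL-attached memory": 300,
    "RDMA transfer": 1_000,
    "NVMe SSD random read": 10_000,
    "SATA SSD random read": 50_000,
    "HDD seek": 4_000_000,
}


def ratio(slower: str, faster: str) -> float:
    """Return how many times slower one tier is than another."""
    return LATENCY_NS[slower] / LATENCY_NS[faster]


# Print every tier relative to an L1 cache reference.
for name, ns in LATENCY_NS.items():
    print(f"{name:<24} {ns:>10,} ns  ({ns / LATENCY_NS['L1 cache reference']:,.0f}x L1)")
```

For example, `ratio("HDD seek", "NVMe SSD random read")` gives the 400x gap quoted for the low end of the NVMe range.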

At the network level: intra-rack communication takes approximately 100-200us round trip. Intra-datacenter (cross-rack) round trips are about 300-500us. Cross-availability-zone round trips (within the same region) are 1-2ms. Cross-region round trips within a continent (US East to US West) are 40-70ms. Intercontinental round trips (US to Europe) are 80-120ms, and US to Asia-Pacific is 150-200ms. Over long distances, these latencies are dominated by the speed of light in fiber optic cable (~200,000 km/s), which sets a hard physical floor that no software optimization can break.
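The physical floor follows directly from distance divided by the speed of light in fiber. The sketch below uses the ~200,000 km/s figure from the text; the 4,000 km path length for US East to US West is a rough illustrative assumption, not a measured fiber route, and `min_rtt_ms` is a hypothetical helper name.

```python
FIBER_SPEED_KM_PER_S = 200_000  # speed of light in fiber, from the text


def min_rtt_ms(distance_km: float) -> float:
    """Lower bound on round-trip time over fiber, in milliseconds.

    Counts propagation delay only: no routing, queuing, or processing.
    """
    return 2 * distance_km * 1000 / FIBER_SPEED_KM_PER_S


# Assumed one-way path length for US East to US West (illustrative only).
US_EAST_TO_WEST_KM = 4_000
print(f"US East <-> US West floor: {min_rtt_ms(US_EAST_TO_WEST_KM):.0f} ms")
```

Real routes are longer than the straight-line distance and add switching delay, which is consistent with observed round trips sitting above this floor.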

The back-of-envelope calculation technique applies these numbers to estimate system performance before building anything. If your API endpoint makes 3 sequential database calls (each hitting NVMe SSD: ~0.2ms with query overhead), 1 cache lookup (Redis over network: ~0.5ms), and 1 call to an external service (cross-AZ: ~2ms), the total is approximately 3.1ms. If the latency budget is 50ms, you have significant headroom. If you need to make those 3 database calls to a cross-region replica instead, each adds ~50ms, blowing the budget at 153ms. This kind of rapid estimation separates senior engineers from those who build first and optimize later.
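The arithmetic in this example can be written out as a short script. It is a sketch using only the per-step figures quoted above; the step names, the `total_latency_ms` helper, and the list layout are illustrative assumptions.

```python
BUDGET_MS = 50.0


def total_latency_ms(steps):
    """Sum sequential steps given as (name, count, latency_ms) tuples."""
    return sum(count * latency_ms for _, count, latency_ms in steps)


same_region = [
    ("database call (NVMe-backed)", 3, 0.2),
    ("Redis cache lookup", 1, 0.5),
    ("external service (cross-AZ)", 1, 2.0),
]

# Same path, but each database call pays an extra ~50ms cross-region round trip.
cross_region = [
    ("database call (cross-region replica)", 3, 0.2 + 50.0),
    ("Redis cache lookup", 1, 0.5),
    ("external service (cross-AZ)", 1, 2.0),
]

for label, steps in (("same region", same_region), ("cross region", cross_region)):
    total = total_latency_ms(steps)
    verdict = "within" if total <= BUDGET_MS else "over"
    print(f"{label}: {total:.1f} ms ({verdict} the {BUDGET_MS:.0f} ms budget)")
```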

Key Points
  1. L1 cache (~1ns) to main memory (~100ns) is a 100x gap. L1 to NVMe SSD (~10us) is a 10,000x gap. NVMe SSD to HDD (~4ms) is a 400x gap. These order-of-magnitude differences determine where caching has the most impact.
  2. Network latency has a hard physical floor: the speed of light in fiber. US East to US West is ~40ms minimum, and no amount of software optimization can reduce it. System designs that require synchronous cross-region calls are constrained by physics.
  3. NVMe SSDs (10-20us random read) are roughly 5-10x faster than SATA SSDs (50-100us) and 400x faster than HDDs (4ms). This difference matters enormously for database storage engine choices and affects whether indexes fit in faster or slower storage tiers.
  4. Intra-datacenter round trips (~0.5ms) are 100x faster than cross-region round trips (~50ms). This is why co-locating services that communicate frequently in the same region is critical, and why cross-region synchronous replication adds significant latency.
  5. Serialization and deserialization overhead is often underestimated. Converting a 1KB JSON payload takes 5-50us depending on the parser. For high-throughput services processing millions of requests, switching to binary formats like Protocol Buffers (1-5us) can save significant CPU time.
  6. Back-of-envelope estimation means summing the latency of each step in a request path. If the total exceeds the latency budget, you need to reduce sequential steps (parallelize), move data closer (cache or replicate), or relax consistency requirements (read from local replicas).
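Serialization cost is easy to measure rather than guess. The sketch below times a JSON encode and decode round trip for a payload of roughly 1KB using only Python's standard `json` and `timeit` modules. The payload shape and the `json_roundtrip_us` helper are illustrative assumptions, and the result will vary by machine and parser.

```python
import json
import timeit


def json_roundtrip_us(payload: dict, repeats: int = 2_000) -> float:
    """Average microseconds for one json.dumps + json.loads round trip."""
    seconds = timeit.timeit(
        lambda: json.loads(json.dumps(payload)), number=repeats
    )
    return seconds / repeats * 1e6


# Hypothetical payload padded to roughly 1KB once encoded.
payload = {"user_id": 12345, "tags": ["a", "b", "c"], "body": "x" * 900}
size_bytes = len(json.dumps(payload).encode("utf-8"))

print(f"{size_bytes} bytes: {json_roundtrip_us(payload):.1f} us per round trip")
```

Running the same measurement against a binary format on the same payload is the fair way to decide whether a switch is worth it for a given service.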
Simple Example

The Speed of Everyday Actions Analogy

To internalize latency scales, map them to human timescales. If an L1 cache reference (1ns) took 1 second, then: an L2 reference would take 4 seconds, main memory access would take 1.5 minutes, an NVMe SSD read would take 3 hours, an intra-DC network call would take 6 days, a cross-AZ call would take 23 days, an HDD seek would take 46 days, and an intercontinental round trip would take 3.8 years. This scaling makes it viscerally clear why hitting an in-memory cache (Redis) instead of a spinning disk, or reading from a local replica instead of a cross-region primary, transforms the user experience.
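The conversion behind this analogy is a single scale factor: 1ns becomes 1 second. A minimal sketch follows; the `human_scale` helper is a hypothetical name, and the inputs are representative values assumed from the ranges given earlier (for instance 90ns for main memory and 120ms for an intercontinental round trip).

```python
def human_scale(latency_ns: float) -> str:
    """Render a latency as if 1 ns lasted 1 second."""
    seconds = latency_ns  # 1 ns maps to 1 s
    units = (
        ("years", 365 * 86_400),
        ("days", 86_400),
        ("hours", 3_600),
        ("minutes", 60),
    )
    for unit, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {unit}"
    return f"{seconds:.0f} seconds"


# Representative values in nanoseconds (assumed from the text's ranges).
examples = {
    "L2 reference": 4,
    "Main memory": 90,
    "NVMe SSD read": 10_000,
    "Intra-DC round trip": 500_000,
    "Cross-AZ round trip": 2_000_000,
    "HDD seek": 4_000_000,
    "Intercontinental round trip": 120_000_000,
}

for name, ns in examples.items():
    print(f"{name:<28} {human_scale(ns)}")
```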

Real-World Examples

Discord

Discord decomposes its latency budget for message delivery: 5ms for the API gateway to parse and route the message, 2ms for a Redis cache lookup of channel permissions, 5ms for a Cassandra write (NVMe-backed), and 3ms for fan-out to connected WebSocket sessions. These components sum to 15ms against a total target of under 50ms end-to-end. Each component was designed with specific latency numbers in mind: choosing Redis over a database for permissions (roughly 0.5ms vs 5ms per lookup) and NVMe-backed Cassandra over HDD-backed storage (5ms vs 20ms).

Cloudflare Workers

Cloudflare Workers execute at the edge, within 50ms of 95% of the world's internet-connected population. By running application code at 300+ edge locations instead of a central cloud region, they eliminate the 40-200ms cross-region latency that dominates traditional architectures. A Worker reading from Cloudflare's KV store at the edge adds 10-20ms, compared to 50-150ms for a cross-region database call. This edge-first approach is a direct application of latency-number awareness.

Google Search

Google's search results page must load in under 200ms total. This budget is decomposed across: query parsing (1ms), index lookup across distributed servers (10-30ms), ad auction (10-20ms), snippet generation (5-10ms), result serialization and rendering (5ms), and network round trip to the user (10-100ms depending on location). The index is kept entirely in memory (100ns per lookup vs 10us for SSD) across thousands of servers, because even the NVMe SSD latency would blow the budget at the query volume Google handles.
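When stages are quoted as ranges, the budget check needs both the best and the worst case. The sketch below sums the low and high ends of the stage figures quoted above; the tuple layout and variable names are illustrative assumptions.

```python
BUDGET_MS = 200

# (stage, low_ms, high_ms) using the figures quoted in the text.
stages = [
    ("query parsing", 1, 1),
    ("index lookup", 10, 30),
    ("ad auction", 10, 20),
    ("snippet generation", 5, 10),
    ("serialization and rendering", 5, 5),
    ("network round trip to user", 10, 100),
]

best_case_ms = sum(low for _, low, _ in stages)
worst_case_ms = sum(high for _, _, high in stages)

print(f"best case:  {best_case_ms} ms")
print(f"worst case: {worst_case_ms} ms")
print(f"headroom at worst case: {BUDGET_MS - worst_case_ms} ms")
```

Summing the high ends is a conservative check: if even that total fits, the budget holds whenever each stage stays inside its quoted range.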

Trade-Offs
Aspect | Description
Memory vs Storage Cost | Keeping data in RAM (100ns access) instead of NVMe SSD (10us) provides a 100x latency reduction but costs roughly 10-30x more per GB. The sweet spot depends on the dataset's access pattern: hot data in memory, warm data on NVMe, cold data on HDD or object storage. Tiered storage strategies optimize this cost-latency trade-off.
Caching vs Freshness | Caching moves data closer to the compute (L1 > L2 > Redis > database), dramatically reducing latency. But cached data can be stale. A Redis cache with a 60-second TTL means users may see data up to 60 seconds old. Shorter TTLs reduce staleness but increase cache misses and backend load.
Co-location vs Redundancy | Co-locating services in the same datacenter minimizes latency (~0.5ms) but creates a single point of geographic failure. Distributing across regions adds latency (40-200ms per cross-region call) but provides disaster recovery. The trade-off is between performance and resilience.
Synchronous vs Asynchronous Communication | Synchronous cross-service calls add their latency directly to the request path. Asynchronous messaging (via queues) decouples latency but introduces complexity in request-response patterns. If a user needs an immediate response, the synchronous path must fit within the latency budget; background processing can be asynchronous.
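The caching trade-off can be made concrete with a simple expected-latency model. This is a sketch under stated assumptions: every request pays the cache lookup, a miss additionally pays the backend call, and the 0.5ms and 5ms figures are the Redis and database numbers used elsewhere in this article. The `expected_latency_ms` helper is a hypothetical name.

```python
def expected_latency_ms(hit_ratio: float, cache_ms: float, backend_ms: float) -> float:
    """Average latency when misses fall through from the cache to the backend."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Every request pays the cache lookup; misses also pay the backend call.
    return cache_ms + (1.0 - hit_ratio) * backend_ms


# Shorter TTLs tend to lower the hit ratio, which raises the average.
for hit_ratio in (0.99, 0.9, 0.5):
    avg = expected_latency_ms(hit_ratio, cache_ms=0.5, backend_ms=5.0)
    print(f"hit ratio {hit_ratio:.2f}: {avg:.2f} ms average")
```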
Case Study

Amazon's Latency Budgeting for Product Pages

Scenario

Amazon discovered that every 100ms of added latency to product page load times reduced sales by 1%. A typical product page requires data from over 150 microservices: product details, pricing, inventory, recommendations, reviews, seller information, shipping estimates, and ad placements. Calling each service sequentially, even at 5ms per call, would take 750ms -- far exceeding the 200ms render budget.

Solution

Amazon implemented a latency budget framework where each service call has a strict timeout (typically 20-50ms). The product page orchestrator parallelizes independent service calls (recommendations, reviews, and ads fetch simultaneously), reducing the critical path from 150 sequential calls to approximately 5 sequential stages of parallel calls. Services that exceed their latency budget are gracefully degraded -- a slow recommendation service returns a cached result instead of blocking the page. Each service's latency is decomposed against the hardware numbers: Redis cache (0.5ms) for hot data, DynamoDB (2-5ms) for lookups, and pre-computed data for anything that would require cross-service joins.
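The pattern described here, parallel calls with a per-call timeout and a cached fallback, can be sketched with Python's `asyncio`. This is an illustration of the general technique, not Amazon's implementation: `fake_service`, `call_with_budget`, the service names, and the delays are all hypothetical stand-ins.

```python
import asyncio


async def call_with_budget(coro, timeout_s: float, fallback):
    """Await a call, returning a fallback if it exceeds its latency budget."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def fake_service(name: str, delay_s: float) -> str:
    """Stand-in for a remote call that takes delay_s seconds."""
    await asyncio.sleep(delay_s)
    return f"{name}: fresh"


async def render_page() -> list:
    # Independent calls run concurrently, so the stage costs roughly the
    # slowest call (capped by the timeout) instead of the sum of all calls.
    return await asyncio.gather(
        call_with_budget(fake_service("reviews", 0.001), 0.05, "reviews: cached"),
        call_with_budget(fake_service("ads", 0.001), 0.05, "ads: cached"),
        call_with_budget(
            fake_service("recommendations", 0.5), 0.05, "recommendations: cached"
        ),
    )


results = asyncio.run(render_page())
print(results)
```

The slow recommendations call is cut off at its 50ms budget and replaced by the cached value, so it cannot block the rest of the page.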

Outcome

Amazon's product pages consistently render in under 200ms for the median user, with p99 under 500ms. The latency budget framework made latency a first-class design constraint: every new service integration must declare its expected latency and fallback behavior. Services that introduce more than 10ms of critical-path latency require architectural review. This discipline, rooted in understanding latency numbers, directly translates to billions of dollars in revenue retention.

Common Mistakes
  • Memorizing exact numbers instead of orders of magnitude. The specific numbers change as hardware evolves, but the relative gaps (memory is 100x faster than SSD, SSD is 400x faster than HDD, intra-DC is 100x faster than cross-region) remain stable. Focus on the gaps, not the digits.
  • Ignoring tail latency. The p50 latency for an NVMe SSD read might be 10us, but the p99.9 can spike to 1ms during garbage collection or write amplification. Systems that depend on consistent low-latency storage must account for these tail events, especially in fan-out architectures where you wait for the slowest of N calls.
  • Underestimating serialization costs. JSON parsing a 10KB payload takes 50-200us -- comparable to multiple SSD reads. For high-throughput services, serialization format (JSON vs Protobuf vs FlatBuffers) can be a significant portion of total request latency.
  • Assuming cloud latency equals physical latency. Cloud provider networks add overhead for virtualization, security groups, NAT gateways, and load balancers. A cross-AZ call that should take 1ms based on physical distance may take 2-3ms after cloud networking overhead. Always measure, do not assume.
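The fan-out effect on tail latency can be quantified with a short calculation. Assuming each of N parallel calls independently has the same small chance of hitting a slow tail event (independence is a simplifying assumption), the chance that a request waits on at least one slow call is 1 - (1 - p)^N. The `prob_any_slow` helper is a hypothetical name.

```python
def prob_any_slow(fanout: int, slow_fraction: float) -> float:
    """Probability that at least one of `fanout` independent calls is slow."""
    return 1 - (1 - slow_fraction) ** fanout


# A p99.9 tail event affects 0.1% of individual calls.
P999_FRACTION = 0.001

for fanout in (1, 10, 100, 1000):
    chance = prob_any_slow(fanout, P999_FRACTION)
    print(f"fan-out {fanout:>4}: {chance:.1%} of requests hit a tail event")
```

At a fan-out of 100, roughly one request in ten waits on a p99.9 event, which is why tail latency rather than the median often drives the budget in fan-out architectures.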
Test Your Understanding

1. Approximately how much faster is an NVMe SSD random read compared to an HDD seek in 2026?

2. What sets the hard physical floor for cross-region network latency?

3. A system design requires 5 sequential cross-region calls (US East to US West, ~50ms each). What is the approximate total latency contribution?
