A 2026-updated reference of the latency numbers every engineer should internalize -- from L1 cache hits to cross-continent round trips. These numbers form the foundation of back-of-envelope capacity estimation and inform every level of system design.
Jeff Dean's 'Latency Numbers Every Programmer Should Know' has been a cornerstone reference since his 2009 Stanford talk. The specific numbers have changed as hardware has evolved, but the relative orders of magnitude remain remarkably stable. Understanding these numbers is essential for system design because they determine where bottlenecks occur, how much caching helps, and whether a proposed architecture can meet its latency budget.
At the hardware level (2026 numbers): an L1 cache reference takes approximately 1 nanosecond (ns), an L2 reference about 4ns, and an L3 reference about 12ns. Main memory (DDR5) access takes roughly 80-100ns. NVMe SSD random reads take 10-20 microseconds (us), while SATA SSD random reads are 50-100us. HDD seek time remains around 4 milliseconds (ms), roughly 200-400x slower than NVMe for random access. CXL-attached memory (emerging in 2025-2026) offers a new tier at roughly 200-400ns, bridging the gap between local DRAM and remote memory. RDMA (Remote Direct Memory Access) in data centers provides memory-to-memory transfers at 1-5us, bypassing the kernel network stack entirely.
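As a quick sanity check on the ratios, the figures above can be put in a small lookup table. This is only a sketch: the dictionary name `LATENCY_NS` and the helper `slowdown` are illustrative choices, and the values are the approximate ranges quoted in the paragraph, not measurements.

```python
# Approximate latencies from the text, stored as (low, high) in nanoseconds.
US = 1_000          # one microsecond in ns
MS = 1_000_000      # one millisecond in ns

LATENCY_NS = {
    "L1 cache": (1, 1),
    "L2 cache": (4, 4),
    "L3 cache": (12, 12),
    "DDR5 main memory": (80, 100),
    "CXL memory": (200, 400),
    "RDMA transfer": (1 * US, 5 * US),
    "NVMe random read": (10 * US, 20 * US),
    "SATA SSD random read": (50 * US, 100 * US),
    "HDD seek": (4 * MS, 4 * MS),
}


def slowdown(slow: str, fast: str) -> tuple[float, float]:
    """Return the (smallest, largest) ratio by which `slow` exceeds `fast`."""
    slow_lo, slow_hi = LATENCY_NS[slow]
    fast_lo, fast_hi = LATENCY_NS[fast]
    return slow_lo / fast_hi, slow_hi / fast_lo


lo, hi = slowdown("HDD seek", "NVMe random read")
print(f"HDD seek is {lo:.0f}x to {hi:.0f}x slower than an NVMe random read")
```

The HDD-to-NVMe ratio depends on which end of the 10-20us NVMe range is used: about 400x against a 10us read and about 200x against a 20us read.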
At the network level: intra-rack communication takes approximately 100-200us round trip. Intra-datacenter (cross-rack) round trips are about 300-500us. Cross-availability-zone round trips (within the same region) are 1-2ms. Cross-region round trips within a continent (US East to US West) are 40-70ms. Intercontinental round trips (US to Europe) are 80-120ms, and US to Asia-Pacific is 150-200ms. The cross-region and intercontinental figures are dominated by the speed of light in fiber optic cable (~200,000 km/s), which sets a hard physical floor that no software optimization can break. The shorter hops cover too little distance for propagation delay to matter; there, switching and software overhead make up most of the time.
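The physical floor can be estimated directly from the fiber speed quoted above. In this sketch the function name `min_rtt_ms` is illustrative, and the path lengths are rough assumptions chosen for the example; real fiber routes are longer than straight-line distances.

```python
FIBER_KM_PER_S = 200_000  # speed of light in fiber, from the text


def min_rtt_ms(path_km: float) -> float:
    """Lower bound on round-trip time over a fiber path, in milliseconds."""
    one_way_s = path_km / FIBER_KM_PER_S
    return 2 * one_way_s * 1_000


# Assumed path lengths (illustrative only, not surveyed cable routes).
ASSUMED_PATHS_KM = {
    "US East to US West": 4_000,
    "US to Europe": 6_500,
}

for route, km in ASSUMED_PATHS_KM.items():
    print(f"{route}: at least {min_rtt_ms(km):.0f}ms round trip")
```

Under these assumed distances the floor comes out at about 40ms and 65ms. Observed round trips sit above the floor because of routing detours, switching, and queuing.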
The back-of-envelope calculation technique applies these numbers to estimate system performance before building anything. If your API endpoint makes 3 sequential database calls (each hitting NVMe SSD: ~0.2ms with query overhead), 1 cache lookup (Redis over network: ~0.5ms), and 1 call to an external service (cross-AZ: ~2ms), the total is approximately 3.1ms. If the latency budget is 50ms, you have significant headroom. If you need to make those 3 database calls to a cross-region replica instead, each adds ~50ms, blowing the budget at 153ms. This kind of rapid estimation separates senior engineers from those who build first and optimize later.
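The arithmetic in this example is simple enough to script. The sketch below reuses the per-call figures from the paragraph; the function name `request_latency_ms` and its parameter names are illustrative.

```python
BUDGET_MS = 50.0


def request_latency_ms(db_call_ms: float,
                       db_calls: int = 3,
                       cache_ms: float = 0.5,
                       external_ms: float = 2.0) -> float:
    """Sum the latency of sequential calls on the request path."""
    return db_calls * db_call_ms + cache_ms + external_ms


local = request_latency_ms(db_call_ms=0.2)          # NVMe-backed database
cross_region = request_latency_ms(db_call_ms=50.0)  # cross-region replica

for label, total in [("local", local), ("cross-region", cross_region)]:
    verdict = "within" if total <= BUDGET_MS else "over"
    print(f"{label}: {total:.1f}ms ({verdict} the {BUDGET_MS:.0f}ms budget)")
```

Because the calls are sequential, their latencies add. Moving just the three database calls across regions takes the total from about 3ms to roughly 150ms, three times the budget.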
The Speed of Everyday Actions Analogy
To internalize latency scales, map them to human timescales. If an L1 cache reference (1ns) took 1 second, then: an L2 reference (4ns) would take 4 seconds, main memory access (~90ns) would take 1.5 minutes, an NVMe SSD read (10us) would take about 3 hours, an intra-DC network call (500us) would take 6 days, a cross-AZ call (2ms) would take 23 days, an HDD seek (4ms) would take 46 days, and a cross-continent round trip (120ms) would take 3.8 years. This scaling makes it viscerally clear why hitting an in-memory cache (Redis) instead of a spinning disk, or reading from a local replica instead of a cross-region primary, transforms the user experience.
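The analogy is a unit substitution: read each nanosecond as one second, then express the result in the largest convenient unit. A minimal sketch follows; the helper name `human_scale` is illustrative, and the inputs are the specific points within the quoted ranges that reproduce the figures in the text.

```python
def human_scale(latency_ns: float) -> str:
    """Rescale a latency so that 1ns reads as 1 second of human time."""
    seconds = latency_ns  # 1 ns maps to 1 s
    units = (
        ("years", 365.25 * 86_400),
        ("days", 86_400),
        ("hours", 3_600),
        ("minutes", 60),
    )
    for name, size in units:
        if seconds >= size:
            return f"{seconds / size:.1f} {name}"
    return f"{seconds:.0f} seconds"


EXAMPLES_NS = {
    "L2 reference": 4,
    "main memory": 90,
    "NVMe read": 10_000,
    "intra-DC call": 500_000,
    "cross-AZ call": 2_000_000,
    "HDD seek": 4_000_000,
    "cross-continent round trip": 120_000_000,
}

for name, ns in EXAMPLES_NS.items():
    print(f"{name}: {human_scale(ns)}")
```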
Discord
Discord decomposes its latency budget for message delivery: 5ms for the API gateway to parse and route the message, 2ms for a Redis cache lookup of channel permissions, 5ms for a Cassandra write (NVMe-backed), and 3ms for fan-out to connected WebSocket sessions. These four components sum to 15ms, well inside the end-to-end target of under 50ms. Each component was designed with specific latency numbers in mind -- choosing Redis over a database for permissions (0.5ms vs 5ms) and NVMe-backed Cassandra over HDD-backed storage (5ms vs 20ms).
Cloudflare Workers
Cloudflare Workers execute at the edge, within 50ms of 95% of the world's internet-connected population. By running application code at 300+ edge locations instead of a central cloud region, they eliminate the 40-200ms cross-region latency that dominates traditional architectures. A Worker reading from Cloudflare's KV store at the edge adds 10-20ms, compared to 50-150ms for a cross-region database call. This edge-first approach is a direct application of latency-number awareness.
Google Search
Google's search results page must load in under 200ms total. This budget is decomposed across: query parsing (1ms), index lookup across distributed servers (10-30ms), ad auction (10-20ms), snippet generation (5-10ms), result serialization and rendering (5ms), and network round trip to the user (10-100ms depending on location). The index is kept entirely in memory (100ns per lookup vs 10us for SSD) across thousands of servers, because even the NVMe SSD latency would blow the budget at the query volume Google handles.
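When each stage of a budget is given as a range, it helps to sum the best and worst cases separately. The sketch below does that for the stage figures quoted in this paragraph; the dictionary name `STAGES_MS` is illustrative.

```python
BUDGET_MS = 200

# (best case, worst case) per stage, in milliseconds, as quoted in the text.
STAGES_MS = {
    "query parsing": (1, 1),
    "index lookup": (10, 30),
    "ad auction": (10, 20),
    "snippet generation": (5, 10),
    "serialization and rendering": (5, 5),
    "network round trip": (10, 100),
}

best = sum(lo for lo, _ in STAGES_MS.values())
worst = sum(hi for _, hi in STAGES_MS.values())

print(f"best case: {best}ms, worst case: {worst}ms, budget: {BUDGET_MS}ms")
print(f"worst-case headroom: {BUDGET_MS - worst}ms")
```

With these figures the stages total 41ms to 166ms, so even the worst case leaves some headroom. The network round trip to the user is the largest single term.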
| Aspect | Description |
|---|---|
| Memory vs Storage Cost | Keeping data in RAM (100ns access) instead of NVMe SSD (10us) provides a 100x latency reduction but costs roughly 10-30x more per GB. The sweet spot depends on the dataset's access pattern: hot data in memory, warm data on NVMe, cold data on HDD or object storage. Tiered storage strategies optimize this cost-latency trade-off. |
| Caching vs Freshness | Caching moves data closer to the compute (L1 > L2 > Redis > database), dramatically reducing latency. But cached data can be stale. A Redis cache with a 60-second TTL means users may see data up to 60 seconds old. Shorter TTLs reduce staleness but increase cache misses and backend load. |
| Co-location vs Redundancy | Co-locating services in the same datacenter minimizes latency (~0.5ms) but creates a single point of geographic failure. Distributing across regions adds latency (40-200ms per cross-region call) but provides disaster recovery. The trade-off is between performance and resilience. |
| Synchronous vs Asynchronous Communication | Synchronous cross-service calls add their latency directly to the request path. Asynchronous messaging (via queues) decouples latency but introduces complexity in request-response patterns. If a user needs an immediate response, the synchronous path must fit within the latency budget; background processing can be asynchronous. |
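
The caching-versus-freshness row can be made concrete with a small time-to-live cache. This is a sketch under stated assumptions, not how Redis implements expiry: the class name `TTLCache`, the `loader` callback, and the injectable `clock` are all illustrative, and a plain dictionary stands in for the backing database.

```python
import time


class TTLCache:
    """Serve cached values until they are `ttl_seconds` old, then reload."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.loader = loader      # called on a miss to fetch from the backend
        self.clock = clock        # injectable so expiry can be simulated
        self.misses = 0
        self._entries = {}        # key -> (value, expires_at)

    def get(self, key):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]       # hit: possibly stale, but fast
        self.misses += 1          # miss: pay the backend latency
        value = self.loader(key)
        self._entries[key] = (value, now + self.ttl)
        return value


# Simulated clock and backend (hypothetical data for illustration).
fake_now = [0.0]
database = {"profile": "v1"}
cache = TTLCache(60, database.__getitem__, clock=lambda: fake_now[0])

first = cache.get("profile")      # miss, loads "v1"
database["profile"] = "v2"        # backend changes
fake_now[0] = 59.0
stale = cache.get("profile")      # still "v1": up to 60 seconds stale
fake_now[0] = 60.0
fresh = cache.get("profile")      # expired, reloads "v2"
print(first, stale, fresh, cache.misses)
```

A shorter TTL narrows the window in which `stale` can differ from the backend, at the cost of more misses and therefore more backend load.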
Amazon's Latency Budgeting for Product Pages
Scenario
Amazon discovered that every 100ms of added latency to product page load times reduced sales by 1%. A typical product page requires data from over 150 microservices: product details, pricing, inventory, recommendations, reviews, seller information, shipping estimates, and ad placements. Calling each service sequentially, even at 5ms per call, would take 750ms -- far exceeding the 200ms render budget.
Solution
Amazon implemented a latency budget framework where each service call has a strict timeout (typically 20-50ms). The product page orchestrator parallelizes independent service calls (recommendations, reviews, and ads fetch simultaneously), reducing the critical path from 150 sequential calls to approximately 5 sequential stages of parallel calls. Services that exceed their latency budget are gracefully degraded -- a slow recommendation service returns a cached result instead of blocking the page. Each service's latency is decomposed against the hardware numbers: Redis cache (0.5ms) for hot data, DynamoDB (2-5ms) for lookups, and pre-computed data for anything that would require cross-service joins.
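The pattern described here, parallel calls that each carry a timeout and a fallback, can be sketched with `asyncio`. This is an illustration of the general technique rather than Amazon's implementation: `fake_service`, `call_with_budget`, the delays, and the fallback strings are all hypothetical.

```python
import asyncio


async def fake_service(name: str, delay_s: float) -> str:
    """Stand-in for a downstream service call (hypothetical)."""
    await asyncio.sleep(delay_s)
    return f"{name}: fresh"


async def call_with_budget(coro, timeout_s: float, fallback: str) -> str:
    """Await `coro`, but return `fallback` if it exceeds its latency budget."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return fallback


async def render_page() -> list[str]:
    # Independent calls run concurrently, so this stage costs roughly the
    # slowest call (capped by the timeout), not the sum of all three.
    return await asyncio.gather(
        call_with_budget(fake_service("reviews", 0.001), 0.05, "reviews: cached"),
        call_with_budget(fake_service("ads", 0.002), 0.05, "ads: cached"),
        call_with_budget(
            fake_service("recommendations", 0.5), 0.05, "recommendations: cached"
        ),
    )


if __name__ == "__main__":
    print(asyncio.run(render_page()))
```

The slow recommendations call is cancelled at its 50ms budget and replaced by the fallback, so the page stage completes in about 50ms instead of waiting 500ms.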
Outcome
Amazon's product pages consistently render in under 200ms for the median user, with p99 under 500ms. The latency budget framework made latency a first-class design constraint: every new service integration must declare its expected latency and fallback behavior. Services that introduce more than 10ms of critical-path latency require architectural review. This discipline, rooted in understanding latency numbers, directly translates to billions of dollars in revenue retention.