Latency is the time it takes for a single request to travel from the client to the server and back, while throughput is the number of requests a system can handle per unit of time. These two metrics are fundamentally linked but often in tension -- optimizing for one frequently comes at the cost of the other.
Latency and throughput are the two most fundamental performance metrics in system design. Latency measures the elapsed time for a single operation -- from the moment a client sends a request to the moment it receives a response. Throughput measures how many operations the system completes per unit of time, typically expressed as requests per second (RPS) or queries per second (QPS). Every system design interview and every production system tuning effort ultimately revolves around these two numbers.
The relationship between latency and throughput is governed by Little's Law: L = lambda * W, where L is the average number of concurrent requests in the system, lambda is the throughput (arrival rate), and W is the average latency (time in system). For a fixed concurrency level, throughput is therefore inversely proportional to latency: halving latency doubles the throughput that the same number of in-flight requests can sustain. In practice, latency is not constant under load: as throughput approaches system capacity, requests queue and latency rises non-linearly. Little's Law still holds in that regime, but L grows sharply because W does.
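A minimal sketch of both ideas in Python. The first function is Little's Law as stated above. The second uses the textbook M/M/1 queue (Poisson arrivals, exponential service times, a single server) as an assumed model for the queueing effect; that model, the function names, and the 10ms service time are illustrative choices, not something the text specifies.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """Little's Law: L = lambda * W."""
    return throughput_rps * latency_s


def mm1_latency(service_time_s: float, utilization: float) -> float:
    """Mean time in system for an M/M/1 queue: W = S / (1 - rho).

    Assumption: the M/M/1 model is used only to illustrate the shape
    of the curve; real systems rarely match it exactly.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time_s / (1.0 - utilization)


if __name__ == "__main__":
    # 2,000 RPS at 50ms average latency
    print(f"{in_flight(2000, 0.050):.0f} requests in flight")  # 100 requests in flight

    # Hypothetical 10ms service time at increasing utilization
    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"{rho:.0%} utilization -> {mm1_latency(0.010, rho) * 1000:.0f} ms")
```

Under this model, going from 50% to 90% utilization multiplies mean latency by five, and the last few percent before saturation cost far more than that.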
Batching is the classic technique for trading latency for throughput. Instead of processing each request individually, the system waits to accumulate a batch, then processes them together. Kafka producers batch messages before sending; database write-ahead logs batch commits; GPU-based ML inference batches multiple inputs. Each individual request waits longer (higher latency), but the system processes more requests per second (higher throughput) because it amortizes fixed costs like network round trips, disk seeks, and context switches.
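The amortization argument can be sketched with a simple cost model. Everything here is an assumption for illustration: each batch pays one fixed cost plus a per-item cost, requests arrive at a steady rate, and queueing behind earlier batches is ignored.

```python
def batch_stats(batch_size: int, fixed_cost_s: float,
                per_item_cost_s: float, arrival_rate_rps: float):
    """Return (capacity in requests/sec, average latency in seconds).

    Hypothetical model: processing a batch costs
    fixed_cost_s + batch_size * per_item_cost_s.
    """
    batch_time = fixed_cost_s + batch_size * per_item_cost_s
    capacity = batch_size / batch_time
    # With steady arrivals, the first item waits for (n - 1) more arrivals
    # and the last waits for none, so the average fill wait is half of that.
    fill_wait = (batch_size - 1) / (2 * arrival_rate_rps)
    return capacity, fill_wait + batch_time


if __name__ == "__main__":
    # Assumed costs: 1ms fixed (e.g. a round trip), 10us per item, 900 RPS arriving
    for n in (1, 10, 100):
        capacity, latency = batch_stats(n, 0.001, 0.00001, 900)
        print(f"batch={n:>3}  capacity={capacity:>8.0f} rps  latency={latency * 1000:.2f} ms")
```

Larger batches raise capacity because the fixed cost is paid once per batch, while average latency grows with the time spent waiting for the batch to fill.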
The inverse trade-off -- sacrificing throughput for latency -- appears in caching, pre-computation, and over-provisioning. A cache hit returns in microseconds instead of milliseconds, dramatically reducing latency, but the cache itself consumes memory and CPU that could otherwise serve more requests. Over-provisioning servers to keep utilization low prevents queueing delays (low latency) but wastes capacity (lower throughput per dollar). Understanding when each trade-off is appropriate is a core system design skill.
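How much a cache helps depends on its hit ratio, which a weighted average makes concrete. The latencies below (100 microseconds for a hit, 20ms for a miss) are assumed values chosen only to match the "microseconds instead of milliseconds" framing above.

```python
def effective_latency(hit_ratio: float, hit_latency_s: float,
                      miss_latency_s: float) -> float:
    """Average latency as a hit-ratio-weighted mix of hit and miss latency."""
    if not 0 <= hit_ratio <= 1:
        raise ValueError("hit_ratio must be in [0, 1]")
    return hit_ratio * hit_latency_s + (1 - hit_ratio) * miss_latency_s


if __name__ == "__main__":
    for ratio in (0.10, 0.50, 0.90, 0.99):
        avg_ms = effective_latency(ratio, 0.0001, 0.020) * 1000
        print(f"hit ratio {ratio:.0%} -> average {avg_ms:.2f} ms")
```

With these assumed numbers, a 10% hit ratio barely moves the average, while a 99% hit ratio brings it well under a millisecond.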
The Highway Analogy
Think of a highway between two cities. Latency is how long it takes one car to drive from City A to City B -- say 2 hours. Throughput is how many cars arrive at City B per hour -- say 1000 cars/hour. You can increase throughput by adding more lanes (parallelism) or making cars carpool (batching), but neither changes the drive time for an individual car. You can reduce latency by building a shorter highway (CDN/edge), but that does not change how many cars the original highway can handle. The only way to improve both simultaneously is a fundamental upgrade -- a faster highway with more lanes.
Amazon
Amazon famously found that every 100ms of added latency cost them 1% in sales. This drove investment in CDNs, edge caching, and precomputation. Simultaneously, their backend services handle millions of RPS by batching DynamoDB writes and using SQS to decouple throughput-sensitive pipelines from latency-sensitive request paths.
Kafka
Apache Kafka achieves high throughput (millions of messages/sec per broker) by batching producer writes (linger.ms + batch.size), using sequential disk I/O, and zero-copy transfers. The trade-off is explicit: increasing linger.ms from 0 to 5ms can double throughput but adds up to 5ms to end-to-end latency, because a batch is sent as soon as it reaches batch.size or linger.ms expires, whichever comes first. Users tune this based on whether they need real-time processing or high-volume ingestion.
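The linger behavior can be approximated with a small simulation. This is not Kafka's producer code or API: `batch_by_linger` is a hypothetical helper, and `max_records` is a simplified stand-in for batch.size, which Kafka measures in bytes rather than records.

```python
def batch_by_linger(arrivals_ms, linger_ms, max_records):
    """Group sorted arrival times into batches.

    A batch is sent when it holds max_records records or when linger_ms
    has elapsed since its first record, whichever comes first.
    Returns a list of (send_time_ms, [arrival times in the batch]).
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # The open batch's linger deadline passed before this record arrived.
        if current and t > current[0] + linger_ms:
            batches.append((current[0] + linger_ms, current))
            current = []
        current.append(t)
        if len(current) >= max_records:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + linger_ms, current))
    return batches


if __name__ == "__main__":
    arrivals = list(range(10))  # one record per millisecond
    for linger in (0, 5):
        batches = batch_by_linger(arrivals, linger, max_records=100)
        worst = max(send - records[0] for send, records in batches)
        print(f"linger={linger}ms -> {len(batches)} sends, worst added delay {worst}ms")
```

Fewer, larger sends are where the throughput gain comes from; the worst-case added delay never exceeds the linger setting.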
Google Search
Google serves search results in under 200ms despite querying thousands of index shards. They use aggressive parallelism (fan-out to all shards simultaneously), hedged requests (sending duplicate requests to reduce tail latency), and result caching. The throughput is enormous -- processing over 100,000 queries per second -- achieved through massive horizontal scaling.
| Aspect | Description |
|---|---|
| Batching: Throughput vs Latency | Batching amortizes fixed costs (network round trips, disk flushes) across many requests, increasing throughput. But each request must wait for the batch to fill or a timeout to expire, adding latency. Kafka's linger.ms, database group commit, and Nagle's algorithm all make this trade-off explicitly configurable. |
| Caching: Latency vs Consistency | Caches dramatically reduce latency by serving pre-computed results, but they consume memory that could serve more concurrent requests (reducing throughput per server) and introduce staleness. The cache hit ratio determines whether the trade-off is worthwhile -- a 99% hit rate is transformative; a 10% hit rate wastes resources. |
| Over-provisioning: Latency vs Cost | Running servers at low utilization (30-50%) keeps queueing delays minimal and latency predictable. But you pay for idle capacity. Auto-scaling helps but reacts slowly -- during a sudden traffic spike, latency degrades before new instances come online. The cost of over-provisioning is the cost of predictable latency. |
| Parallelism: Latency vs Complexity | Parallel fan-out reduces request latency by executing sub-queries concurrently. But it increases total system load (N parallel calls consume N times the resources), adds tail-latency risk (the slowest shard determines overall latency), and complicates error handling and partial failure scenarios. |
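The tail-latency risk of fan-out in the last row can be quantified with a short calculation. It assumes each shard is slow independently with the same probability, which is a simplification.

```python
def p_request_slow(n_shards: int, p_shard_slow: float) -> float:
    """Probability that at least one of n_shards responses is slow,
    assuming shards are slow independently with probability p_shard_slow."""
    return 1 - (1 - p_shard_slow) ** n_shards


if __name__ == "__main__":
    # Each shard exceeds its p99 latency 1% of the time.
    for n in (1, 10, 100, 1000):
        print(f"{n:>4} shards -> {p_request_slow(n, 0.01):.1%} of requests hit a slow shard")
```

At 100 shards, roughly 63% of requests wait on at least one shard that is past its p99, which is why wide fan-outs are usually paired with techniques such as hedged requests.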
Amazon's 100ms Rule and the Latency-Revenue Curve
Scenario
In 2006, Amazon's internal studies revealed a direct correlation between page load latency and revenue. Every 100ms of added latency reduced sales by approximately 1%. At Amazon's scale, even small latency increases translated to hundreds of millions of dollars in lost revenue. However, the backend systems needed to handle explosive traffic growth -- throughput had to increase by 10x over three years without degrading latency.
Solution
Amazon invested in three parallel strategies: (1) CDN deployment (CloudFront) to move static assets closer to users, reducing latency by 50-100ms for most requests; (2) aggressive caching at every layer (in-memory caches, result caches, DNS caches) to avoid repeated computation; (3) asynchronous processing via SQS and SNS to decouple latency-critical paths from throughput-heavy background work. The key insight was separating the latency-sensitive read path (which must be fast) from the throughput-sensitive write path (which can batch and queue).
Outcome
By 2010, Amazon had reduced median page load time from 1.5s to under 300ms while increasing throughput by over 20x. The same separation of read and write paths underlies the CQRS (Command Query Responsibility Segregation) pattern used across the industry. The lesson: latency and throughput are not a zero-sum trade-off if you separate the read and write paths and optimize each independently.
1. According to Little's Law, if a system handles 2,000 RPS with an average latency of 50ms, how many concurrent requests are in-flight?
2. Why does latency spike non-linearly as server utilization approaches 100%?