Latency vs Throughput

Latency is the time it takes for a single request to travel from the client to the server and back, while throughput is the number of requests a system can handle per unit of time. These two metrics are fundamentally linked but often in tension -- optimizing for one frequently comes at the cost of the other.

Overview

Latency and throughput are the two most fundamental performance metrics in system design. Latency measures the elapsed time for a single operation -- from the moment a client sends a request to the moment it receives a response. Throughput measures how many operations the system completes per unit of time, typically expressed as requests per second (RPS) or queries per second (QPS). Every system design interview and every production system tuning effort ultimately revolves around these two numbers.

The relationship between latency and throughput is governed by Little's Law: L = lambda * W, where L is the average number of concurrent requests in the system, lambda is the throughput (arrival rate), and W is the average latency (time in system). This means that for a fixed concurrency level, reducing latency directly increases throughput, and vice versa. Little's Law itself always holds, but W is not a constant: as throughput approaches system capacity, queueing effects cause latency to spike non-linearly.
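
As a quick check of the arithmetic, Little's Law can be written as two one-line helpers. This is a minimal sketch; the function names `in_flight` and `max_throughput` are illustrative and do not come from any library.

```python
def in_flight(throughput_rps: float, latency_s: float) -> float:
    """L = lambda * W: average number of requests in the system."""
    return throughput_rps * latency_s


def max_throughput(concurrency: float, latency_s: float) -> float:
    """lambda = L / W: throughput sustainable at a fixed concurrency."""
    return concurrency / latency_s


# 1000 RPS at 100 ms average latency.
print(in_flight(1000, 0.100))
# A pool of 100 workers, each request taking 50 ms.
print(max_throughput(100, 0.050))
```

The second call shows the capacity-planning direction: with concurrency capped (by a thread pool or connection limit, say), the only ways to raise throughput are to lower latency or raise the cap.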

Batching is the classic technique for trading latency for throughput. Instead of processing each request individually, the system waits to accumulate a batch, then processes them together. Kafka producers batch messages before sending; database write-ahead logs batch commits; GPU-based ML inference batches multiple inputs. Each individual request waits longer (higher latency), but the system processes more requests per second (higher throughput) because it amortizes fixed costs like network round trips, disk seeks, and context switches.
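
The amortization argument can be made concrete with a toy cost model. The sketch below assumes each batch pays one fixed cost plus a per-item cost; the 5 ms and 0.1 ms figures are made-up illustrative values, not measurements of any real system, and the time spent waiting for a batch to fill is ignored.

```python
def batch_stats(batch_size: int, fixed_ms: float, per_item_ms: float):
    """Return (throughput in items/sec, processing time per batch in ms)."""
    # Every batch pays the fixed cost once, plus a cost per item.
    batch_ms = fixed_ms + batch_size * per_item_ms
    throughput = batch_size / batch_ms * 1000.0
    return throughput, batch_ms


for n in (1, 10, 100):
    tput, ms = batch_stats(n, fixed_ms=5.0, per_item_ms=0.1)
    print(f"batch={n:>3}  throughput={tput:8.0f}/s  batch time={ms:5.1f} ms")
```

Under this model throughput rises with batch size because the fixed cost is shared, while every item in the batch waits for the whole batch to finish.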

The inverse trade-off -- sacrificing throughput for latency -- appears in caching, pre-computation, and over-provisioning. A cache hit returns in microseconds instead of milliseconds, dramatically reducing latency, but the cache itself consumes memory and CPU that could otherwise serve more requests. Over-provisioning servers to keep utilization low prevents queueing delays (low latency) but wastes capacity (lower throughput per dollar). Understanding when each trade-off is appropriate is a core system design skill.
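
Whether a cache pays off depends on its hit ratio, which a weighted average makes visible. The sketch below is an assumption-laden illustration: the 0.1 ms hit cost and 20 ms miss cost are hypothetical, and `miss_ms` is taken to include both the failed lookup and the backend call.

```python
def expected_latency_ms(hit_ratio: float, hit_ms: float, miss_ms: float) -> float:
    """Average request latency for a cache with the given hit ratio."""
    return hit_ratio * hit_ms + (1.0 - hit_ratio) * miss_ms


for ratio in (0.10, 0.90, 0.99):
    avg = expected_latency_ms(ratio, hit_ms=0.1, miss_ms=20.0)
    print(f"hit ratio {ratio:.0%}: average latency {avg:.2f} ms")
```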

Key Points

1. Latency is measured per-request (p50, p95, p99 percentiles); throughput is measured system-wide (RPS, QPS, MB/s). Both are essential: a system with 1ms latency but 10 RPS throughput is useless for high-traffic services, and a system handling 1M RPS with 10s latency creates terrible user experience.
2. Little's Law (L = lambda * W) connects concurrency, throughput, and latency mathematically. If your server handles 1000 RPS with 100ms average latency, you have 100 concurrent requests in-flight at any time. This formula is invaluable for capacity planning.
3. Queueing theory explains why latency spikes as utilization increases. At 50% utilization, average queue wait is roughly equal to service time. At 90%, it is 9x service time. At 99%, it is 99x. This is why keeping servers below 70% CPU is standard practice.
4. Batching trades latency for throughput by amortizing fixed costs. Kafka batches messages (linger.ms), databases batch WAL commits (group commit), and GPUs batch inference requests. The optimal batch size balances latency SLA against throughput gain.
5. Pipelining and parallelism can improve both metrics simultaneously -- up to a point. HTTP/2 multiplexing sends multiple requests over one connection (higher throughput, same latency). Parallel fan-out queries reduce latency but increase total system load.
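6. The speed of light imposes a hard latency floor: ~1ms per 200km of fiber in each direction. A 4,000km US coast-to-coast path therefore costs at least ~20ms one way (~40ms round trip), and a 5,600km transatlantic path at least ~28ms one way (~56ms round trip); measured round trips run higher because cables do not follow straight lines and routers add delay. No optimization can beat physics; CDNs and edge computing address this by moving data closer to users.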
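
The multipliers in the queueing point follow from the M/M/1 model, where mean queue wait is service time times utilization / (1 - utilization). The sketch below assumes that model (a single server with Poisson arrivals and exponential service times), which real systems only approximate; the function names are illustrative.

```python
def mm1_queue_wait(service_time: float, utilization: float) -> float:
    """Mean time spent waiting in the queue (excluding service) for M/M/1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return service_time * utilization / (1.0 - utilization)


def mm1_time_in_system(service_time: float, utilization: float) -> float:
    """Mean total latency: queue wait plus the service itself."""
    return service_time + mm1_queue_wait(service_time, utilization)


for rho in (0.5, 0.7, 0.9, 0.99):
    wait = mm1_queue_wait(1.0, rho)
    print(f"utilization {rho:.0%}: queue wait = {wait:.1f}x service time")
```

Note the distinction between the two helpers: queue wait at 90% utilization is about 9x the service time, while total time in the system is about 10x.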

Simple Example

The Highway Analogy

Think of a highway between two cities. Latency is how long it takes one car to drive from City A to City B -- say 2 hours. Throughput is how many cars arrive at City B per hour -- say 1000 cars/hour. You can increase throughput by adding more lanes (parallelism) or making cars carpool (batching), but neither changes the drive time for an individual car. You can reduce latency by building a shorter highway (CDN/edge), but that does not change how many cars the original highway can handle. The only way to improve both simultaneously is a fundamental upgrade -- a faster highway with more lanes.

Real-World Examples

Amazon

Amazon famously found that every 100ms of added latency cost them 1% in sales. This drove investment in CDNs, edge caching, and precomputation. Simultaneously, their backend services handle millions of RPS by batching DynamoDB writes and using SQS to decouple throughput-sensitive pipelines from latency-sensitive request paths.

Kafka

Apache Kafka achieves high throughput (millions of messages/sec per broker) by batching producer writes (linger.ms + batch.size), using sequential disk I/O, and zero-copy transfers. The trade-off is explicit: increasing linger.ms from 0 to 5ms can double throughput but adds up to 5ms to end-to-end latency, because a message may wait that long for its batch to be sent. Users tune this based on whether they need real-time processing or high-volume ingestion.
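
To see how a linger window groups messages, here is a simplified model of linger-style batching. It is not Kafka's actual implementation: `batch_by_linger` is a hypothetical helper, and `max_batch` counts messages whereas Kafka's batch.size is in bytes. A batch opens when its first message arrives and is sent when the linger window expires or the batch is full.

```python
def batch_by_linger(arrivals_ms, linger_ms, max_batch):
    """Group sorted arrival times into batches.

    Returns a list of (send_time_ms, [arrival times in that batch]).
    """
    batches = []
    current = []
    for t in arrivals_ms:
        # Send the open batch if its linger deadline passed before this arrival.
        if current and t > current[0] + linger_ms:
            batches.append((current[0] + linger_ms, current))
            current = []
        current.append(t)
        # A full batch is sent immediately, without waiting out the window.
        if len(current) == max_batch:
            batches.append((t, current))
            current = []
    if current:
        batches.append((current[0] + linger_ms, current))
    return batches


arrivals = [0, 1, 2, 3, 10, 11]
for linger in (0, 5):
    batches = batch_by_linger(arrivals, linger_ms=linger, max_batch=100)
    worst = max(send - msgs[0] for send, msgs in batches)
    print(f"linger={linger} ms: {len(batches)} sends, worst added delay {worst} ms")
```

With no linger every message is its own send; with a 5 ms window the same traffic needs fewer sends, at the cost of the earliest message in each batch waiting up to the full window.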

Google Search

Google serves search results in under 200ms despite querying thousands of index shards. They use aggressive parallelism (fan-out to all shards simultaneously), hedged requests (sending duplicate requests to reduce tail latency), and result caching. The throughput is enormous -- processing over 100,000 queries per second -- achieved through massive horizontal scaling.
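
The hedged-request idea can be sketched with Python's asyncio. This is an illustration of the general technique, not Google's implementation; `hedged` and `demo` are hypothetical names, and the delays are arbitrary.

```python
import asyncio


async def hedged(call, hedge_after_s: float):
    """Run call(); if it has not finished after hedge_after_s seconds,
    start a second attempt and return whichever finishes first."""
    first = asyncio.create_task(call())
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()
    second = asyncio.create_task(call())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # drop the slower attempt
    return next(iter(done)).result()


async def demo():
    delays = iter([0.5, 0.01])  # first attempt is slow, the hedge is fast

    async def call():
        delay = next(delays)
        await asyncio.sleep(delay)
        return delay

    return await hedged(call, hedge_after_s=0.05)


print(asyncio.run(demo()))
```

The hedge delay controls the trade-off: a short delay cuts tail latency more aggressively but sends more duplicate work to the backend.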

Trade-Offs

Aspect	Description
Batching: Throughput vs Latency	Batching amortizes fixed costs (network round trips, disk flushes) across many requests, increasing throughput. But each request must wait for the batch to fill or a timeout to expire, adding latency. Kafka's linger.ms, database group commit, and Nagle's algorithm all make this trade-off explicitly configurable.
Caching: Latency vs Consistency	Caches dramatically reduce latency by serving pre-computed results, but they consume memory that could serve more concurrent requests (reducing throughput per server) and introduce staleness. The cache hit ratio determines whether the trade-off is worthwhile -- a 99% hit rate is transformative; a 10% hit rate wastes resources.
Over-provisioning: Latency vs Cost	Running servers at low utilization (30-50%) keeps queueing delays minimal and latency predictable. But you pay for idle capacity. Auto-scaling helps but reacts slowly -- during a sudden traffic spike, latency degrades before new instances come online. The cost of over-provisioning is the cost of predictable latency.
Parallelism: Latency vs Complexity	Parallel fan-out reduces request latency by executing sub-queries concurrently. But it increases total system load (N parallel calls consume N times the resources), adds tail-latency risk (the slowest shard determines overall latency), and complicates error handling and partial failure scenarios.
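
The tail-latency risk of fan-out can be quantified: if each call is slow with some small probability, the chance that at least one of N calls is slow grows quickly with N. The sketch below assumes the calls are independent, which real shards sharing hardware or networks may not be; `p_any_slow` is an illustrative name.

```python
def p_any_slow(n_shards: int, p_slow: float) -> float:
    """Probability that at least one of n independent calls is slow."""
    return 1.0 - (1.0 - p_slow) ** n_shards


for n in (1, 10, 100, 1000):
    print(f"{n:>4} shards: {p_any_slow(n, 0.01):.1%} of requests hit a slow shard")
```

With a 1% per-shard chance of a slow response, a 100-shard fan-out sees a slow shard on roughly 63% of requests, so the per-shard p99 effectively becomes the typical latency of the whole request.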

Case Study

Amazon's 100ms Rule and the Latency-Revenue Curve

Scenario

In 2006, Amazon's internal studies revealed a direct correlation between page load latency and revenue. Every 100ms of added latency reduced sales by approximately 1%. At Amazon's scale, even small latency increases translated to hundreds of millions of dollars in lost revenue. However, the backend systems needed to handle explosive traffic growth -- throughput had to increase by 10x over three years without degrading latency.

Solution

Amazon invested in three parallel strategies: (1) CDN deployment (CloudFront) to move static assets closer to users, reducing latency by 50-100ms for most requests; (2) aggressive caching at every layer (in-memory caches, result caches, DNS caches) to avoid repeated computation; (3) asynchronous processing via SQS and SNS to decouple latency-critical paths from throughput-heavy background work. The key insight was separating the latency-sensitive read path (which must be fast) from the throughput-sensitive write path (which can batch and queue).

Outcome

By 2010, Amazon had reduced median page load time from 1.5s to under 300ms while increasing throughput by over 20x. The architecture became the template for CQRS (Command Query Responsibility Segregation) patterns used across the industry. The lesson: latency and throughput are not a zero-sum trade-off if you separate the read and write paths and optimize each independently.

Common Mistakes

⚠ Using average latency instead of percentiles. An average of 50ms can hide that 1% of requests take 5 seconds. Always measure p50, p95, p99, and p99.9 -- the tail is where user pain lives.
⚠ Assuming throughput scales linearly with resources. Doubling servers rarely doubles throughput due to coordination overhead, shared state contention, and Amdahl's Law. Measure actual throughput gains before committing to scaling plans.
⚠ Ignoring queueing effects at high utilization. Systems behave well at 50% utilization but degrade catastrophically above 80%. The M/M/1 queueing model gives queue wait = service_time * utilization / (1 - utilization), so at 90% utilization, average wait is 9x the service time (and total time in the system is 10x).
⚠ Optimizing latency and throughput simultaneously without understanding the trade-off. Adding a cache reduces latency but may not increase throughput if the bottleneck is downstream. Profile first, then optimize the actual bottleneck.

Related Concepts

P99 & Tail Latency
Capacity Planning
Load Testing & Benchmarking
CDN & Edge Caching
Cache-Aside Pattern

Test Your Understanding

1. According to Little's Law, if a system handles 2,000 RPS with an average latency of 50ms, how many concurrent requests are in-flight?

2. Why does latency spike non-linearly as server utilization approaches 100%?