What is coordinated omission in load testing?
Load testing is the practice of simulating realistic traffic against a system to measure performance under expected and peak conditions. Benchmarking is the related practice of measuring the maximum throughput and latency characteristics of individual components. Together, they validate capacity plans, identify bottlenecks, and establish performance baselines before production deployment.
Load testing is the bridge between capacity planning estimates and production readiness. A back-of-envelope calculation might tell you that your system needs to handle 10,000 RPS, but only a load test can confirm whether it actually does. Without load testing, you are deploying to production on faith -- hoping that your database can handle the query volume, your application servers can manage the concurrency, and your network can sustain the bandwidth. Load testing replaces hope with evidence.
There are four main types of performance tests. Load testing simulates expected production traffic to verify the system meets latency and throughput SLOs under normal conditions. Stress testing pushes beyond expected capacity to find the breaking point -- the load at which latency degrades unacceptably or errors begin. Soak testing (endurance testing) runs a sustained load for hours or days to detect slow resource leaks: memory leaks, connection pool exhaustion, log file growth, and database bloat that only manifest over time. Spike testing subjects the system to sudden traffic bursts (e.g., simulating a flash sale or viral event) to test auto-scaling, circuit breakers, and graceful degradation.
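The four test types differ mainly in the shape and duration of the load they apply. The sketch below expresses each as a function from elapsed seconds to a target request rate. The function names and every default (step size, spike window, spike multiplier) are hypothetical values chosen for illustration, not recommendations.

```python
def load_rps(t_s, expected_rps):
    """Load test: hold the expected production rate."""
    return expected_rps


def stress_rps(t_s, expected_rps, step_every_s=60, step_pct=25):
    """Stress test: step the rate up past the expected level to find the breaking point."""
    steps = t_s // step_every_s
    return expected_rps * (100 + step_pct * steps) // 100


def soak_rps(t_s, expected_rps):
    """Soak test: same rate as a load test; the difference is running for hours or days."""
    return expected_rps


def spike_rps(t_s, expected_rps, spike_start_s=300, spike_len_s=60, multiplier=10):
    """Spike test: a sudden burst on top of the expected rate (hypothetical 10x for 60 s)."""
    in_spike = spike_start_s <= t_s < spike_start_s + spike_len_s
    return expected_rps * multiplier if in_spike else expected_rps


if __name__ == "__main__":
    for t in (0, 120, 310, 400):
        print(t, load_rps(t, 1000), stress_rps(t, 1000), spike_rps(t, 1000))
```

A load generator would sample one of these functions each second to decide how many requests to schedule.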
The most critical technical consideration in load testing is avoiding coordinated omission. This measurement error occurs when a closed-loop load generator waits for each response before sending the next request. During a period of slowness (e.g., a GC pause), the generator sends fewer requests, so fewer requests are measured as slow, and the delay that the skipped requests would have experienced is never recorded. The result is an artificially rosy picture of tail latency. Open-loop generators such as wrk2 keep to a constant request rate regardless of response time; k6 and Gatling support the same model through their arrival-rate (open-model) workload settings, although both also offer closed-model modes. An open-loop test accurately measures the impact of slowdowns on all requests that would have arrived during the delay.
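The effect can be reproduced with a small deterministic simulation. The sketch below is a toy model, not how any real load generator is implemented: it assumes a single FIFO server with a 1 ms service time, an intended rate of one request every 10 ms, and a single 1-second stall. All constants and function names are made up for illustration.

```python
"""Toy, deterministic simulation of coordinated omission (all times in ms)."""

INTERVAL_MS = 10        # intended gap between requests (100 requests/s)
SERVICE_MS = 1          # normal service time per request
STALL_START_MS = 5_000  # server stops processing here (e.g. a long pause)
STALL_END_MS = 6_000    # ... and resumes here
DURATION_MS = 10_000    # length of the test


def finish_time(start_ms):
    """Return when a request that begins processing at start_ms completes."""
    if STALL_START_MS <= start_ms < STALL_END_MS:
        start_ms = STALL_END_MS  # work cannot begin until the stall ends
    return start_ms + SERVICE_MS


def closed_loop_latencies():
    """One connection: wait for each response before sending the next."""
    latencies = []
    t = 0
    while t < DURATION_MS:
        done = finish_time(t)
        latencies.append(done - t)      # measured from the actual send time
        t = max(t + INTERVAL_MS, done)  # a slow response delays the next send
    return latencies


def open_loop_latencies():
    """Requests are issued on schedule and queue at a single FIFO server."""
    latencies = []
    server_free_at = 0
    for intended in range(0, DURATION_MS, INTERVAL_MS):
        done = finish_time(max(intended, server_free_at))
        server_free_at = done
        latencies.append(done - intended)  # measured from the intended send time
    return latencies


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


if __name__ == "__main__":
    closed = closed_loop_latencies()
    opened = open_loop_latencies()
    print("closed-loop samples:", len(closed), "p99 ms:", percentile(closed, 99))
    print("open-loop samples:  ", len(opened), "p99 ms:", percentile(opened, 99))
```

In this model the closed-loop run records only one slow sample, because it stops sending while it waits, so its p99 stays at the normal service time. The open-loop run records every request scheduled during the stall, and its p99 is close to the length of the stall.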
Effective load testing requires production-like conditions: realistic data volumes (not empty databases), production-like hardware (not developer laptops), representative traffic patterns (not uniform request distributions), and proper warm-up periods (to fill caches and JIT compile code). A load test on a database with 1,000 rows tells you nothing about performance with 100 million rows. A benchmark on a cold JVM tells you nothing about steady-state performance after JIT compilation. The quality of the test environment determines the quality of the results.
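One practical consequence of the warm-up requirement is that samples collected before caches are filled and code is compiled should be excluded from the reported percentiles. The following is a minimal sketch under assumed inputs: samples are hypothetical `(seconds_since_start, latency_ms)` pairs, and the 5-second warm-up cutoff is arbitrary.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # ceil(pct * n / 100)
    return ordered[rank - 1]


def steady_state_latencies(samples, warmup_s):
    """Drop samples recorded before the warm-up period has elapsed."""
    return [latency for t, latency in samples if t >= warmup_s]


if __name__ == "__main__":
    # Hypothetical data: the first three requests hit cold caches.
    samples = [(0, 500), (1, 400), (2, 300), (10, 20), (11, 25), (12, 22), (13, 30)]
    all_latencies = [latency for _, latency in samples]
    steady = steady_state_latencies(samples, warmup_s=5)
    print("p99 including warm-up:", percentile(all_latencies, 99))
    print("p99 steady state:     ", percentile(steady, 99))
```

Reporting both numbers can be useful: the steady-state figure reflects normal operation, while the warm-up figure shows what users see right after a deploy or restart.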
Load Testing an API Endpoint
You have a product search API that must handle 5,000 RPS with p99 latency under 200ms. Using k6 with an arrival-rate executor, you configure a test: ramp from 0 to 5,000 RPS over 2 minutes, hold at 5,000 for 10 minutes, then ramp down. During the test, you observe: p50=25ms, p95=80ms, p99=350ms. The p99 exceeds your 200ms SLO. Resource monitoring shows CPU at 45% but database connections maxed at 100 (the pool limit). The bottleneck is connection pool size, not CPU. You increase the pool to 200 and rerun the test: p99 drops to 120ms. The load test identified a bottleneck that could have caused an outage at peak traffic.
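The pool-size finding can be sanity-checked with Little's Law: the average number of connections in use equals the request rate multiplied by the time each request holds a connection. The example above does not state a hold time, so the 30 ms figure below is purely an assumption for illustration, as are the function names.

```python
import math


def required_connections(rps, hold_time_ms):
    """Average connections in use at a given rate (Little's Law: L = lambda * W)."""
    return math.ceil(rps * hold_time_ms / 1000)


def max_hold_time_ms(pool_size, rps):
    """Longest average hold time a pool can sustain at a given rate."""
    return pool_size * 1000 / rps


if __name__ == "__main__":
    rps = 5_000
    assumed_hold_ms = 30  # hypothetical: not given in the example
    print("connections needed:", required_connections(rps, assumed_hold_ms))
    print("pool of 100 sustains up to", max_hold_time_ms(100, rps), "ms per request")
    print("pool of 200 sustains up to", max_hold_time_ms(200, rps), "ms per request")
```

At 5,000 RPS a pool of 100 only keeps up if each request holds its connection for 20 ms or less on average; beyond that, requests queue for a connection and tail latency climbs even while CPU is idle. This is an average-case estimate, so a real pool needs headroom above it for bursts.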
Amazon
Amazon runs 'GameDay' load tests before every Prime Day, simulating 10-20x normal traffic across their entire infrastructure. These tests have revealed bottlenecks in unexpected places -- DNS resolution, TLS handshake overhead, and even CloudWatch metric ingestion -- that would have caused outages at Prime Day scale. Fixes from GameDay exercises have prevented multiple potential incidents.
GitHub
GitHub uses Scientist (a Ruby library they created) to compare proposed code changes against the existing implementation on production traffic. The new code path runs alongside the existing one on real requests, the existing code's result is what gets returned to the caller, and the outcomes and timings of both are compared. If the new code produces different results or is significantly slower, it is flagged. This combines performance comparison with correctness verification in production.
Stripe
Stripe runs continuous performance tests against their payment API using production-like traffic patterns. They discovered that their system handled sustained load well but degraded under bursty traffic due to database connection storm behavior -- many requests simultaneously opening new connections. Implementing connection pooling with pre-warming eliminated the spike sensitivity.
| Aspect | Description |
|---|---|
| Production vs Staging Load Tests | Staging tests are safe but may not represent production (different hardware, data, traffic patterns). Production load tests (shadow traffic, canary analysis) are more realistic but risk affecting real users. The best approach: staging for breaking-point tests, production for validation with shadow traffic or during low-traffic windows. |
| Synthetic vs Replay Traffic | Synthetic traffic (generated from a model) is easy to produce but may not match real user behavior patterns. Replay traffic (recorded from production) is realistic but may contain sensitive data, be difficult to reproduce timing, and not scale easily. Hybrid approaches (replay patterns with synthetic data) balance realism and practicality. |
| Test Duration vs Resource Cost | Short tests (5-10 minutes) are cheap and fast but miss slow leaks and time-dependent issues. Long soak tests (24-72 hours) catch memory leaks and connection exhaustion but are expensive and block testing infrastructure. Run short tests frequently (every PR) and long tests weekly or before major releases. |
| Accuracy vs Complexity of Load Model | A simple load model (uniform request rate, single endpoint) is easy to build but unrealistic. A complex model (multiple endpoints, think times, session state, data dependencies) is accurate but expensive to maintain. Start with a simple model that hits the critical path, then add complexity for endpoints that historically cause issues. |
Coordinated Omission in a Major E-Commerce Load Test
Scenario
An e-commerce company was preparing for their annual flash sale, expecting 50x normal traffic. They ran a load test using ApacheBench (ab), a closed-loop generator that is driven by a concurrency level rather than a target rate, and reached 50,000 RPS. The results looked excellent: p99 latency was 150ms, well within their 200ms SLO. Confident in the results, they deployed to production. During the actual flash sale, p99 latency exceeded 3 seconds, and the site experienced partial outages.
Solution
Post-mortem analysis revealed that the load test suffered from coordinated omission. ApacheBench waits for each response on a connection before sending the next request on that connection. During the test, a periodic garbage collection pause (every 30 seconds, lasting 200ms) delayed responses. During these pauses, ab sent fewer requests, so fewer requests experienced the delay, and the measured p99 was artificially low. Rerunning the test with wrk2 (an open-loop generator that maintains a constant request rate) revealed the true p99: 2.5 seconds, because every request scheduled during a pause was still sent and measured, including the time it spent queued behind the pause and the backlog that built up.
Outcome
The team switched to wrk2 for all future load tests and implemented three fixes: (1) tuned GC settings to reduce pause duration from 200ms to 20ms, (2) added request hedging for critical API calls, and (3) set up automated load tests with wrk2 in CI/CD to catch regressions. The next flash sale handled 50x traffic with p99 under 100ms. The key lesson: the load testing tool matters as much as the load test itself. For a service whose traffic arrives at a rate it does not control, closed-loop generators can give false confidence in tail latency; open-loop generators expose it.