Load Testing & Benchmarking

Load testing is the practice of simulating realistic traffic against a system to measure performance under expected and peak conditions. Benchmarking is the related practice of measuring the maximum throughput and latency characteristics of individual components. Together, they validate capacity plans, identify bottlenecks, and establish performance baselines before production deployment.

Overview

Load testing is the bridge between capacity planning estimates and production readiness. A back-of-envelope calculation might tell you that your system needs to handle 10,000 RPS, but only a load test can confirm whether it actually does. Without load testing, you are deploying to production on faith -- hoping that your database can handle the query volume, your application servers can manage the concurrency, and your network can sustain the bandwidth. Load testing replaces hope with evidence.

There are four main types of performance tests. Load testing simulates expected production traffic to verify the system meets latency and throughput SLOs under normal conditions. Stress testing pushes beyond expected capacity to find the breaking point -- the load at which latency degrades unacceptably or errors begin. Soak testing (endurance testing) runs a sustained load for hours or days to detect slow resource leaks: memory leaks, connection pool exhaustion, log file growth, and database bloat that only manifest over time. Spike testing subjects the system to sudden traffic bursts (e.g., simulating a flash sale or viral event) to test auto-scaling, circuit breakers, and graceful degradation.
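
The four test types differ mainly in the shape of the traffic they apply over time. The sketch below expresses each one as a target request rate at second `t`. The function name `target_rps`, the baseline of 1,000 RPS, and all ramp lengths and multipliers are illustrative assumptions, not recommended values.

```python
def target_rps(kind: str, t: float, normal: float = 1000.0) -> float:
    """Return the target request rate at second t for a given test type.

    Every shape and multiplier here is a hypothetical example.
    """
    if kind == "load":
        # Ramp to expected traffic over 2 minutes, then hold.
        return normal * min(t / 120.0, 1.0)
    if kind == "stress":
        # Step up by 25% of normal every 5 minutes until something breaks.
        return normal * (1.0 + 0.25 * (t // 300))
    if kind == "soak":
        # Constant expected traffic; the run itself lasts hours or days.
        return normal
    if kind == "spike":
        # Baseline traffic with a 10x burst between t=600s and t=660s.
        return normal * 10.0 if 600 <= t < 660 else normal
    raise ValueError(f"unknown test type: {kind}")


for kind in ("load", "stress", "soak", "spike"):
    print(kind, [target_rps(kind, t) for t in (60, 630, 900)])
```

A load generator would call a function like this once per scheduling tick to decide how many requests to start.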

The most critical technical consideration in load testing is avoiding coordinated omission. This measurement error occurs when a closed-loop load generator waits for each response before sending the next request. During a period of slowness (e.g., a GC pause), the generator sends fewer requests, so fewer requests are measured as slow. The result is an artificially rosy picture of tail latency. Open-loop generators maintain a constant request rate regardless of response time, so they measure the impact of a slowdown on every request that would have arrived during the delay. wrk2 works this way by design; k6 and Gatling support it when configured with an arrival-rate executor or an open injection profile, respectively, but their plain virtual-user modes are closed-loop.
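
The effect is easy to reproduce in a small simulation. The sketch below models a single-threaded server that handles each request in 1 ms but freezes for one second, and compares what a closed-loop client and an open-loop client would record. All numbers (100 RPS intended rate, 10-second run, 1-second stall) and the names `simulate` and `percentile` are assumptions chosen for illustration.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]


def simulate(open_loop, duration_ms=10_000, interval_ms=10, service_ms=1,
             stall_start_ms=1_000, stall_end_ms=2_000):
    """Return per-request latencies (ms) as seen by the load generator."""
    latencies = []
    server_free = 0  # time at which the single server is next idle
    prev_done = 0    # completion time of the previous request
    for i in range(duration_ms // interval_ms):
        intended = i * interval_ms
        # Open loop sends on schedule; closed loop also waits for the
        # previous response before it sends again.
        sent = intended if open_loop else max(intended, prev_done)
        start = max(sent, server_free)
        if stall_start_ms <= start < stall_end_ms:
            start = stall_end_ms  # server is frozen until the stall ends
        done = start + service_ms
        server_free = prev_done = done
        latencies.append(done - sent)
    return latencies


closed = simulate(open_loop=False)
opened = simulate(open_loop=True)
print("closed-loop p99 (ms):", percentile(closed, 99))
print("open-loop   p99 (ms):", percentile(opened, 99))
```

In this model the closed-loop client records only one slow request, so its p99 stays at the 1 ms service time, while the open-loop client records every request that queued behind the stall and reports a p99 of roughly 0.9 seconds.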

Effective load testing requires production-like conditions: realistic data volumes (not empty databases), production-like hardware (not developer laptops), representative traffic patterns (not uniform request distributions), and proper warm-up periods (to fill caches and JIT compile code). A load test on a database with 1,000 rows tells you nothing about performance with 100 million rows. A benchmark on a cold JVM tells you nothing about steady-state performance after JIT compilation. The quality of the test environment determines the quality of the results.
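
One simple way to respect the warm-up period is to keep generating traffic from the start but discard early samples before computing statistics. This is a minimal sketch; the sample format `(seconds_since_start, latency_ms)`, the 120-second cutoff, and the sample values are assumptions for illustration.

```python
def steady_state(samples, warmup_s=120.0):
    """Drop samples recorded during warm-up.

    samples: iterable of (seconds_since_test_start, latency_ms) pairs.
    """
    return [latency for ts, latency in samples if ts >= warmup_s]


# Hypothetical samples: early requests hit cold caches and unoptimized code.
samples = [(0, 900.0), (60, 400.0), (120, 30.0), (180, 25.0)]
measured = steady_state(samples)
print(measured)
```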

Key Points
  • Four types of performance tests: load (normal traffic, verify SLOs), stress (beyond capacity, find breaking point), soak (sustained load, find leaks), spike (sudden burst, test scaling). Each reveals different issues and should be run at different stages of the development lifecycle.
  • Open-loop vs closed-loop generators: closed-loop generators (ab, wrk without rate limiting) slow down when the server slows down, hiding tail latency (coordinated omission). Open-loop generators (wrk2, or k6 and Gatling configured for a fixed arrival rate) maintain a constant request rate, accurately measuring tail latency impact.
  • Always warm up before measuring. JVM JIT compilation, cache population, connection pool filling, and database query plan caching all improve performance after initial requests. Run at least 2-5 minutes of warm-up traffic before collecting measurements.
  • Test with production-like data volumes. Without a suitable index, query cost often grows as O(n) or O(n log n): a scan that takes 1ms over 1,000 rows takes on the order of 10 seconds over 10 million rows. Synthetic data should match production cardinality, distribution, and size.
  • Monitor system resources during load tests, not just response times. CPU, memory, disk I/O, network bandwidth, connection pool usage, and GC pauses provide context for why performance degrades. A CPU-bound bottleneck and an I/O-bound bottleneck require very different solutions.
  • Run load tests regularly, not just before launches. Performance regressions are introduced gradually by individual commits. Automated performance tests in CI/CD catch regressions before they reach production: a 5% regression per release compounds to roughly 63% degradation over 10 releases (1.05^10 ≈ 1.63).
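
To make the last point concrete, the sketch below computes how small per-release regressions compound and shows one possible shape for a CI gate that compares a run against a stored baseline. The function names, the 5% tolerance, and the p99 values are hypothetical; a real gate would also need to account for run-to-run noise.

```python
def compounded_slowdown(per_release: float, releases: int) -> float:
    """Total fractional slowdown after repeated multiplicative regressions."""
    return (1.0 + per_release) ** releases - 1.0


def regression_gate(baseline_p99_ms: float, current_p99_ms: float,
                    tolerance: float = 0.05) -> bool:
    """Return True when the current p99 is within tolerance of the baseline."""
    return current_p99_ms <= baseline_p99_ms * (1.0 + tolerance)


# Ten releases that are each 5% slower than the previous one.
print(f"{compounded_slowdown(0.05, 10):.1%}")

# Hypothetical baseline of 120 ms: 125 ms passes, 130 ms fails the build.
print(regression_gate(120.0, 125.0), regression_gate(120.0, 130.0))
```
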
Simple Example

Load Testing an API Endpoint

You have a product search API that must handle 5,000 RPS with p99 latency under 200ms. Using k6, you configure a test: ramp from 0 to 5,000 RPS over 2 minutes, hold at 5,000 for 10 minutes, then ramp down. During the test, you observe: p50=25ms, p95=80ms, p99=350ms. The p99 exceeds your 200ms SLO. Resource monitoring shows CPU at 45% but database connections maxed at 100 (the pool limit). The bottleneck is connection pool size, not CPU. You increase the pool to 200, rerun the test: p99 drops to 120ms. The load test identified a bottleneck that would have caused an outage at peak traffic.
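
Little's Law (average concurrency = arrival rate × average time in system) gives a quick sanity check on a pool limit like this one. The sketch below is a back-of-envelope calculation, not a sizing tool: the 30 ms average connection hold time is a hypothetical figure that does not come from the example, and real pools need headroom for bursts.

```python
import math


def max_hold_time_ms(pool_size: int, rps: float) -> float:
    """Longest average connection hold time a pool can sustain at a given rate."""
    return pool_size / rps * 1000.0


def required_connections(rps: int, hold_time_ms: int) -> int:
    """Little's Law: average busy connections = arrival rate x hold time."""
    return math.ceil(rps * hold_time_ms / 1000)


# A pool of 100 at 5,000 RPS only keeps up if each request holds its
# connection for 20 ms or less on average.
print(max_hold_time_ms(100, 5000))

# Assumed 30 ms average hold time (hypothetical): about 150 busy connections,
# which exceeds a pool of 100 but fits within a pool of 200.
print(required_connections(5000, 30))
```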

Real-World Examples

Amazon

Amazon runs 'GameDay' load tests before every Prime Day, simulating 10-20x normal traffic across their entire infrastructure. These tests have revealed bottlenecks in unexpected places -- DNS resolution, TLS handshake overhead, and even CloudWatch metric ingestion -- that would have caused outages at Prime Day scale. Fixes from GameDay exercises have prevented multiple potential incidents.

GitHub

GitHub uses Scientist (a Ruby library they created) to benchmark proposed code changes against production traffic. New code runs alongside the existing implementation on real requests, and the results are compared. If the new code produces different results or is significantly slower, it is flagged. This combines load testing with correctness verification in production.

Stripe

Stripe runs continuous performance tests against their payment API using production-like traffic patterns. They discovered that their system handled sustained load well but degraded under bursty traffic due to database connection storm behavior -- many requests simultaneously opening new connections. Implementing connection pooling with pre-warming eliminated the spike sensitivity.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Production vs Staging Load Tests | Staging tests are safe but may not represent production (different hardware, data, traffic patterns). Production load tests (shadow traffic, canary analysis) are more realistic but risk affecting real users. The best approach: staging for breaking-point tests, production for validation with shadow traffic or during low-traffic windows. |
| Synthetic vs Replay Traffic | Synthetic traffic (generated from a model) is easy to produce but may not match real user behavior patterns. Replay traffic (recorded from production) is realistic but may contain sensitive data, be difficult to reproduce timing, and not scale easily. Hybrid approaches (replay patterns with synthetic data) balance realism and practicality. |
| Test Duration vs Resource Cost | Short tests (5-10 minutes) are cheap and fast but miss slow leaks and time-dependent issues. Long soak tests (24-72 hours) catch memory leaks and connection exhaustion but are expensive and block testing infrastructure. Run short tests frequently (every PR) and long tests weekly or before major releases. |
| Accuracy vs Complexity of Load Model | A simple load model (uniform request rate, single endpoint) is easy to build but unrealistic. A complex model (multiple endpoints, think times, session state, data dependencies) is accurate but expensive to maintain. Start with a simple model that hits the critical path, then add complexity for endpoints that historically cause issues. |
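
A first step up from a single-endpoint model is a weighted traffic mix. The sketch below samples request targets according to fixed weights. The endpoint paths and their proportions are hypothetical; in practice they would come from production access logs.

```python
import random
from collections import Counter

# Hypothetical endpoint mix (share of total requests).
TRAFFIC_MIX = {"/search": 0.70, "/product": 0.25, "/checkout": 0.05}


def sample_requests(mix, n, seed=42):
    """Draw n endpoint names according to the weights in mix."""
    rng = random.Random(seed)  # seeded so that test runs are repeatable
    endpoints = list(mix)
    weights = [mix[e] for e in endpoints]
    return rng.choices(endpoints, weights=weights, k=n)


counts = Counter(sample_requests(TRAFFIC_MIX, 10_000))
for endpoint, count in counts.most_common():
    print(endpoint, count)
```
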
Case Study

Coordinated Omission in a Major E-Commerce Load Test

Scenario

An e-commerce company was preparing for their annual flash sale, expecting 50x normal traffic. They ran a load test using Apache Benchmark (ab), a closed-loop generator, at 50,000 RPS. The results looked excellent: p99 latency was 150ms, well within their 200ms SLO. Confident in the results, they deployed to production. During the actual flash sale, p99 latency exceeded 3 seconds, and the site experienced partial outages.

Solution

Post-mortem analysis revealed that the load test suffered from coordinated omission. Apache Benchmark waits for each response before sending the next request. During the test, a periodic garbage collection pause (every 30 seconds, lasting 200ms) delayed responses. During these pauses, ab sent fewer requests, so fewer requests experienced the delay. The measured p99 was artificially low. Rerunning the test with wrk2 (an open-loop generator that maintains constant request rate) revealed the true p99: 2.5 seconds, because requests that arrived during GC pauses all experienced the full delay.
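
When only closed-loop measurements are available, they can be partially corrected after the fact by back-filling the requests that should have been sent during each long response. The sketch below follows the expected-interval approach popularized by HdrHistogram, written here as plain Python rather than against any library API. The function name, the sample latencies, and the 10 ms interval are illustrative assumptions, and the correction is an approximation rather than a substitute for an open-loop generator.

```python
def correct_for_coordinated_omission(latencies_ms, expected_interval_ms):
    """Back-fill latencies for requests a closed-loop client failed to send.

    If a response took longer than the intended send interval, the requests
    that should have been sent meanwhile would each have waited one interval
    less than the one before them.
    """
    corrected = []
    for latency in latencies_ms:
        corrected.append(latency)
        missing = latency - expected_interval_ms
        while missing >= expected_interval_ms:
            corrected.append(missing)
            missing -= expected_interval_ms
    return corrected


# One 50 ms response at a 10 ms intended interval hides four slow requests.
raw = [1, 1, 50]
fixed = correct_for_coordinated_omission(raw, expected_interval_ms=10)
print(fixed)
```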

Outcome

The team switched to wrk2 for all future load tests and implemented three fixes: (1) tuned GC settings to reduce pause duration from 200ms to 20ms, (2) added request hedging for critical API calls, and (3) set up automated load tests with wrk2 in CI/CD to catch regressions. The next flash sale handled 50x traffic with p99 under 100ms. The key lesson: the load testing tool matters as much as the load test itself. Closed-loop generators give false confidence; open-loop generators reveal the truth.

Common Mistakes
  • Using a closed-loop load generator (ab, basic wrk, basic JMeter) and not accounting for coordinated omission. These tools can underreport tail latency by an order of magnitude or more when the server stalls. Use open-loop generation (wrk2, or k6 and Gatling configured for a fixed arrival rate) for accurate tail latency measurement.
  • Load testing against an empty or trivially small database. Real performance depends on data volume -- index sizes, buffer pool hit rates, and query plan selection all change with data size. Populate test databases to production-like scale before testing.
  • Running load tests from a single client machine. If the client saturates its own CPU or network before the server, you are measuring client limitations, not server performance. Distribute load generation across multiple machines.
  • Not monitoring server-side metrics during the load test. Without CPU, memory, disk I/O, and connection pool metrics, you know that performance is bad but not why. Always capture system metrics alongside response time measurements to diagnose bottlenecks.
Test Your Understanding

1. What is coordinated omission in load testing?

2. Why is soak testing (endurance testing) important even if load tests pass?