What happens when a Kubernetes readiness probe fails?
Liveness probes verify that a process is running and not deadlocked. Readiness probes verify that an instance can handle traffic. Startup probes give slow-starting services time to initialize. Together they let load balancers, orchestrators, and service meshes route traffic only to healthy instances.
Health checks are the mechanism by which infrastructure determines whether an application instance is functioning correctly. In modern container orchestration (Kubernetes, ECS, Nomad) and load balancing (ALB, nginx, Envoy), health checks control two critical decisions: should this instance receive traffic, and should this instance be restarted?
Liveness probes answer: 'is this process alive and responsive?' If a liveness probe fails repeatedly, the orchestrator restarts the container. This catches deadlocks, infinite loops, and memory corruption where the process is running but not functioning. A typical liveness endpoint returns HTTP 200 if the process can handle requests and does not check external dependencies -- if the database is down, the process is still 'alive' (it can return errors).
Readiness probes answer: 'can this instance handle traffic right now?' If a readiness probe fails, the instance is removed from the load balancer's target group (in Kubernetes, from the Service's endpoints) but is NOT restarted. This is appropriate when the instance is alive but temporarily unable to serve: it is warming up a cache, running a migration, or waiting on an unreachable downstream dependency. When the readiness probe passes again, the instance is re-added to the target group.
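The remove-and-re-add behavior can be sketched as a small state tracker. This is an illustrative model, not orchestrator code: the class name `ReadinessTracker` is hypothetical, and the threshold names only mirror the consecutive failure and success counts that probe configurations typically expose.

```python
from dataclasses import dataclass


@dataclass
class ReadinessTracker:
    """Tracks whether an instance should receive traffic, based on probe results."""

    failure_threshold: int = 2  # consecutive failures before removal from rotation
    success_threshold: int = 1  # consecutive successes before re-adding
    ready: bool = False
    _failures: int = 0
    _successes: int = 0

    def record(self, probe_passed: bool) -> bool:
        """Record one probe result and return the resulting ready state."""
        if probe_passed:
            self._successes += 1
            self._failures = 0
            if self._successes >= self.success_threshold:
                self.ready = True
        else:
            self._failures += 1
            self._successes = 0
            if self._failures >= self.failure_threshold:
                self.ready = False  # removed from rotation, never restarted
        return self.ready


tracker = ReadinessTracker()
history = [tracker.record(r) for r in [True, False, True, False, False, True]]
print(history)
```

A single failed probe does not remove the instance; only two consecutive failures do, and the next success puts it back in rotation. Nothing in this model restarts the process, which is the key difference from a liveness failure.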
Startup probes (alpha in Kubernetes 1.16, enabled by default since 1.18, stable in 1.20) address slow-starting applications. Until the startup probe succeeds, liveness and readiness probes are not run. This prevents the liveness probe from killing a container that is legitimately still initializing (e.g., a JVM loading a large model, a service warming a cache). If the startup probe exhausts its failure threshold without succeeding, the container is killed and restarted. Without startup probes, teams set artificially high liveness `initialDelaySeconds`, which delays detection of genuine deadlocks during startup.
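The gating described above can be illustrated with a short simulation. It is a simplified sketch under stated assumptions: the function `run_probes` and its action labels are hypothetical, probe results are supplied as plain booleans, and liveness thresholds are left out to keep the focus on the startup gate.

```python
def run_probes(startup_results, liveness_results, startup_failure_threshold):
    """Return the sequence of actions taken for one container.

    startup_results and liveness_results hold one boolean per probe period.
    Liveness results are only consumed after the startup probe has passed.
    """
    actions = []
    failures = 0
    started = False
    for passed in startup_results:
        if passed:
            started = True
            actions.append("startup-ok")
            break
        failures += 1
        actions.append("startup-fail")
        if failures >= startup_failure_threshold:
            # Startup budget exhausted: the container is restarted.
            actions.append("restart")
            return actions
    if not started:
        return actions  # still starting, liveness has not begun
    for passed in liveness_results:
        actions.append("liveness-ok" if passed else "liveness-fail")
    return actions


slow_but_fine = run_probes([False, False, True], [True, True], 24)
never_starts = run_probes([False, False, False], [True], 3)
print(slow_but_fine)
print(never_starts)
```

In the first run the two early startup failures are tolerated and liveness checking begins only after the startup probe passes. In the second run the startup budget is exhausted, so the container is restarted without a single liveness probe having run.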
The design of health check endpoints requires careful thought. An endpoint that checks every dependency (database, cache, queue, downstream services) creates a fragile system: if any dependency is briefly unreachable, all instances fail readiness simultaneously, causing a total outage even though the service could still serve cached data or degrade gracefully. The standard practice is: liveness checks only the process itself, readiness checks critical dependencies, and a separate `/health/detailed` endpoint exposes full dependency status for debugging.
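The split between the three endpoints can be sketched as a pure routing function. This is a minimal illustration, not a framework integration: `evaluate`, `_safe`, and the dependency names are hypothetical, and returning 200 from the detailed endpoint regardless of dependency state is an assumption made here because that endpoint is meant for debugging rather than routing.

```python
from typing import Callable, Dict, Tuple

Check = Callable[[], bool]


def _safe(check: Check) -> bool:
    """Run a dependency check, treating any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def evaluate(path: str, critical: Dict[str, Check], optional: Dict[str, Check]) -> Tuple[int, dict]:
    """Map a health path to an (HTTP status, body) pair."""
    if path == "/health/liveness":
        # Process-only check: no dependencies are consulted.
        return 200, {"status": "alive"}
    if path == "/health/readiness":
        # Only hard dependencies of the core request path affect readiness.
        results = {name: _safe(check) for name, check in critical.items()}
        ok = all(results.values())
        return (200 if ok else 503), {"status": "ready" if ok else "not ready", "checks": results}
    if path == "/health/detailed":
        # Full dependency status for humans and dashboards.
        results = {name: _safe(check) for name, check in {**critical, **optional}.items()}
        return 200, {"checks": results}
    return 404, {"error": "unknown path"}


def database_down() -> bool:
    raise ConnectionError("database unreachable")


optional = {"recommendations": lambda: False}
ready_status, _ = evaluate("/health/readiness", {"database": lambda: True}, optional)
not_ready_status, _ = evaluate("/health/readiness", {"database": database_down}, optional)
alive_status, _ = evaluate("/health/liveness", {"database": database_down}, optional)
print(ready_status, not_ready_status, alive_status)
```

An unavailable optional dependency leaves readiness at 200, a failed critical dependency turns it into 503, and liveness stays 200 in both cases, so a database outage drains traffic without triggering restarts.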
A Spring Boot Service with Three Probe Types
A Spring Boot service registers three endpoints. /health/liveness returns 200 if the JVM is responsive (no dependency checks). /health/readiness returns 200 if the PostgreSQL connection pool has at least 1 available connection AND the Redis cache is reachable. /health/startup returns 200 once the Flyway migration has completed and the application context is fully initialized. Kubernetes is configured with: startupProbe (HTTP GET /health/startup, period 5s, failure threshold 24, so startup may take up to 5s × 24 = 2 minutes), livenessProbe (HTTP GET /health/liveness, period 10s, failure threshold 3), readinessProbe (HTTP GET /health/readiness, period 5s, failure threshold 2).
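The time budgets implied by the configuration above can be worked out with a few lines of arithmetic. This is a rough sketch: the `Probe` class is hypothetical, and the calculation multiplies period by failure threshold only, ignoring per-probe timeouts and any initial delay, so real detection times will differ somewhat.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    path: str
    period_seconds: int
    failure_threshold: int

    def failure_window_seconds(self) -> int:
        """Approximate time until consecutive failures exhaust the threshold."""
        return self.period_seconds * self.failure_threshold


probes = {
    "startup": Probe("/health/startup", period_seconds=5, failure_threshold=24),
    "liveness": Probe("/health/liveness", period_seconds=10, failure_threshold=3),
    "readiness": Probe("/health/readiness", period_seconds=5, failure_threshold=2),
}

windows = {name: probe.failure_window_seconds() for name, probe in probes.items()}
for name, seconds in windows.items():
    print(f"{name}: {seconds}s")
```

Under these assumptions the service has about 120 seconds to start, a hung process is restarted after roughly 30 seconds, and an instance that cannot serve is pulled from rotation after roughly 10 seconds.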
Kubernetes
Kubernetes popularized the three-probe model (liveness, readiness, startup) that has become the industry standard. The kubelet executes probes via HTTP GET, TCP socket, exec command, or (in newer releases) gRPC. Probe results drive pod lifecycle decisions: liveness failures trigger container restarts, readiness failures remove the pod from Service endpoints, and the startup probe holds back liveness and readiness checks until it succeeds. If the startup probe exceeds its failure threshold, the container is restarted.
AWS ALB
AWS Application Load Balancer uses health checks to manage target groups. Health checks are configured per target group: a path (e.g., /health), an interval (default 30s), an unhealthy threshold (default 2 consecutive failures), and a healthy threshold (default 5 consecutive successes). A target marked unhealthy receives no traffic until it passes the healthy threshold again. This keeps traffic away from instances that are not yet consistently healthy.
Netflix
Netflix uses a health check pattern called 'dependency health' where each service reports the health of its dependencies as a weighted score. If a non-critical dependency is down, the service reports itself as 'partially healthy' and the load balancer reduces (but does not eliminate) traffic to that instance. This prevents hard dependency failures from causing complete instance removal.
| Trade-off | Description |
|---|---|
| Thorough Checks vs. Cascading Failures | Checking every dependency in the readiness probe means a single dependency failure removes all instances from the load balancer simultaneously. Checking nothing means routing traffic to instances that cannot serve requests. The balance: check only hard dependencies required for the core request path. |
| Aggressive Timeouts vs. False Positives | Short probe intervals (3s) and a failure threshold of 1 detect failures fast but cause false positives during GC pauses, deployment surges, or transient network blips. Conservative settings (10s interval, failure threshold 3) are more stable but take roughly 30 seconds (3 × 10s) or more to detect a failure. |
| Restart (Liveness) vs. Drain (Readiness) | Restarting a misbehaving container (liveness failure) is a hard reset that may fix the issue but disrupts in-flight requests. Draining traffic (readiness failure) is gentle but keeps the broken instance running, consuming resources. Use liveness for unrecoverable states (deadlock) and readiness for transient states (dependency down). |
| Startup Probe vs. Initial Delay | Before startup probes existed, teams used `initialDelaySeconds` on liveness probes to give apps time to start. But liveness checking does not begin until the delay expires, so if the app starts faster than the delay, a deadlock in the remaining window goes undetected until the delay ends. Startup probes decouple startup time from liveness checking: slow starts are tolerated, and liveness checking begins as soon as startup completes. |
Shopify's Black Friday Health Check Incident
Scenario
During Black Friday 2019, Shopify experienced a cascading failure triggered by health checks. Their readiness probe checked MySQL connectivity with a simple 'SELECT 1' query. Under extreme load, MySQL's connection pool was saturated and health check queries timed out. Kubernetes marked all pods as not ready and removed them from the Service, causing a total traffic blackout even though the pods could serve cached responses for most endpoints.
Solution
Post-incident, Shopify redesigned their health checks: readiness probes check only whether the connection pool has at least 1 available connection (without executing a query), and a separate deep health endpoint is used for monitoring dashboards only. They also added a 'degraded' state that reduces traffic weight instead of fully removing the pod.
Outcome
The redesigned probes prevented cascading failures during subsequent high-traffic events by decoupling readiness from full dependency health, allowing pods to serve cached and degraded responses when dependencies were under pressure.