
Health Checks & Readiness Probes

Liveness probes verify that a process is running and not deadlocked. Readiness probes verify that a service can handle traffic. Startup probes give slow-starting services time to initialize. Together they enable load balancers, orchestrators, and service meshes to route traffic only to healthy instances.

Overview

Health checks are the mechanism by which infrastructure determines whether an application instance is functioning correctly. In modern container orchestration (Kubernetes, ECS, Nomad) and load balancing (ALB, nginx, Envoy), health checks control two critical decisions: should this instance receive traffic, and should this instance be restarted?

Liveness probes answer: 'is this process alive and responsive?' If a liveness probe fails repeatedly, the orchestrator restarts the container. This catches deadlocks, infinite loops, and memory corruption where the process is running but not functioning. A typical liveness endpoint returns HTTP 200 if the process can handle requests and does not check external dependencies -- if the database is down, the process is still 'alive' (it can return errors).

Readiness probes answer: 'can this instance handle traffic right now?' If a readiness probe fails, the instance is removed from the load balancer's target group but is NOT restarted. This is appropriate when the instance is alive but temporarily unable to serve: it is warming up a cache, running a migration, or a downstream dependency is unreachable. When the readiness probe passes again, the instance is re-added to the target group.
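
The split between the two probes can be sketched as a small decision function. This is an illustrative model rather than orchestrator code: the `ProbeState` and `decide` names and the default threshold of 3 are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class ProbeState:
    """Hypothetical snapshot of one instance's recent probe results."""
    consecutive_liveness_failures: int
    readiness_passing: bool


def decide(state: ProbeState, liveness_failure_threshold: int = 3) -> dict:
    """Return the two independent decisions made for one instance."""
    # Liveness controls restarts: only repeated failures trigger one.
    restart = state.consecutive_liveness_failures >= liveness_failure_threshold
    # Readiness controls routing: a failing instance is drained, not restarted.
    route_traffic = state.readiness_passing and not restart
    return {"restart": restart, "route_traffic": route_traffic}
```

An instance with a failing readiness probe and a healthy liveness probe yields `restart=False, route_traffic=False`: it stays up and is simply taken out of rotation.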

Startup probes (alpha in Kubernetes 1.16, enabled by default since 1.18, stable in 1.20) address slow-starting applications. Until the startup probe succeeds, liveness and readiness probes are disabled. This prevents the liveness probe from killing a container that is legitimately still initializing (e.g., a JVM loading a large model, a service warming a cache). Without startup probes, teams set artificially high liveness `initialDelaySeconds`, which delays detection of genuine deadlocks during startup.
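
A rough timeline shows why a large `initialDelaySeconds` delays detection. The sketch below is a simplified model, not kubelet behavior: it ignores probe timeouts and jitter, and the function names and the numbers in the usage lines are assumptions chosen for illustration.

```python
import math


def liveness_start_with_startup_probe(app_ready_s: float, startup_period_s: float) -> float:
    """Liveness begins at the first startup-probe tick at or after the app is ready."""
    return math.ceil(app_ready_s / startup_period_s) * startup_period_s


def restart_time(deadlock_s: float, liveness_start_s: float,
                 period_s: float, failure_threshold: int) -> float:
    """Approximate time at which enough consecutive liveness probes have failed."""
    if deadlock_s <= liveness_start_s:
        first_failing = liveness_start_s
    else:
        # First liveness tick at or after the deadlock occurs.
        ticks = math.ceil((deadlock_s - liveness_start_s) / period_s)
        first_failing = liveness_start_s + ticks * period_s
    return first_failing + (failure_threshold - 1) * period_s


# App is ready after 18s and deadlocks at 25s; liveness uses period 10s, threshold 3.
with_initial_delay = restart_time(25, liveness_start_s=120, period_s=10, failure_threshold=3)
with_startup_probe = restart_time(25, liveness_start_with_startup_probe(18, 5),
                                  period_s=10, failure_threshold=3)
```

Under these assumptions the deadlock is acted on at roughly 140s with a 120s initial delay, versus roughly 50s when a startup probe hands over to liveness as soon as the app is up.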

The design of health check endpoints requires careful thought. An endpoint that checks every dependency (database, cache, queue, downstream services) creates a fragile system: if any dependency is briefly unreachable, all instances fail readiness simultaneously, causing a total outage even though the service could still serve cached data or degrade gracefully. The standard practice is: liveness checks only the process itself, readiness checks critical dependencies, and a separate `/health/detailed` endpoint exposes full dependency status for debugging.
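
The three-endpoint layout described above might look like the following minimal sketch. It is framework-agnostic: `probe_response` returns a status code and body that any HTTP handler could send. The `Checks` structure, the dependency names, and the JSON shapes are assumptions for illustration.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical registry: name -> (is_critical, check returning True when healthy).
Checks = Dict[str, Tuple[bool, Callable[[], bool]]]


def safe_check(check: Callable[[], bool]) -> bool:
    """Treat any exception raised by a check as an unhealthy result."""
    try:
        return bool(check())
    except Exception:
        return False


def probe_response(path: str, checks: Checks) -> Tuple[int, str]:
    if path == "/health/liveness":
        # Process-only: reaching this line proves the server can answer.
        return 200, json.dumps({"status": "alive"})
    if path == "/health/readiness":
        # Only critical dependencies gate traffic.
        ready = all(safe_check(fn) for critical, fn in checks.values() if critical)
        return (200 if ready else 503), json.dumps({"ready": ready})
    if path == "/health/detailed":
        # Full dependency status, intended for debugging rather than probes.
        return 200, json.dumps({name: safe_check(fn) for name, (_, fn) in checks.items()})
    return 404, json.dumps({"error": "not found"})
```

With a registry such as `{"database": (True, db_ok), "recommendations": (False, recs_ok)}`, an outage of the non-critical recommendations dependency shows up in `/health/detailed` but leaves readiness at 200.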

Key Points
  • Liveness probes detect deadlocked or crashed processes. They should be lightweight (check process responsiveness) and NOT check external dependencies. A false liveness failure causes an unnecessary restart, potentially triggering cascading failures.
  • Readiness probes determine traffic routing. They should check critical dependencies (DB connection pool has available connections, required caches are reachable). Failing readiness removes the instance from the load balancer without restarting it.
  • Startup probes protect slow-starting apps. They run during initialization and gate liveness/readiness probes until the app is ready. Use for JVM warmup, ML model loading, or cache priming that takes 30-120 seconds.
  • Health check endpoints must respond fast (<200ms). A health check that runs a DB query, makes an HTTP call, and checks disk space can itself time out under load, causing false failures.
  • Never make all dependencies hard readiness checks. If the recommendation service is down, the product page can still show products without recommendations. Check only dependencies required for the service's core function.
  • In Kubernetes, configure `failureThreshold` and `periodSeconds` carefully. Liveness: 3 failures × 10s = 30s before restart. Too aggressive (1 failure × 3s) causes restarts during normal GC pauses.
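
The threshold arithmetic in the last point can be worked through in a few lines. This is a simplified model: it assumes a probe fails only when the process stays paused for longer than the probe timeout, and the 2s pause and 1s timeout in the usage lines are illustrative values.

```python
import math


def time_to_restart_s(period_s: float, failure_threshold: int) -> float:
    """Sustained failure time before the orchestrator acts."""
    return period_s * failure_threshold


def worst_case_failed_probes(pause_s: float, period_s: float, timeout_s: float) -> int:
    """Most probes that can time out during a single pause (e.g., a GC pause)."""
    # A probe issued during the pause fails only if the pause outlasts its timeout.
    window = pause_s - timeout_s
    if window <= 0:
        return 0
    return math.ceil(window / period_s)


def pause_can_trigger_restart(pause_s: float, period_s: float,
                              timeout_s: float, failure_threshold: int) -> bool:
    return worst_case_failed_probes(pause_s, period_s, timeout_s) >= failure_threshold


# A 2s pause with a 1s probe timeout:
aggressive = pause_can_trigger_restart(2, period_s=3, timeout_s=1, failure_threshold=1)
conservative = pause_can_trigger_restart(2, period_s=10, timeout_s=1, failure_threshold=3)
```

In this model the aggressive setting can restart the container on a single unlucky probe, while the conservative setting tolerates the pause at the cost of a 30s detection window.
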
Simple Example

A Spring Boot Service with Three Probe Types

A Spring Boot service registers three endpoints: /health/liveness returns 200 if the JVM is responsive (no dependency checks). /health/readiness returns 200 if the PostgreSQL connection pool has at least 1 available connection AND the Redis cache is reachable. /health/startup returns 200 once the Flyway migration has completed and the application context is fully initialized. Kubernetes is configured with: startupProbe (http /health/startup, period 5s, failure 24 = 2 min max startup), livenessProbe (http /health/liveness, period 10s, failure 3), readinessProbe (http /health/readiness, period 5s, failure 2).
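
The Kubernetes side of this setup can be written out as data. The sketch below builds the probe section of a container spec as a Python dict (Kubernetes accepts JSON manifests as well as YAML) and derives each probe's failure budget. The field names follow the Kubernetes probe schema; port 8080 and the helper names are assumptions not stated in the example.

```python
import json


def http_probe(path: str, period_s: int, failure_threshold: int, port: int = 8080) -> dict:
    """Build an HTTP GET probe; the port value is an assumption for this sketch."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period_s,
        "failureThreshold": failure_threshold,
    }


container_probes = {
    "startupProbe": http_probe("/health/startup", period_s=5, failure_threshold=24),
    "livenessProbe": http_probe("/health/liveness", period_s=10, failure_threshold=3),
    "readinessProbe": http_probe("/health/readiness", period_s=5, failure_threshold=2),
}


def budget_s(probe: dict) -> int:
    """Seconds of consecutive failure before the probe's action is taken."""
    return probe["periodSeconds"] * probe["failureThreshold"]


if __name__ == "__main__":
    print(json.dumps(container_probes, indent=2))
```

The budgets work out to 120s of allowed startup time, 30s before a liveness restart, and 10s before the pod is removed from Service endpoints.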

Real-World Examples

Kubernetes

Kubernetes pioneered the three-probe model (liveness, readiness, startup) that has become the industry standard. The kubelet executes probes via HTTP GET, TCP socket, or exec command. Probe results drive pod lifecycle decisions: liveness failures trigger container restarts, readiness failures remove the pod from Service endpoints, and a pending startup probe holds off liveness and readiness checks until it succeeds (if it exhausts its failure threshold, the container is restarted).

AWS ALB

AWS Application Load Balancer uses health checks to manage target groups. Each target group is configured with a health check path (e.g., /health), an interval (default 30s), and an unhealthy threshold (default 2 consecutive failures). Unhealthy targets receive no traffic until they pass the 'healthy threshold' (default 5 consecutive successes). This prevents routing traffic to instances that just started and have not yet warmed up.
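
The consecutive-threshold behavior can be modeled with a small state tracker. This is a sketch of the general technique, not AWS code: the `TargetHealth` class is hypothetical, and the thresholds are passed in explicitly rather than assumed.

```python
class TargetHealth:
    """Tracks one target's state using consecutive success/failure thresholds."""

    def __init__(self, healthy_threshold: int, unhealthy_threshold: int,
                 healthy: bool = False) -> None:
        self.healthy_threshold = healthy_threshold
        self.unhealthy_threshold = unhealthy_threshold
        self.healthy = healthy  # new targets start out of rotation
        self._successes = 0
        self._failures = 0

    def record(self, success: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if success:
            self._successes += 1
            self._failures = 0
            if not self.healthy and self._successes >= self.healthy_threshold:
                self.healthy = True
        else:
            self._failures += 1
            self._successes = 0
            if self.healthy and self._failures >= self.unhealthy_threshold:
                self.healthy = False
        return self.healthy
```

For example, `TargetHealth(healthy_threshold=5, unhealthy_threshold=2)` stays out of rotation for its first four passing checks, which is what keeps a freshly started instance from receiving traffic too early.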

Netflix

Netflix uses a health check pattern called 'dependency health' where each service reports the health of its dependencies as a weighted score. If a non-critical dependency is down, the service reports itself as 'partially healthy' and the load balancer reduces (but does not eliminate) traffic to that instance. This prevents hard dependency failures from causing complete instance removal.
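
A weighted score of this kind could be computed as below. This is a generic illustration of the idea, not Netflix's implementation: the function name, the weights, and the dependency names are all hypothetical.

```python
from typing import Dict, Tuple


def health_score(deps: Dict[str, Tuple[float, bool]]) -> float:
    """Return the fraction of total dependency weight that is currently healthy.

    deps maps a dependency name to (weight, is_healthy).
    """
    total = sum(weight for weight, _ in deps.values())
    if total == 0:
        return 1.0  # no dependencies means nothing can be degraded
    healthy = sum(weight for weight, ok in deps.values() if ok)
    return healthy / total


# Hypothetical weights: the database matters three times as much as recommendations.
score = health_score({"database": (3, True), "recommendations": (1, False)})
```

A load balancer that supports weighted routing could then scale the instance's share of traffic by this score instead of removing the instance outright.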

Trade-Offs
  • Thorough Checks vs. Cascading Failures: Checking every dependency in the readiness probe means a single dependency failure removes all instances from the load balancer simultaneously. Checking nothing means routing traffic to instances that cannot serve requests. The balance: check only hard dependencies required for the core request path.
  • Aggressive Timeouts vs. False Positives: Short probe intervals (3s) and low failure thresholds (1) detect failures fast but cause false positives during GC pauses, deployment surges, or transient network blips. Conservative settings (10s interval, 3 failures) are more stable but delay failure detection by 30+ seconds.
  • Restart (Liveness) vs. Drain (Readiness): Restarting a misbehaving container (liveness failure) is a hard reset that may fix the issue but disrupts in-flight requests. Draining traffic (readiness failure) is gentle but keeps the broken instance running, consuming resources. Use liveness for unrecoverable states (deadlock) and readiness for transient states (dependency down).
  • Startup Probe vs. Initial Delay: Before startup probes existed, teams used `initialDelaySeconds` on liveness probes to give apps time to start. But if the app starts faster than the delay, deadlocks during startup go undetected. Startup probes decouple startup time from liveness checking, providing the best of both worlds.
Case Study

Shopify's Black Friday Health Check Incident

Scenario

During Black Friday 2019, Shopify experienced a cascading failure triggered by health checks. Their readiness probe checked MySQL connectivity with a simple 'SELECT 1' query. Under extreme load, MySQL's connection pool was saturated and health check queries timed out. Kubernetes marked all pods as not ready and removed them from the Service, causing a total traffic blackout even though the pods could serve cached responses for most endpoints.

Solution

Post-incident, Shopify redesigned their health checks: readiness probes check only whether the connection pool has at least 1 available connection (without executing a query), and a separate deep health endpoint is used for monitoring dashboards only. They also added a 'degraded' state that reduces traffic weight instead of fully removing the pod.
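
A readiness check that inspects pool availability without running a query can be sketched as follows. The `ConnectionPool` class is a toy stand-in written for this example, not Shopify's code or any real driver's API; a real pool would expose its idle count through its own interface.

```python
import queue


class ConnectionPool:
    """Toy pool standing in for a real database connection pool (hypothetical)."""

    def __init__(self, size: int) -> None:
        self._idle: "queue.Queue[str]" = queue.Queue()
        for i in range(size):
            self._idle.put(f"conn-{i}")

    def acquire(self) -> str:
        # Raises queue.Empty when no idle connection is left.
        return self._idle.get_nowait()

    def release(self, conn: str) -> None:
        self._idle.put(conn)

    def idle_count(self) -> int:
        return self._idle.qsize()


def is_ready(pool: ConnectionPool, min_idle: int = 1) -> bool:
    """Readiness reads a counter; it never executes a query against the database."""
    return pool.idle_count() >= min_idle
```

Because the check only reads a counter, it does not compete with request traffic for a connection and cannot time out waiting on the database.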

Outcome

The redesigned probes prevented cascading failures during subsequent high-traffic events by decoupling readiness from full dependency health, allowing pods to serve cached and degraded responses when dependencies were under pressure.

Common Mistakes
  • Checking external dependencies in the liveness probe: If the database goes down, the liveness probe fails, Kubernetes restarts all pods, and during restart the pods still cannot reach the database, causing a CrashLoopBackOff restart loop. Liveness probes should only check the process itself (is the HTTP server responding?); external dependency checks belong in readiness probes, where failure removes traffic routing without restarting the container.
  • Setting probe timeout equal to or greater than the interval: With a 10s timeout and a 10s interval, a slow health check leaves no gap between probes, and under load the backlog of slow checks can make the pod appear permanently unhealthy. Set the timeout well below the interval (e.g., timeout 3s, interval 10s, failure threshold 3), which requires about 30 seconds of sustained failure before action and leaves clear gaps between probes.
  • No startup probe for slow-starting applications: A Java service that takes 60 seconds to load a large ML model gets killed by the liveness probe before it finishes starting, entering CrashLoopBackOff. Add a startup probe with a generous failure threshold (e.g., period 5s x failureThreshold 24 = 2 minutes); liveness and readiness probes are disabled until the startup probe passes.
  • Health check endpoint does too much work: The /health endpoint runs a DB query, calls Redis PING, checks disk usage, and validates TLS certificates, and under load this endpoint itself becomes slow causing false probe failures. Keep health check endpoints fast (<100ms) -- liveness returns 200 if the server is listening, readiness checks connection pool availability (not a live query), and a separate /health/detailed endpoint handles debugging.
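
Some of the mistakes above can be caught mechanically before deployment. The sketch below lints the probe section of a container spec given as a dict. The field names and the fallback defaults (`timeoutSeconds` 1, `periodSeconds` 10) follow the Kubernetes probe schema; the `lint_probes` helper and its 30s cutoff for a "large" initial delay are assumptions made for this example.

```python
from typing import Dict, List


def lint_probes(container: Dict) -> List[str]:
    """Return warnings for common probe misconfigurations."""
    warnings = []
    for name in ("startupProbe", "livenessProbe", "readinessProbe"):
        probe = container.get(name)
        if probe is None:
            continue
        timeout = probe.get("timeoutSeconds", 1)
        period = probe.get("periodSeconds", 10)
        if timeout >= period:
            warnings.append(
                f"{name}: timeoutSeconds ({timeout}) should be lower than "
                f"periodSeconds ({period})"
            )
    liveness = container.get("livenessProbe")
    # Assumed cutoff: a delay of 30s or more suggests a slow start that a startup probe should cover.
    if (liveness and "startupProbe" not in container
            and liveness.get("initialDelaySeconds", 0) >= 30):
        warnings.append("livenessProbe: large initialDelaySeconds without a startupProbe")
    return warnings
```

A check like this cannot tell whether a liveness endpoint touches external dependencies or does too much work; those mistakes still need review of the endpoint code itself.
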
Test Your Understanding

1. What happens when a Kubernetes readiness probe fails?

2. Why should liveness probes NOT check external dependencies like databases?
