Alerting & On-Call

Alerting converts observability signals into actionable notifications. Effective alerting is symptom-based (alert on user impact, not internal metrics), respects severity tiers, and integrates with on-call rotation and incident management to minimize noise and maximize response speed.

Overview

Alerting is the bridge between observability data and human action. A well-designed alerting system detects user-facing problems quickly, routes them to the right person, and provides enough context for rapid diagnosis. A poorly designed system drowns engineers in noise, causing alert fatigue that leads to real incidents being ignored.

The fundamental principle of modern alerting is: alert on symptoms, not causes. A 'symptom' is something the user experiences: elevated error rate, high latency, data staleness. A 'cause' is an internal system state: high CPU, low disk space, connection pool exhaustion. Cause-based alerts are fragile -- CPU can spike during garbage collection (harmless) or a connection pool can empty during a deployment (transient). Symptom-based alerts fire only when users are actually affected.

The Google SRE approach to alerting is based on error budget burn rate. Instead of alerting when the error rate exceeds a static threshold, you alert when the rate of error budget consumption would exhaust the budget before the SLO window expires. A burn rate of 1x spends the budget exactly at the end of the window; higher multiples spend it proportionally faster. A fast burn (for example 14.4x, which consumes 2% of a 30-day budget in 1 hour) triggers an immediate page. A slow burn (for example 1x, which consumes 10% of the budget over 3 days) triggers a ticket for next-business-day investigation. Multi-window alerting uses short (5m) and long (1h) windows together to reduce false positives: both windows must exceed the threshold to fire.
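The burn-rate check described above can be sketched as follows. This is a minimal illustration rather than a production rule: the function names `burn_rate` and `should_page` are hypothetical, the 14.4x default is just one commonly used threshold, and the error ratios are assumed to come from whatever metrics backend you use.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Return how many times faster than 'exactly on budget' errors accrue.

    A burn rate of 1.0 spends the whole error budget exactly at the end of
    the SLO window; 14.4 spends it 14.4 times faster.
    """
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / error_budget


def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Fire only when BOTH the short and the long window exceed the threshold."""
    return (burn_rate(short_window_errors, slo_target) >= threshold
            and burn_rate(long_window_errors, slo_target) >= threshold)


# Hypothetical readings for a 99.9% SLO (error ratios as fractions of requests).
print(should_page(0.05, 0.02, slo_target=0.999))    # True: both windows are hot
print(should_page(0.05, 0.0005, slo_target=0.999))  # False: long window is calm
```

Requiring both windows means a brief spike that has not yet moved the long window does not page, while the short window stops the alert from firing long after the errors have ended.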

Severity classification determines routing. P1 (critical): user-facing outage, pages on-call immediately, requires response within 5 minutes. P2 (high): significant degradation, pages during business hours, 30-minute response. P3 (medium): minor impact, creates a ticket, next-business-day. P4 (low): informational, logged but no notification. Every alert must have exactly one severity level, and the classification must be automated based on the alert rule, not left to human judgment during an incident.
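One way to keep classification out of human hands during an incident is to make routing a pure lookup on the severity declared in the alert rule. The sketch below assumes that approach; `Severity`, `Route`, `ROUTES`, and `route_alert` are illustrative names, and the action strings are placeholders for whatever your paging or ticketing system expects.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class Route:
    action: str                      # "page", "page_business_hours", "ticket", "log"
    response_minutes: Optional[int]  # None where no timed response is expected


# Routing is a fixed lookup keyed by the severity declared on the alert rule.
ROUTES = {
    Severity.P1: Route("page", 5),
    Severity.P2: Route("page_business_hours", 30),
    Severity.P3: Route("ticket", None),
    Severity.P4: Route("log", None),
}


def route_alert(rule_severity: str) -> Route:
    """Resolve a rule's declared severity to its route; reject unknown levels."""
    try:
        return ROUTES[Severity[rule_severity]]
    except KeyError:
        raise ValueError(f"unknown severity: {rule_severity!r}") from None


print(route_alert("P1"))  # Route(action='page', response_minutes=5)
```

Because unknown severities raise an error instead of falling back to a default, a rule without a valid level fails at configuration time, not at 3 AM.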

On-call rotation distributes the interrupt burden across the team. Effective rotations are 1-week shifts with a primary and secondary responder, handoff documentation, and a maximum of 2 pages per shift (on average). If the team receives more than 2 pages per week, the alerting system is too noisy and must be tuned before the rotation becomes unsustainable.
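A rotation like this is simple enough to generate and audit in a few lines. The following is a sketch under assumptions: the engineer names are made up, `weekly_rotation` and `is_too_noisy` are hypothetical helpers, and real schedules also need overrides for holidays and swaps.

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (shift_start, primary, secondary) for consecutive 1-week shifts.

    The secondary is the next engineer in the list, so each person serves as
    secondary the week before their primary shift.
    """
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]
        yield start + timedelta(weeks=week), primary, secondary


def is_too_noisy(pages_per_shift: list[int], max_average: float = 2.0) -> bool:
    """True when the average page count per shift exceeds the target."""
    if not pages_per_shift:
        return False
    return sum(pages_per_shift) / len(pages_per_shift) > max_average


for shift in weekly_rotation(["ana", "ben", "chi"], date(2024, 1, 1), 3):
    print(shift)

print(is_too_noisy([1, 4, 3]))  # True: average is above 2 pages per shift
```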

Key Points
  • 1. Alert on symptoms (user impact), not causes (system internals). 'Error rate > 1% for 5 minutes' is better than 'CPU > 90%'. Users do not care about your CPU; they care that their requests fail.
  • 2. Use multi-window, multi-burn-rate alerting. A 14.4x burn rate over 1 hour, confirmed by a 5-minute window (fast burn), and a 1x burn rate over 3 days, confirmed by a 6-hour window (slow burn), catch both sudden outages and gradual degradation while minimizing false positives.
  • 3. Every alert must have a runbook: what the alert means, how to verify it is real, triage steps, escalation path. If you cannot write a runbook, the alert is not actionable and should not exist.
  • 4. Target ≤2 pages per on-call shift. More than that causes alert fatigue, where engineers start ignoring or auto-acknowledging pages. Alert fatigue during a real incident causes delayed response.
  • 5. Use severity tiers: P1 pages immediately (5-min response), P2 pages during business hours (30-min response), P3 creates a ticket (next day). Automate severity classification in the alert rule.
  • 6. Route alerts through an incident management platform (PagerDuty, Opsgenie, Rootly) that handles escalation, deduplication, and post-incident tracking. Do not send alerts directly to Slack -- messages get lost.
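Several of these points can be enforced mechanically before a rule ever reaches production. The sketch below assumes alert rules are represented as simple records; `AlertRule`, `lint_rule`, and the example URL are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass

VALID_SEVERITIES = {"P1", "P2", "P3", "P4"}


@dataclass
class AlertRule:
    name: str
    severity: str
    runbook_url: str = ""
    symptom_based: bool = True  # False for cause-based rules such as "CPU > 90%"


def lint_rule(rule: AlertRule) -> list[str]:
    """Return a list of problems; an empty list means the rule passes."""
    problems = []
    if rule.severity not in VALID_SEVERITIES:
        problems.append(f"{rule.name}: severity must be one of P1-P4")
    if not rule.runbook_url.strip():
        problems.append(f"{rule.name}: missing runbook")
    if not rule.symptom_based and rule.severity in {"P1", "P2"}:
        problems.append(f"{rule.name}: cause-based rules should not page")
    return problems


good = AlertRule("orders-error-budget", "P1", "https://runbooks.example/orders")
bad = AlertRule("cpu-high", "P1", symptom_based=False)
print(lint_rule(good))  # []
print(lint_rule(bad))   # two problems: missing runbook, cause-based rule paging
```

Running a check like this in CI is one possible way to make "no runbook, no alert" a hard rule instead of a convention.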
Simple Example

Error Budget Burn Rate Alert

A service has a 99.9% monthly SLO (error budget: about 43 minutes over 30 days). The alerting system monitors two windows: a 5-minute window with a 14.4x burn-rate threshold (a rate that would exhaust the budget in about 50 hours if sustained) and a 1-hour window with a 6x threshold (a rate that would exhaust it in 5 days). At 2:00 AM, a bad deployment causes 5% errors, which is a 50x burn rate against the 0.1% budget. The 5-minute window crosses 14.4x within about 2 minutes. The 1-hour window crosses 6x after about 7 minutes. Both windows exceeding their thresholds triggers a P1 page. The on-call engineer receives a page with: 'Orders API error budget burning at 50x. 5-minute error rate: 5%. Runbook: link.' They roll back the deployment within 8 minutes of the page. Roughly 15 minutes of 5% errors consume about 0.8 minutes of the 43-minute monthly budget.
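The arithmetic behind an example like this can be checked with a few helper functions. This is a sketch under simplifying assumptions: steady traffic, no errors before the deployment, and a 30-day window. Rule evaluation intervals and scrape delays add time in practice, so treat the trip times as lower bounds. All function names are hypothetical.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200


def budget_minutes(slo_target: float) -> float:
    """Error budget in minutes of full downtime for a 30-day window."""
    return (1.0 - slo_target) * MINUTES_PER_30_DAYS


def minutes_to_trip(window_min: float, threshold: float,
                    error_ratio: float, slo_target: float) -> float:
    """Minutes of constant `error_ratio` before the window's average burn rate
    reaches `threshold`, assuming zero errors beforehand and steady traffic."""
    needed_ratio = threshold * (1.0 - slo_target)  # average ratio that trips the alert
    if error_ratio < needed_ratio:
        return float("inf")  # never trips at this error level
    return window_min * needed_ratio / error_ratio


def budget_spent(duration_min: float, error_ratio: float) -> float:
    """Budget consumed, in minutes of full-downtime equivalent."""
    return duration_min * error_ratio


slo = 0.999
print(round(budget_minutes(slo), 1))                  # 43.2
print(round(minutes_to_trip(5, 14.4, 0.05, slo), 2))  # 1.44
print(round(minutes_to_trip(60, 6.0, 0.05, slo), 2))  # 7.2
print(round(budget_spent(15, 0.05), 2))               # 0.75 (15 minutes of 5% errors)
```

The same functions show why the two windows complement each other: the 5-minute window reacts within a couple of minutes, while the 1-hour window needs several minutes of sustained errors before it agrees.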

Real-World Examples

Google

Google's SRE teams use error budget burn rate as the primary alerting mechanism. They define multiple burn rates per SLO, each checked over a long and a short window (fast: 14.4x over 1h and 5m; medium: 6x over 6h and 30m; slow: 1x over 3d and 6h), with the fast and medium rates paging and the slow rate opening a ticket. This approach reportedly reduced Google's false-positive page rate by over 90% compared to static threshold alerting. The methodology is documented in the SRE Workbook chapter on alerting.

PagerDuty

PagerDuty's own internal SRE team publishes their on-call practices: 1-week primary rotations with a shadow secondary, maximum 2 pages per shift as a health metric, mandatory runbooks for every alert, and quarterly on-call reviews where noisy alerts are deleted. They found that reducing page volume from 8/week to 2/week cut incident response time by 40% because engineers stayed fresh and focused.

Stripe

Stripe uses a severity system (SEV-0 through SEV-3) with automated routing. SEV-0 (complete outage) triggers an all-hands war room. SEV-1 pages the primary and secondary on-call plus the engineering manager. SEV-2 pages during business hours. SEV-3 creates a JIRA ticket. Stripe's observability team reviews every on-call shift and eliminates alerts that required no action.

Trade-Offs
Aspect | Description
Sensitivity vs. Noise | Tight alert thresholds catch issues faster but produce more false positives. Loose thresholds reduce noise but delay detection. Multi-window burn rate alerting balances both by requiring sustained impact before firing.
Automated Remediation vs. Human Judgment | Auto-remediation (restart pods, scale up, roll back) fixes known issues instantly but can cause harm if the diagnosis is wrong. Human-in-the-loop is safer but slower. Best practice: auto-remediate for well-understood issues (OOM restart) and page for novel failures.
Centralized vs. Team-Owned Alerting | Centralized alerting (platform team manages all rules) ensures consistency but creates a bottleneck. Team-owned alerting (each team manages their service's alerts) is faster but leads to inconsistent practices and configuration drift.
On-Call Breadth vs. Depth | Wide rotations (entire team on-call for all services) distribute burden but require broad knowledge. Narrow rotations (specialists per service) provide deep expertise but create single points of failure and burnout for small teams.
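The remediation trade-off above is often handled with an explicit allowlist: only failures with a known, safe fix are automated, and everything else goes to a human. A minimal sketch, assuming failures arrive as string signatures; `KNOWN_REMEDIATIONS`, `decide`, and the signature and action names are all hypothetical.

```python
# Hypothetical catalogue of failure signatures with a known-safe automated fix.
KNOWN_REMEDIATIONS = {
    "oom_killed": "restart_pod",
}


def decide(failure_signature: str) -> tuple[str, str]:
    """Return (mode, action): auto-remediate known failures, page on anything else."""
    action = KNOWN_REMEDIATIONS.get(failure_signature)
    if action is not None:
        return ("auto_remediate", action)
    return ("page", "notify_on_call")


print(decide("oom_killed"))             # ('auto_remediate', 'restart_pod')
print(decide("unknown_latency_spike"))  # ('page', 'notify_on_call')
```

Defaulting to a page keeps novel failures in front of a person, at the cost of slower response for anything not yet in the catalogue.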
Case Study

Honeycomb Eliminates 80% of Alerts by Switching to SLO-Based Alerting

Scenario

Honeycomb's engineering team was receiving 15-20 pages per week across their on-call rotation, with 70% requiring no action. Engineers were chronically fatigued and response times were degrading. The majority of false-positive pages originated from static threshold alerts on infrastructure metrics like CPU and memory utilization.

Solution

They migrated from static threshold alerts ('queue depth > 1000', 'CPU > 85%') to SLO-based burn rate alerting. They defined 12 SLOs covering their core data pipeline and query engine, with multi-window burn rates.

Outcome

Page volume dropped from 15-20/week to 3-4/week, with 95% of remaining pages requiring genuine intervention. MTTR improved by 50% because engineers trusted that pages were real and responded immediately instead of first checking whether the alert was noise.

Common Mistakes
  • ⚠ Alerting on every metric threshold: Hundreds of alerts fire during any incident (CPU high, memory high, queue growing, latency up, error rate up -- all symptoms of the same root cause), overwhelming the on-call engineer. Alert on the user-facing symptom (error rate, latency SLO burn) and use dependent/grouped alerts so that if a P1 fires, related P3s are suppressed -- one incident should produce one page.
  • ⚠ No runbooks attached to alerts: An alert fires at 3 AM and the on-call engineer has never seen it before, spending 30 minutes figuring out what it means before starting to debug. Every alert rule must include a runbook link containing what the alert means, how to verify it is real, most common causes, step-by-step triage, and escalation contacts.
  • ⚠ Sending alerts to shared Slack channels instead of paging: Critical alerts get buried in a noisy channel with 200 messages/day, and during an incident the alert sits unacknowledged for 45 minutes because everyone assumed someone else would handle it. Route P1/P2 alerts through PagerDuty or Opsgenie with explicit on-call assignment; Slack can mirror alerts for visibility but the primary notification must go to a named individual with acknowledgment tracking.
  • ⚠ Never tuning or deleting alerts: Alert rules accumulate over years, outdated alerts for decommissioned services fire regularly, and engineers stop trusting alerts because 80% are noise. Conduct quarterly alert reviews and for each alert ask 'Did this fire in the last 90 days? When it fired, was action required?' -- if the answer is 'no' to either, delete or reclassify it.
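The two review questions from the last item translate directly into a filter over alert history. The sketch below assumes you can export, per alert, the last fire date and how many fires needed action; `AlertHistory` and `review` are hypothetical names, and the sample data is invented.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class AlertHistory:
    name: str
    last_fired: Optional[date]  # None if the alert has never fired
    fires: int                  # times fired during the review period
    actionable_fires: int       # fires where the responder had to act


def review(alert: AlertHistory, today: date, lookback_days: int = 90) -> str:
    """Apply the two review questions and return 'keep' or 'delete_or_reclassify'."""
    cutoff = today - timedelta(days=lookback_days)
    fired_recently = alert.last_fired is not None and alert.last_fired >= cutoff
    required_action = alert.actionable_fires > 0
    if fired_recently and required_action:
        return "keep"
    return "delete_or_reclassify"


today = date(2024, 6, 30)
for alert in [
    AlertHistory("orders-slo-burn", date(2024, 6, 1), fires=3, actionable_fires=3),
    AlertHistory("cpu-high", date(2024, 6, 20), fires=12, actionable_fires=0),
    AlertHistory("legacy-disk", None, fires=0, actionable_fires=0),
]:
    print(alert.name, review(alert, today))
```

An alert that fires often but never needs action is flagged just like one that never fires, which matches the "no to either" rule.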
Test Your Understanding

1. Why is alerting on 'CPU > 90%' considered a bad practice?

2. What is the benefit of multi-window burn rate alerting?
