Alerting converts observability signals into actionable notifications. Effective alerting is symptom-based (alert on user impact, not internal metrics), respects severity tiers, and integrates with on-call rotation and incident management to minimize noise and maximize response speed.
Alerting is the bridge between observability data and human action. A well-designed alerting system detects user-facing problems quickly, routes them to the right person, and provides enough context for rapid diagnosis. A poorly designed system drowns engineers in noise, causing alert fatigue that leads to real incidents being ignored.
The fundamental principle of modern alerting is: alert on symptoms, not causes. A 'symptom' is something the user experiences: elevated error rate, high latency, data staleness. A 'cause' is an internal system state: high CPU, low disk space, connection pool exhaustion. Cause-based alerts are fragile -- CPU can spike during garbage collection (harmless) or a connection pool can empty during a deployment (transient). Symptom-based alerts fire only when users are actually affected.
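The distinction can be sketched in a few lines of Python. This is a minimal illustration, not a production rule engine: the `WindowStats` fields, the 1% error-rate limit, and the 500 ms latency limit are all assumptions chosen for the example. CPU is recorded for diagnosis but never decides whether to alert.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WindowStats:
    """Measurements for one evaluation window (hypothetical shape)."""
    requests: int
    errors: int
    p99_latency_ms: float
    cpu_percent: float  # a cause-level metric: kept as context, not alerted on


def symptom_alerts(stats: WindowStats,
                   max_error_rate: float = 0.01,   # assumed limit
                   max_p99_ms: float = 500.0) -> List[str]:
    """Return the user-facing symptoms that are currently out of bounds."""
    reasons = []
    if stats.requests > 0 and stats.errors / stats.requests > max_error_rate:
        reasons.append("error_rate")
    if stats.p99_latency_ms > max_p99_ms:
        reasons.append("latency")
    return reasons


# A garbage-collection spike: CPU is high, but users see nothing wrong.
gc_spike = WindowStats(requests=10_000, errors=1, p99_latency_ms=120.0, cpu_percent=97.0)
# A real outage: CPU looks healthy, but users see errors and slow responses.
outage = WindowStats(requests=10_000, errors=800, p99_latency_ms=900.0, cpu_percent=40.0)
```

With these inputs, `symptom_alerts(gc_spike)` returns an empty list while `symptom_alerts(outage)` reports both symptoms, which is the opposite of what a 'CPU > 90%' rule would do.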
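The Google SRE approach to alerting is based on error budget burn rate: how fast the service is consuming its error budget relative to the rate that would use up exactly the whole budget over the SLO window (a burn rate of 1x). Instead of alerting when the error rate exceeds a static threshold, you alert when the budget is being consumed fast enough to exhaust it well before the window ends. A fast burn (for example 14.4x, which consumes 2% of a 30-day budget in 1 hour) triggers an immediate page. A slow burn (for example 1x sustained over 3 days, which consumes 10% of the budget) triggers a ticket for next-business-day investigation. Multi-window alerting evaluates a long window (1h) and a short window (5m) together to reduce false positives: both must exceed the burn-rate threshold for the alert to fire, so the alert also stops firing soon after the problem is fixed.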
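The burn-rate calculation and the multi-window condition can be sketched as follows. The function names are illustrative, and the 30-day (720-hour) SLO period is an assumption; real systems usually express this as a query in their monitoring tool rather than in application code.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would last exactly the SLO period.
    """
    allowed_error_rate = 1.0 - slo_target
    return error_rate / allowed_error_rate


def budget_fraction_consumed(rate: float, window_hours: float,
                             slo_period_hours: float = 720.0) -> float:
    """Fraction of the whole budget used if `rate` is sustained for the window.

    Assumes a 30-day (720-hour) SLO period by default.
    """
    return rate * window_hours / slo_period_hours


def should_fire(long_window_error_rate: float,
                short_window_error_rate: float,
                slo_target: float,
                threshold: float) -> bool:
    """Multi-window condition: both windows must be at or above the threshold."""
    return (burn_rate(long_window_error_rate, slo_target) >= threshold
            and burn_rate(short_window_error_rate, slo_target) >= threshold)
```

For a 99.9% SLO, a sustained 2% error rate in both windows is a 20x burn and fires a 14.4x alert. If the short window has already recovered, or only the short window is elevated (a brief blip), the alert stays quiet.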
Severity classification determines routing. P1 (critical): user-facing outage, pages on-call immediately, requires response within 5 minutes. P2 (high): significant degradation, pages during business hours, 30-minute response. P3 (medium): minor impact, creates a ticket, next-business-day. P4 (low): informational, logged but no notification. Every alert must have exactly one severity level, and the classification must be automated based on the alert rule, not left to human judgment during an incident.
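One way to keep classification automated is to attach the severity to the alert rule itself and derive routing from a fixed table. The sketch below assumes the four tiers described above; the rule names, the `Route` type, and the action strings are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional, Tuple


class Severity(Enum):
    P1 = "critical"
    P2 = "high"
    P3 = "medium"
    P4 = "low"


@dataclass(frozen=True)
class Route:
    action: str                      # what the alerting system does
    response_minutes: Optional[int]  # None when there is no paging deadline


# Routing follows from severity alone, mirroring the tiers in the text.
ROUTES: Dict[Severity, Route] = {
    Severity.P1: Route("page_immediately", 5),
    Severity.P2: Route("page_business_hours", 30),
    Severity.P3: Route("create_ticket", None),
    Severity.P4: Route("log_only", None),
}

# Every rule carries exactly one severity (rule names are made up).
ALERT_RULES: Dict[str, Severity] = {
    "orders_api_fast_burn": Severity.P1,
    "orders_api_slow_burn": Severity.P3,
    "batch_job_late": Severity.P4,
}


def route_alert(rule_name: str) -> Tuple[Severity, Route]:
    """Look up severity and routing; raises KeyError for unclassified rules."""
    severity = ALERT_RULES[rule_name]
    return severity, ROUTES[severity]
```

Because an unclassified rule raises `KeyError`, a rule without a severity fails loudly at lookup time instead of leaving the decision to whoever is on call.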
On-call rotation distributes the interrupt burden across the team. Effective rotations are 1-week shifts with a primary and secondary responder, handoff documentation, and a maximum of 2 pages per shift (on average). If the team receives more than 2 pages per week, the alerting system is too noisy and must be tuned before the rotation becomes unsustainable.
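The page-volume limit can be checked mechanically from page history. The sketch below assumes that shifts line up with ISO calendar weeks and that each page is represented only by its date; both are simplifications for illustration.

```python
from collections import Counter
from datetime import date
from typing import Iterable, List, Tuple

MAX_PAGES_PER_SHIFT = 2  # the health threshold described above


def pages_per_week(page_dates: Iterable[date]) -> Counter:
    """Count pages per (ISO year, ISO week)."""
    counts: Counter = Counter()
    for d in page_dates:
        iso = d.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return counts


def noisy_weeks(page_dates: Iterable[date],
                max_pages: int = MAX_PAGES_PER_SHIFT) -> List[Tuple[int, int]]:
    """Return the weeks whose page count exceeds the limit, in order."""
    counts = pages_per_week(page_dates)
    return sorted(week for week, n in counts.items() if n > max_pages)


history = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 8)]
```

Here `noisy_weeks(history)` flags the first ISO week of 2024, which had three pages, and leaves the following week alone.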
Error Budget Burn Rate Alert
A service has a 99.9% monthly SLO (error budget: about 43 minutes of full downtime over 30 days). The alerting system monitors two windows: a 5-minute window at 14.4x burn rate (a rate that would exhaust the budget in about 2 days if sustained) and a 1-hour window at 6x burn rate (which would exhaust it in 5 days). At 2:00 AM, a bad deployment causes 5% errors, which is a 50x burn rate. Assuming steady traffic and negligible errors beforehand, the 5-minute window crosses 14.4x within about 90 seconds, and the 1-hour window crosses 6x after about 7 minutes, once the hour's average error rate passes 0.6%. Both windows exceeding their thresholds triggers a P1 page. The on-call engineer receives a page with: 'Orders API error budget burning above 14.4x. 5-minute error rate: 5%. Runbook: link.' They roll back the deployment within 8 minutes of the page, so the incident lasts about 15 minutes and consumes roughly 0.8 minutes of the 43-minute monthly budget.
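The arithmetic behind this example can be checked with a short script. It is a sketch under stated assumptions: a 30-day SLO period, constant traffic, a constant error rate during the incident, and no errors before it. The function names are illustrative.

```python
SLO_TARGET = 0.999
PERIOD_MINUTES = 30 * 24 * 60  # assumed 30-day SLO period


def error_budget_minutes(slo_target: float = SLO_TARGET,
                         period_minutes: int = PERIOD_MINUTES) -> float:
    """Budget expressed as minutes of full downtime."""
    return (1.0 - slo_target) * period_minutes


def minutes_until_window_fires(error_rate: float, window_minutes: float,
                               threshold: float,
                               slo_target: float = SLO_TARGET):
    """Minutes after incident start until the window's average burn rate
    reaches `threshold`. Returns None if it never would.

    Assumes the incident is shorter than the window when it fires and that
    the window contained no errors before the incident began.
    """
    needed_average = threshold * (1.0 - slo_target)
    if error_rate < needed_average:
        return None
    return window_minutes * needed_average / error_rate


def budget_minutes_consumed(error_rate: float, incident_minutes: float) -> float:
    """Downtime-equivalent minutes of budget used by a partial outage."""
    return error_rate * incident_minutes
```

Under these assumptions, a 5% error rate pushes the 5-minute window past 14.4x about 1.4 minutes in and the 1-hour window past 6x about 7.2 minutes in. Every 8 minutes spent at 5% errors costs 0.4 minutes of the roughly 43.2-minute budget.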
Google
Google's SRE teams use error budget burn rate as a primary alerting mechanism. The SRE Workbook chapter "Alerting on SLOs" recommends several burn rates per SLO, each checked over a long window and a shorter confirmation window: 14.4x over 1 hour (5-minute short window) and 6x over 6 hours (30-minute short window), both of which page, and 1x over 3 days (6-hour short window), which opens a ticket. This approach reportedly reduced Google's false-positive page rate by over 90% compared to static threshold alerting.
PagerDuty
PagerDuty's own internal SRE team publishes their on-call practices: 1-week primary rotations with a shadow secondary, maximum 2 pages per shift as a health metric, mandatory runbooks for every alert, and quarterly on-call reviews where noisy alerts are deleted. They found that reducing page volume from 8/week to 2/week cut incident response time by 40% because engineers stayed fresh and focused.
Stripe
Stripe uses a severity system (SEV-0 through SEV-3) with automated routing. SEV-0 (complete outage) triggers an all-hands war room. SEV-1 pages the primary and secondary on-call plus the engineering manager. SEV-2 pages during business hours. SEV-3 creates a JIRA ticket. Stripe's observability team reviews every on-call shift and eliminates alerts that required no action.
| Aspect | Description |
|---|---|
| Sensitivity vs. Noise | Tight alert thresholds catch issues faster but produce more false positives. Loose thresholds reduce noise but delay detection. Multi-window burn rate alerting balances both by requiring sustained impact before firing. |
| Automated Remediation vs. Human Judgment | Auto-remediation (restart pods, scale up, roll back) fixes known issues instantly but can cause harm if the diagnosis is wrong. Human-in-the-loop is safer but slower. Best practice: auto-remediate for well-understood issues (OOM restart) and page for novel failures. |
| Centralized vs. Team-Owned Alerting | Centralized alerting (platform team manages all rules) ensures consistency but creates a bottleneck. Team-owned alerting (each team manages their service's alerts) is faster but leads to inconsistent practices and configuration drift. |
| On-Call Breadth vs. Depth | Wide rotations (entire team on-call for all services) distribute burden but require broad knowledge. Narrow rotations (specialists per service) provide deep expertise but create single points of failure and burnout for small teams. |
Honeycomb Eliminates 80% of Alerts by Switching to SLO-Based Alerting
Scenario
Honeycomb's engineering team was receiving 15-20 pages per week across their on-call rotation, with 70% requiring no action. Engineers were chronically fatigued and response times were degrading. The majority of false-positive pages originated from static threshold alerts on infrastructure metrics like CPU and memory utilization.
Solution
They migrated from static threshold alerts ('queue depth > 1000', 'CPU > 85%') to SLO-based burn rate alerting. They defined 12 SLOs covering their core data pipeline and query engine, with multi-window burn rates.
Outcome
Page volume dropped from 15-20/week to 3-4/week, with 95% of remaining pages requiring genuine intervention. MTTR improved by 50% because engineers trusted that pages were real and responded immediately instead of first checking whether the alert was noise.