What does an error budget of 0.1% over a 30-day window mean in practice?
Service Level Indicators (SLIs) measure system behavior, Service Level Objectives (SLOs) set targets for those indicators, and Service Level Agreements (SLAs) are contractual commitments with consequences. Together they form the reliability contract between a service and its users.
The SLI/SLO/SLA framework, codified by Google's SRE practices, provides a principled way to define, measure, and enforce reliability targets. Before SLOs, reliability was either 'best effort' (no targets, no accountability) or '100% uptime' (unrealistic, stifling innovation). SLOs introduce a middle ground: a measurable reliability target that balances user happiness with engineering velocity.
A Service Level Indicator (SLI) is a quantitative measure of a service's behavior from the user's perspective. Good SLIs measure what users care about: request latency, error rate, throughput, and data freshness. They are expressed as ratios: 'the proportion of valid requests that completed successfully in less than 300ms.' The denominator is important -- it should include only requests the service was expected to handle (exclude health checks, internal probes).
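As a rough illustration of the ratio described above, the following sketch computes a latency SLI from a list of request records. The `Request` type, the probe paths, and the choice to treat any non-5xx status as "successful" are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    path: str
    status: int
    latency_ms: float


# Hypothetical probe paths; a real service would define its own exclusions.
PROBE_PATHS = {"/healthz", "/internal/probe"}


def latency_sli(requests: Iterable[Request], threshold_ms: float = 300.0) -> Optional[float]:
    """Proportion of valid requests that succeeded in under threshold_ms."""
    # Denominator: only requests the service was expected to handle.
    valid = [r for r in requests if r.path not in PROBE_PATHS]
    if not valid:
        return None  # the SLI is undefined when there is no valid traffic
    # Numerator: assumed here to mean non-5xx and faster than the threshold.
    good = sum(1 for r in valid if r.status < 500 and r.latency_ms < threshold_ms)
    return good / len(valid)


sample = [
    Request("/orders", 200, 120.0),
    Request("/orders", 200, 450.0),  # too slow
    Request("/orders", 503, 80.0),   # server error
    Request("/orders", 201, 90.0),
    Request("/healthz", 200, 5.0),   # excluded from the denominator
]
print(latency_sli(sample))  # 0.5
```

Two of the four valid requests are both successful and fast, so the SLI is 0.5. The health check does not count either way.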
A Service Level Objective (SLO) is a target value for an SLI over a compliance window. '99.9% of requests complete in < 300ms measured over a rolling 30-day window' is an SLO. The SLO implies an error budget: 1 - 0.999 = 0.001, meaning the service can fail 0.1% of requests (30,000 of the 30M requests served in a 30-day window at 1M requests/day) before the SLO is violated. The error budget is the key concept -- it makes reliability a resource that can be spent, not an absolute requirement.
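The arithmetic behind a request-based error budget can be sketched in a few lines. The function name and the assumption of constant daily traffic are illustrative only.

```python
def error_budget_requests(slo: float, requests_per_day: int, window_days: int = 30) -> int:
    """Number of requests that may fail in the window before the SLO is missed."""
    total_requests = requests_per_day * window_days
    # round() absorbs floating-point noise in (1 - slo).
    return round((1 - slo) * total_requests)


# 99.9% SLO, 1M requests/day, 30-day window -> 30M requests in total.
print(error_budget_requests(0.999, 1_000_000))  # 30000
```

Under these assumptions, a 0.1% budget over 30M requests allows 30,000 failed requests in the window.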
A Service Level Agreement (SLA) is a contractual commitment to meet an SLO, with defined consequences (credits, refunds, contract termination) if breached. SLAs are typically less aggressive than internal SLOs: if your internal SLO is 99.95%, your SLA might be 99.9%, providing a buffer. Not every service needs an SLA -- internal services often have SLOs without contractual backing.
The error budget model transforms the reliability conversation. When the error budget is healthy (say, 80% remaining halfway through the month), the team has earned the right to take risks: deploy faster, run experiments, migrate infrastructure. When the budget is nearly exhausted, the team shifts to reliability mode: slower deployments, more testing, incident reviews. This creates a natural feedback loop that aligns incentives across product, engineering, and operations teams.
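One way to picture this feedback loop is as a small policy function. The sketch below is hypothetical: the 10% freeze threshold, the intermediate "caution" mode, and the mode names are assumptions chosen for illustration. A real error budget policy is agreed between teams, not derived from a formula.

```python
def deployment_mode(budget_remaining: float, window_elapsed: float,
                    freeze_below: float = 0.10) -> str:
    """Pick a working mode from the remaining budget.

    Both arguments are fractions in [0, 1]. The thresholds are illustrative.
    """
    if budget_remaining <= freeze_below:
        return "reliability"  # budget nearly exhausted: slow down, stabilize
    if budget_remaining < 1 - window_elapsed:
        return "caution"      # budget is being spent faster than time passes
    return "normal"           # healthy budget: ship features, run experiments


print(deployment_mode(0.80, 0.5))  # normal
print(deployment_mode(0.30, 0.5))  # caution
print(deployment_mode(0.05, 0.5))  # reliability
```

The first call matches the example in the text: 80% of the budget remaining halfway through the window counts as healthy.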
An API Team's SLO Dashboard
The orders API team defines: SLI = proportion of non-5xx responses with latency < 500ms. SLO = 99.9% over 30 days. Error budget = 0.1% of the 30M requests served in the window at 1M requests/day = 30,000 bad requests. On day 15, a bad deployment causes 5 minutes of 50% errors. At roughly 694 requests per minute, that burns about 1,736 of the 30,000 budget (5.8%). The SLO dashboard shows 94.2% budget remaining with 15 days left -- healthy. The team proceeds with planned feature work. If instead a major outage burned 80% of the budget, the team would freeze feature deployments and focus on reliability.
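The incident in this example can be checked with a short calculation. It assumes traffic is spread evenly across the day, which real services rarely satisfy, and the function name is made up for this sketch.

```python
def incident_bad_requests(requests_per_day: int, minutes: float, error_ratio: float) -> int:
    """Failed requests during an incident, assuming uniform traffic."""
    requests_per_minute = requests_per_day / (24 * 60)
    return round(requests_per_minute * minutes * error_ratio)


REQUESTS_PER_DAY = 1_000_000
budget = round((1 - 0.999) * REQUESTS_PER_DAY * 30)            # 30,000 requests
bad = incident_bad_requests(REQUESTS_PER_DAY, minutes=5, error_ratio=0.5)
burned = bad / budget

print(bad)                  # 1736
print(f"{burned:.1%}")      # 5.8%
print(f"{1 - burned:.1%}")  # 94.2%
```

A five-minute incident at a 50% error rate costs under 6% of the monthly budget, which is why the team can continue feature work.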
Google
Google pioneered the SLO/error-budget model in their SRE organization. Each Google service has defined SLOs, and the error budget determines whether the team is allowed to push new features or must focus on reliability. Google's public SLA for Cloud services (e.g., 99.95% for Compute Engine) is deliberately less aggressive than internal SLOs, providing a contractual buffer.
Slack
Slack publishes a 99.99% availability SLA for Enterprise Grid customers. Internally, their SLO is higher. When an incident burns error budget, a formal review determines whether to slow down feature deployments. Slack's status page (status.slack.com) reports SLI data in real time, building trust by making reliability transparent.
Amazon
AWS services publish SLAs with service credit penalties. For example, S3's SLA commits to 99.9% availability per month: customers become eligible for service credits once availability falls below 99.9%, and the credit rises to 25% if availability drops below 99.0%. These tiers, with credits that escalate as availability falls, create a clear financial incentive for AWS to meet targets.
| Aspect | Description |
|---|---|
| Aggressive SLO vs. Engineering Velocity | A 99.99% SLO gives only 4.3 minutes of downtime/month. Every deployment risks burning the budget, so teams deploy less frequently. A 99.9% SLO gives 43 minutes -- 10x more room for experiments and fast iteration. |
| User-Measured vs. Server-Measured SLIs | Measuring SLIs at the user (browser, mobile app) captures the true experience including CDN, DNS, and client-side rendering. But client-side measurement is noisy, hard to aggregate, and introduces a reporting lag. Server-side measurement is cleaner but misses last-mile issues. |
| Per-Customer vs. Aggregate SLOs | Aggregate SLOs can hide that one large customer is getting 95% availability while everyone else gets 99.99%. Per-customer SLOs are fairer but dramatically increase operational complexity (thousands of independent error budgets to track). |
| Contractual SLA vs. No SLA | SLAs create legal accountability and build customer trust, but they also create financial risk and limit operational flexibility. Internal services typically have SLOs without SLAs to maintain accountability without legal overhead. |
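The per-customer versus aggregate trade-off in the table is easy to see with numbers. The traffic figures and customer names below are invented for illustration. They show how an aggregate SLI can meet a 99.9% target while one customer sees only 95%.

```python
SLO = 0.999

# Hypothetical (total_requests, good_requests) per customer over the window.
traffic = {
    "customer-a": (10_000, 9_500),         # 95% availability
    "everyone-else": (1_000_000, 999_900),  # 99.99% availability
}

total = sum(t for t, _ in traffic.values())
good = sum(g for _, g in traffic.values())
aggregate = good / total

# Customers whose own availability misses the target.
breaching = [name for name, (t, g) in traffic.items() if g / t < SLO]

print(f"aggregate: {aggregate:.4%}")  # aggregate: 99.9406%
print(aggregate >= SLO)               # True
print(breaching)                      # ['customer-a']
```

The aggregate view reports a healthy service, while a per-customer view flags `customer-a`. Tracking the second view means one error budget per customer, which is where the operational cost comes from.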
How Datadog Uses Error Budgets to Balance Shipping Speed and Reliability
Scenario
Datadog's internal platform team manages SLOs for their metrics ingestion pipeline, which processes trillions of data points per day. They set a 99.95% availability SLO with a 30-day rolling window, giving them ~21.6 minutes of error budget per month. During a quarter where they were migrating storage backends, they consumed 60% of their error budget in the first two weeks due to migration-related latency spikes.
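The numbers in this scenario can be reproduced with a time-based budget and a simple burn-rate calculation. Burn rate here means the fraction of budget consumed divided by the fraction of the window elapsed. The linear projection of when the budget runs out is an assumption of this sketch, not something stated in the scenario.

```python
def budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability in the window."""
    return (1 - slo) * window_days * 24 * 60


def burn_rate(budget_consumed: float, days_elapsed: float, window_days: int = 30) -> float:
    """Values above 1.0 mean the budget will run out before the window ends."""
    return budget_consumed / (days_elapsed / window_days)


minutes = budget_minutes(0.9995)
rate = burn_rate(0.60, days_elapsed=14)
# Assuming the burn continues at the same pace, the day the budget hits zero:
exhaustion_day = 14 / 0.60

print(round(minutes, 1))         # 21.6
print(round(rate, 2))            # 1.29
print(round(exhaustion_day, 1))  # 23.3
```

A burn rate of about 1.29 means that, without intervention, the budget would be gone around day 23 of the 30-day window.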
Solution
The error budget policy automatically triggered a 'reliability sprint' -- all feature work was paused, and the team focused on stabilizing the migration path with better rollback automation and canary deployments.
Outcome
After two weeks of reliability work, the budget recovered, and feature development resumed. The team credits the error budget model with preventing a major outage: without the automatic trigger, they would have continued the aggressive migration timeline and likely exhausted the budget entirely.