Observability

SLOs, SLIs & SLAs

Service Level Indicators (SLIs) measure system behavior, Service Level Objectives (SLOs) set targets for those indicators, and Service Level Agreements (SLAs) are contractual commitments with consequences. Together they form the reliability contract between a service and its users.

Overview

The SLI/SLO/SLA framework, codified by Google's SRE practices, provides a principled way to define, measure, and enforce reliability targets. Before SLOs, reliability was either 'best effort' (no targets, no accountability) or '100% uptime' (unrealistic, stifling innovation). SLOs introduce a middle ground: a measurable reliability target that balances user happiness with engineering velocity.

A Service Level Indicator (SLI) is a quantitative measure of a service's behavior from the user's perspective. Good SLIs measure what users care about: request latency, error rate, throughput, and data freshness. They are expressed as ratios: 'the proportion of valid requests that completed successfully in less than 300ms.' The denominator is important -- it should include only requests the service was expected to handle (exclude health checks, internal probes).
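As a minimal sketch of the ratio described above, the snippet below computes a combined availability-and-latency SLI from a list of request records. The `Request` type, the probe paths, and the choice to treat any non-5xx response as successful are assumptions made for illustration; they are not prescribed by the text.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Request:
    path: str
    status: int
    latency_ms: float


# Hypothetical probe endpoints that should not count toward the denominator.
PROBE_PATHS = {"/healthz", "/internal/probe"}


def latency_availability_sli(
    requests: Iterable[Request], threshold_ms: float = 300.0
) -> Optional[float]:
    """Proportion of valid requests that succeeded in under threshold_ms."""
    # Denominator: only requests the service was expected to handle.
    valid = [r for r in requests if r.path not in PROBE_PATHS]
    if not valid:
        return None  # no valid traffic, so the SLI is undefined
    # Numerator: assumed here to mean "not a 5xx" and "faster than the threshold".
    good = sum(1 for r in valid if r.status < 500 and r.latency_ms < threshold_ms)
    return good / len(valid)


sample = [
    Request("/orders", 200, 120.0),
    Request("/orders", 200, 450.0),  # too slow
    Request("/orders", 503, 80.0),   # server error
    Request("/orders", 201, 95.0),
    Request("/healthz", 200, 2.0),   # excluded from the denominator
]
sli = latency_availability_sli(sample)
print(sli)
```

Two of the four valid requests are good here, so the SLI is 0.5. The health check does not inflate the result because it never enters the denominator.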

A Service Level Objective (SLO) is a target value for an SLI over a compliance window. '99.9% of requests complete in < 300ms measured over a rolling 30-day window' is an SLO. The SLO implies an error budget: 1 - 0.999 = 0.001, meaning the service can fail 0.1% of requests before the SLO is violated. At 1M requests/day, a 30-day window contains 30M requests, so the budget is 30,000 failed requests. The error budget is the key concept -- it makes reliability a resource that can be spent, not an absolute requirement.
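The request-based budget follows directly from the SLO target and the traffic volume. The sketch below assumes a fixed daily request count; the function name `error_budget_requests` is illustrative.

```python
def error_budget_requests(
    slo_target: float, requests_per_day: int, window_days: int = 30
) -> int:
    """Number of requests that may fail in the window without violating the SLO."""
    total_requests = requests_per_day * window_days
    # round() absorbs floating-point noise in (1 - slo_target).
    return round((1 - slo_target) * total_requests)


for target in (0.99, 0.999, 0.9999):
    allowed = error_budget_requests(target, 1_000_000)
    print(f"{target:.2%} SLO -> {allowed} failed requests allowed per 30 days")
```

Each additional nine cuts the budget by a factor of ten, which is why tightening an SLO is a much bigger commitment than the small change in the percentage suggests.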

A Service Level Agreement (SLA) is a contractual commitment to meet an SLO, with defined consequences (credits, refunds, contract termination) if breached. SLAs are typically less aggressive than internal SLOs: if your internal SLO is 99.95%, your SLA might be 99.9%, providing a buffer. Not every service needs an SLA -- internal services often have SLOs without contractual backing.

The error budget model transforms the reliability conversation. When the error budget is healthy (say, 80% remaining halfway through the month), the team has earned the right to take risks: deploy faster, run experiments, migrate infrastructure. When the budget is nearly exhausted, the team shifts to reliability mode: slower deployments, more testing, incident reviews. This creates a natural feedback loop that aligns incentives across product, engineering, and operations teams.
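One way to make this feedback loop explicit is a small policy function. The sketch below is an assumption-laden illustration: the 20% freeze threshold and the intermediate "caution" tier are hypothetical choices, and a real policy would be agreed with product and engineering leadership.

```python
from enum import Enum


class Mode(Enum):
    SHIP = "ship"        # budget is healthy: take risks
    CAUTION = "caution"  # budget is being spent faster than time is passing
    FREEZE = "freeze"    # budget nearly exhausted: reliability work only


def deployment_mode(
    budget_remaining: float, window_elapsed: float, freeze_below: float = 0.20
) -> Mode:
    """Both arguments are fractions in [0, 1]."""
    if budget_remaining < freeze_below:
        return Mode.FREEZE
    # If less budget remains than time remains, spending is ahead of schedule.
    if budget_remaining < 1 - window_elapsed:
        return Mode.CAUTION
    return Mode.SHIP


print(deployment_mode(budget_remaining=0.80, window_elapsed=0.5))
print(deployment_mode(budget_remaining=0.30, window_elapsed=0.5))
print(deployment_mode(budget_remaining=0.10, window_elapsed=0.5))
```

With 80% of the budget left halfway through the window the function returns `Mode.SHIP`; with 30% left it returns `Mode.CAUTION`, and with 10% left it returns `Mode.FREEZE`.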

Key Points
  • SLIs measure user-facing behavior, not system internals. 'CPU utilization' is NOT an SLI. 'Proportion of requests returning 2xx in < 300ms' IS an SLI. Always measure at the boundary closest to the user (load balancer, CDN edge).
  • Error budget = 1 - SLO target. A 99.9% SLO gives you a 0.1% error budget = 43.2 minutes of total downtime per 30 days, or 1,000 failed requests per million. This budget is spent on deployments, experiments, migrations, and incidents.
  • SLOs should be set based on user happiness, not system capability. If users tolerate 500ms latency but your system can do 50ms, set the SLO at 300ms (with headroom), not 50ms. Over-aggressive SLOs burn error budget on irrelevant optimization.
  • Use multiple SLIs per service: availability (success rate), latency (p50, p99), freshness and correctness (is the data up to date and right), and throughput. A service can be fast but returning stale data -- a single SLI would miss this.
  • Compliance windows (typically 28 or 30 days rolling) smooth out transient blips. A 5-minute full outage on a 99.9% SLO consumes about 11.6% of the 43.2-minute monthly error budget -- significant but not catastrophic.
  • SLAs should be less aggressive than internal SLOs. If your SLO is 99.95%, your SLA should be 99.9% or 99.5%. The gap is your safety margin for unexpected incidents.
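The time-based view of the error budget can be checked with a few lines of arithmetic. This is a sketch that treats the budget as minutes of total downtime in a fixed-length window; the function names are illustrative.

```python
def downtime_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows within the window."""
    return (1 - slo_target) * window_days * 24 * 60


def budget_fraction_consumed(
    outage_minutes: float, slo_target: float, window_days: int = 30
) -> float:
    """Share of the downtime budget used by a full outage of the given length."""
    return outage_minutes / downtime_budget_minutes(slo_target, window_days)


for target in (0.999, 0.9995, 0.9999):
    minutes = downtime_budget_minutes(target)
    print(f"{target:.2%} SLO -> {minutes:.1f} minutes per 30 days")

share = budget_fraction_consumed(outage_minutes=5, slo_target=0.999)
print(f"5-minute outage at 99.9%: {share:.1%} of the budget")
```

A 5-minute full outage uses 5 / 43.2 of the budget at 99.9%, which is a little under 12%. At 99.99% the same outage would exceed the entire 4.3-minute budget.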
Simple Example

An API Team's SLO Dashboard

The orders API team defines: SLI = proportion of non-5xx responses with latency < 500ms. SLO = 99.9% over 30 days. At 1M requests/day (30M requests per window), the error budget is 0.1% = 30,000 bad requests. On day 15, a bad deployment causes 5 minutes of 50% errors. Assuming evenly spread traffic (about 694 requests/minute), that burns roughly 1,736 of the 30,000 budget (5.8%). The SLO dashboard shows 94.2% budget remaining with 15 days left -- healthy. The team proceeds with planned feature work. If instead a major outage burned 80% of the budget, the team would freeze feature deployments and focus on reliability.
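The incident arithmetic for this scenario can be reproduced as follows. The sketch assumes traffic is spread evenly across the day, which real services rarely match; an incident at peak hours would burn more.

```python
REQUESTS_PER_DAY = 1_000_000
WINDOW_DAYS = 30
SLO_TARGET = 0.999


def incident_failed_requests(
    requests_per_day: int, duration_minutes: float, error_ratio: float
) -> float:
    """Failed requests during an incident, assuming uniform traffic."""
    requests_per_minute = requests_per_day / (24 * 60)
    return requests_per_minute * duration_minutes * error_ratio


# Total failed requests the SLO tolerates across the whole window.
budget = round((1 - SLO_TARGET) * REQUESTS_PER_DAY * WINDOW_DAYS)

# The bad deployment: 5 minutes with half of all requests failing.
failed = incident_failed_requests(REQUESTS_PER_DAY, duration_minutes=5, error_ratio=0.5)

burned_pct = 100 * failed / budget
remaining_pct = 100 - burned_pct

print(f"budget: {budget} requests")
print(f"incident burned: {failed:.0f} requests ({burned_pct:.1f}%)")
print(f"remaining: {remaining_pct:.1f}%")
```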

Real-World Examples

Google

Google pioneered the SLO/error-budget model in their SRE organization. Each Google service has defined SLOs, and the error budget determines whether the team is allowed to push new features or must focus on reliability. Google's public SLA for Cloud services (e.g., 99.95% for Compute Engine) is deliberately less aggressive than internal SLOs, providing a contractual buffer.

Slack

Slack publishes a 99.99% availability SLA for Enterprise Grid customers. Internally, their SLO is higher. When an incident burns error budget, a formal review determines whether to slow down feature deployments. Slack's status page (status.slack.com) reports SLI data in real time, building trust by making reliability transparent.

Amazon

AWS services publish SLAs with service credit penalties. For example, S3's SLA guarantees 99.9% availability per month -- if availability drops below 99.0%, customers receive a 25% service credit. These SLA tiers (99.9%, 99.0%, <99.0%) with escalating credits create a clear financial incentive for AWS to meet targets.

Trade-Offs
Aspect | Description
Aggressive SLO vs. Engineering Velocity | A 99.99% SLO gives only 4.3 minutes of downtime/month. Every deployment risks burning the budget, so teams deploy less frequently. A 99.9% SLO gives 43 minutes -- 10x more room for experiments and fast iteration.
User-Measured vs. Server-Measured SLIs | Measuring SLIs at the user (browser, mobile app) captures the true experience including CDN, DNS, and client-side rendering. But client-side measurement is noisy, hard to aggregate, and introduces a reporting lag. Server-side measurement is cleaner but misses last-mile issues.
Per-Customer vs. Aggregate SLOs | Aggregate SLOs can hide that one large customer is getting 95% availability while everyone else gets 99.99%. Per-customer SLOs are fairer but dramatically increase operational complexity (thousands of independent error budgets to track).
Contractual SLA vs. No SLA | SLAs create legal accountability and build customer trust, but they also create financial risk and limit operational flexibility. Internal services typically have SLOs without SLAs to maintain accountability without legal overhead.
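The per-customer versus aggregate trade-off is easy to see with numbers. The traffic figures and customer names below are invented for illustration: one customer sits at 95% availability while the rest see 99.99%, yet the aggregate still clears a 99.9% SLO.

```python
SLO_TARGET = 0.999

# Hypothetical (good_requests, total_requests) per customer for one window.
traffic = {
    "customer-a": (95_000, 100_000),          # 95% availability
    "everyone-else": (9_899_010, 9_900_000),  # 99.99% availability
}

total_good = sum(good for good, _ in traffic.values())
total_requests = sum(total for _, total in traffic.values())
aggregate = total_good / total_requests

per_customer = {name: good / total for name, (good, total) in traffic.items()}
violators = [name for name, sli in per_customer.items() if sli < SLO_TARGET]

print(f"aggregate availability: {aggregate:.4%}")
print(f"aggregate meets SLO: {aggregate >= SLO_TARGET}")
print(f"customers below SLO: {violators}")
```

The aggregate passes because the affected customer contributes only 1% of the traffic in this made-up data set. Tracking per-customer ratios surfaces the problem, at the cost of one error budget per customer.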
Case Study

How Datadog Uses Error Budgets to Balance Shipping Speed and Reliability

Scenario

Datadog's internal platform team manages SLOs for their metrics ingestion pipeline, which processes trillions of data points per day. They set a 99.95% availability SLO with a 30-day rolling window, giving them ~21.6 minutes of error budget per month. During a quarter where they were migrating storage backends, they consumed 60% of their error budget in the first two weeks due to migration-related latency spikes.
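To see why consuming 60% of the budget in two weeks is alarming, it helps to compare the pace of spending with the pace of time. The sketch below is a simplification that assumes a fixed 30-day window and a constant rate of consumption; a rolling window behaves differently as old errors age out. The function names are illustrative.

```python
from typing import Optional


def burn_rate(budget_consumed: float, days_elapsed: float, window_days: int = 30) -> float:
    """Ratio of budget spent to time spent; above 1.0 means overspending."""
    return budget_consumed / (days_elapsed / window_days)


def projected_exhaustion_day(budget_consumed: float, days_elapsed: float) -> Optional[float]:
    """Day on which the budget runs out if the current pace continues."""
    if budget_consumed <= 0:
        return None  # nothing spent yet, so no projection
    return days_elapsed / budget_consumed


rate = burn_rate(budget_consumed=0.60, days_elapsed=14)
exhaustion = projected_exhaustion_day(budget_consumed=0.60, days_elapsed=14)

print(f"burn rate: {rate:.2f}x")
print(f"budget exhausted around day {exhaustion:.1f} of 30")
```

At this pace the budget would run out roughly a week before the window ends, which is the kind of signal an error budget policy can act on.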

Solution

The error budget policy automatically triggered a 'reliability sprint' -- all feature work was paused, and the team focused on stabilizing the migration path with better rollback automation and canary deployments.

Outcome

After two weeks of reliability work, the budget recovered, and feature development resumed. The team credits the error budget model with preventing a major outage: without the automatic trigger, they would have continued the aggressive migration timeline and likely exhausted the budget entirely.

Common Mistakes
  • Setting SLOs at 100%: A 100% SLO means zero error budget, so any single failed request violates the SLO and teams become paralyzed -- unable to deploy, migrate, or experiment. Set SLOs based on user tolerance, not perfection; start with 99.9% (43 min/month budget) and tighten only if users report dissatisfaction.
  • Measuring SLIs on server-side metrics that don't reflect user experience: CPU at 30% and all health checks passing, but users see timeouts because a downstream dependency is slow, so the SLO dashboard shows green while users experience outages. Measure SLIs at the edge (load balancer or CDN) to capture the full request lifecycle from the user's perspective.
  • No error budget policy: The team defines SLOs but has no consequences when the budget is exhausted, so feature deployments continue during budget-critical periods leading to SLO violations. Define an explicit error budget policy ('when error budget falls below 20%, freeze non-critical deployments and prioritize reliability work') with product and engineering leadership sign-off.
  • Using SLOs as performance goals instead of reliability floors: Teams optimize to exceed the SLO by orders of magnitude (e.g., achieving 99.999% when the SLO is 99.9%, a 100x lower error rate), wasting engineering effort on reliability that users don't notice. SLOs are floors, not ceilings -- if you consistently exceed your SLO by a wide margin, tighten it or redirect the excess engineering effort to feature work.
Test Your Understanding

1. What does an error budget of 0.1% over a 30-day window mean in practice?

2. Why should SLAs typically be less aggressive than internal SLOs?
