SLI, SLO, SLA: Measuring and Committing to Reliability
Your team deploys a change. The API error rate spikes to 2% for 10 minutes. Is that acceptable? Without defined reliability targets, you cannot answer that question. You argue about whether it was “bad enough” to roll back. You have no data to guide the decision.
SLOs give you that data. They define what “good enough” means before the incident happens. Error budgets tell you how much unreliability you can afford to spend on shipping new features. The framework turns reliability from a vague aspiration into a measurable, manageable property.
SLI: Service Level Indicator
An SLI is a quantitative measure of some aspect of the service’s behavior. It is the metric you use to assess reliability.
Good SLIs measure what users experience:
- Request success rate (percentage of requests that return a non-5xx response)
- Request latency (percentage of requests that complete within N milliseconds)
- Availability (percentage of time the service is serving requests)
- Data freshness (percentage of data that is less than N seconds old)
Bad SLIs measure internal system state:
- CPU utilization (users do not care about CPU)
- Memory usage (users do not care about memory)
- Number of servers running (users do not care about infrastructure)
The test: would a user notice if this metric changed? If yes, it is a good SLI candidate.
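As a minimal sketch of the user-facing SLIs listed above, the following computes a success rate and a latency SLI from a list of request records. The `Request` type, its field names, and the sample data are assumptions made for illustration; in a real system these numbers would come from your metrics pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time to complete the request, in milliseconds


def success_rate(requests: list[Request]) -> float:
    """Percentage of requests that returned a non-5xx response."""
    if not requests:
        raise ValueError("no requests to measure")
    good = sum(1 for r in requests if r.status < 500)
    return 100.0 * good / len(requests)


def latency_sli(requests: list[Request], threshold_ms: float) -> float:
    """Percentage of requests that completed within threshold_ms."""
    if not requests:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100.0 * fast / len(requests)


# Hypothetical sample: one 5xx failure and one slow (but successful) request.
sample = [
    Request(200, 120.0),
    Request(200, 80.0),
    Request(503, 30.0),
    Request(404, 250.0),
]
print(success_rate(sample))        # 3 of 4 requests are non-5xx
print(latency_sli(sample, 200.0))  # 3 of 4 requests finish within 200 ms
```

Note that the 404 counts as a success here, following the non-5xx definition above; whether client errors should count is a choice each team makes for its own service.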
SLO: Service Level Objective
An SLO is a target value for an SLI. It defines what “good” looks like.
Examples:
- 99.9% of requests succeed (success rate SLO)
- 95% of requests complete within 200ms (latency SLO)
- Service is available 99.95% of the time (availability SLO)
- 99% of data is less than 10 seconds old (freshness SLO)
SLOs are internal targets. They are not commitments to customers (that is the SLA). They are the targets your team commits to maintaining.
Setting SLOs:
- Start with what users actually need, not what you think you can achieve
- Look at historical data: what has your service actually achieved?
- Set the SLO slightly below your historical performance (leave room for incidents)
- Make SLOs achievable but not trivial
graph TB
    subgraph hierarchy["SLI, SLO, SLA Hierarchy"]
        SLI2["SLI What you measure Request success rate: 99.95%"]
        SLO2["SLO What you target Success rate >= 99.9%"]
        SLA2["SLA What you promise Success rate >= 99.5% (with consequences)"]
        SLI2 -->|"measured against"| SLO2
        SLO2 -->|"basis for"| SLA2
    end
    style SLI2 fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style SLO2 fill:#E1F5EE,stroke:#0F6E56,color:#085041
    style SLA2 fill:#FAEEDA,stroke:#854F0B,color:#633806
SLA: Service Level Agreement
An SLA is a contractual commitment to customers. It specifies what happens if the committed level is not met (service credits, refunds, contract termination).
Key differences from SLO:
- SLOs are internal targets. SLAs are external commitments.
- SLAs are typically less strict than SLOs (you need buffer between your target and your commitment)
- SLAs have consequences for violation. SLOs do not (directly).
Example:
- SLO: 99.9% availability (internal target)
- SLA: 99.5% availability (customer commitment, with service credits if violated)
The gap between SLO and SLA is your safety margin. If you are at 99.7% availability, you are below your SLO but above your SLA. You need to improve, but you have not violated your customer commitment.
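The safety margin can be expressed as a three-way check. This is only a sketch: the function name and the status strings are hypothetical, and the default thresholds are the example values from this section.

```python
def reliability_status(measured_percent: float,
                       slo_percent: float = 99.9,
                       sla_percent: float = 99.5) -> str:
    """Classify a measured availability against the internal SLO and the external SLA."""
    if measured_percent >= slo_percent:
        return "meeting SLO"
    if measured_percent >= sla_percent:
        # Inside the safety margin: the team must improve, but no contract is breached.
        return "below SLO, within SLA"
    return "SLA violated"


print(reliability_status(99.7))
```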
Error budgets
An error budget is the amount of unreliability you are allowed to have while still meeting your SLO.
Calculation:
- SLO: 99.9% success rate
- Error budget: 100% - 99.9% = 0.1% of requests can fail
- In a month with 1 billion requests: 1 million requests can fail
The error budget is a shared resource between reliability and velocity. Every incident consumes error budget. Every deployment that causes errors consumes error budget.
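The calculation above is simple enough to sketch directly. The function names are illustrative, and the 30-day window used for the downtime figure is an assumption.

```python
def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Number of requests that may fail in the window while still meeting the SLO."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    # round() absorbs floating-point noise in the subtraction above.
    return round(total_requests * budget_fraction)


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """The same budget expressed as minutes of full downtime in the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window_days * 24 * 60 * budget_fraction


print(allowed_failures(99.9, 1_000_000_000))      # failed requests allowed per billion
print(round(allowed_downtime_minutes(99.9), 1))   # minutes of downtime in 30 days
```

Expressing the budget both ways is useful because request-based SLIs and time-based availability SLIs are discussed in different units.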
Using error budgets:
- If you have error budget remaining: you can take risks (deploy new features, run experiments)
- If you have consumed your error budget: freeze deployments, focus on reliability
This creates a natural incentive alignment: the team that wants to ship features also wants to maintain reliability, because burning the error budget means no more deployments.
graph LR
    subgraph budget["Error Budget Usage"]
        TOTAL["Monthly error budget 0.1% of requests = 1M requests"]
        INC1["Incident 1 200K requests failed 20% of budget"]
        INC2["Incident 2 500K requests failed 50% of budget"]
        DEPLOY["Deployment 100K requests failed 10% of budget"]
        REMAIN["Remaining budget 200K requests 20% of budget"]
        TOTAL --> INC1
        TOTAL --> INC2
        TOTAL --> DEPLOY
        TOTAL --> REMAIN
    end
    style TOTAL fill:#EEEDFE,stroke:#534AB7,color:#3C3489
    style INC1 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
    style INC2 fill:#FCEBEB,stroke:#A32D2D,color:#791F1F
    style DEPLOY fill:#FAEEDA,stroke:#854F0B,color:#633806
    style REMAIN fill:#E1F5EE,stroke:#0F6E56,color:#085041
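The figures in the diagram can be reproduced with a small ledger. This is a sketch under the diagram's numbers; `budget_report` and `can_deploy` are hypothetical helpers, and the freeze rule is the simple "budget exhausted means no deployments" policy described above.

```python
def budget_report(total_budget: int, consumed: dict[str, int]) -> dict[str, float]:
    """Return each event's share of the budget, plus what remains, in percent."""
    report = {name: 100.0 * failed / total_budget for name, failed in consumed.items()}
    remaining = total_budget - sum(consumed.values())
    report["remaining"] = 100.0 * remaining / total_budget
    return report


def can_deploy(total_budget: int, consumed: dict[str, int]) -> bool:
    """Allow risky changes only while some error budget is left."""
    return sum(consumed.values()) < total_budget


events = {"incident_1": 200_000, "incident_2": 500_000, "deployment": 100_000}
print(budget_report(1_000_000, events))
print(can_deploy(1_000_000, events))
```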
Where it breaks or gets interesting
Measuring the right thing
A 99.9% success rate SLO sounds good. But if the 0.1% of requests that fail are the most important ones (checkout, payment), users notice. Weight your SLIs by user impact, or track critical journeys with their own SLOs. A failed checkout is worse than a failed recommendation.
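One way to weight an SLI by user impact is to multiply each journey's request counts by an importance weight before computing the success rate. The sketch below assumes hypothetical traffic numbers and weights (a checkout request counts ten times as much as a recommendation request); choosing real weights is a product decision, not something the code can decide.

```python
def weighted_success_rate(journeys: dict[str, tuple[int, int, float]]) -> float:
    """journeys maps a name to (total_requests, failed_requests, weight)."""
    weighted_total = sum(total * weight for total, _, weight in journeys.values())
    weighted_failed = sum(failed * weight for _, failed, weight in journeys.values())
    if weighted_total == 0:
        raise ValueError("no weighted traffic to measure")
    return 100.0 * (1 - weighted_failed / weighted_total)


# Hypothetical month: 1,000 failures out of 1,000,000 requests overall (99.9% unweighted),
# but half of the failures hit the low-volume checkout journey.
traffic = {
    "checkout": (10_000, 500, 10.0),
    "recommendations": (990_000, 500, 1.0),
}
print(round(weighted_success_rate(traffic), 2))
```

With these assumed weights the weighted figure falls below 99.9%, which surfaces the checkout problem that the unweighted number hides.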
The SLO window
SLOs are measured over a time window (rolling 30 days, calendar month). A rolling window is more stable (no cliff effects at month boundaries). A calendar month aligns with billing cycles.
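A rolling window can be sketched as an SLI computed over the trailing N days of daily counts. The `(good, total)` tuple layout and the sample history are assumptions for illustration.

```python
def rolling_sli(daily_counts: list[tuple[int, int]], window_days: int = 30) -> float:
    """Success percentage over the most recent window_days entries of (good, total)."""
    window = daily_counts[-window_days:]
    good = sum(g for g, _ in window)
    total = sum(t for _, t in window)
    if total == 0:
        raise ValueError("no traffic in the window")
    return 100.0 * good / total


# Hypothetical history: 10 bad days followed by 30 clean days.
history = [(900, 1000)] * 10 + [(1000, 1000)] * 30
print(rolling_sli(history))                  # the bad days have aged out of the window
print(rolling_sli(history, window_days=40))  # a longer window still includes them
```

Old incidents drop out of a rolling window one day at a time rather than all at once at a month boundary.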
Latency SLOs
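Latency SLOs should be percentile-based, not average-based. An average hides the tail: average latency can stay under 200ms even while 1% of requests take 10 seconds, as long as the rest are fast. A target such as “99% of requests complete within 200ms” states explicitly how many slow requests are tolerated.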
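To see how an average can look healthy while a slow tail exists, the sketch below compares the mean with a threshold-based SLI on a made-up latency sample (98 fast requests and 2 very slow ones). The sample values are assumptions chosen for illustration.

```python
import statistics


def percent_within(latencies_ms: list[float], threshold_ms: float) -> float:
    """Percentage of requests that completed within threshold_ms."""
    if not latencies_ms:
        raise ValueError("no latencies to measure")
    fast = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return 100.0 * fast / len(latencies_ms)


# Hypothetical sample: 98 requests at 50 ms, 2 requests at 5 seconds.
latencies_ms = [50.0] * 98 + [5_000.0] * 2

mean_ms = statistics.mean(latencies_ms)
within_200 = percent_within(latencies_ms, 200.0)

print(mean_ms)     # the average stays under 200 ms
print(within_200)  # but fewer than 99% of requests finish within 200 ms
```

An "average under 200ms" target passes on this sample, while a "99% within 200ms" target correctly fails.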
SLOs for dependencies
Your SLO depends on your dependencies’ SLOs. If every request to your service needs three downstream services, each with 99.9% availability, and they fail independently, your theoretical best-case availability is 0.999^3 ≈ 99.7%. Set your SLO accordingly.
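The arithmetic generalizes to any number of dependencies. The sketch below assumes that every request needs every dependency and that failures are independent; real systems with retries, caches, or fallbacks can do better, and correlated failures can change the picture.

```python
import math


def serial_availability(dependency_availabilities: list[float]) -> float:
    """Upper bound on availability when every request needs all dependencies.

    Assumes independent failures; inputs are fractions such as 0.999.
    """
    return math.prod(dependency_availabilities)


combined = serial_availability([0.999, 0.999, 0.999])
print(round(100 * combined, 2))  # percent, rounded to two decimals
```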
Toil and error budgets
If your team spends all their time on manual reliability work (toil), they have no time to improve the system. Error budgets help make this visible: if staying within the budget depends on repeated manual interventions and on-call pages, that is a signal to automate.
Real-world systems
Google - Pioneered SRE (Site Reliability Engineering) and the SLO/error budget framework. Described in the SRE book (free online).
AWS - Publishes SLAs for its services. The EC2 Region-level SLA is 99.99% monthly uptime. Violation results in service credits.
Stripe - Publishes SLAs for their API. Uses error budgets internally to balance reliability and velocity.
Cloudflare - 100% uptime SLA for enterprise customers. Backed by a global network with extensive redundancy.
How to apply it in practice
Defining your first SLOs
- Identify your critical user journeys: What are the most important things users do? Login, checkout, search.
- Define SLIs for each journey: What metrics capture whether the journey is working? Success rate, latency.
- Look at historical data: What has your service actually achieved over the last 90 days?
- Set SLOs slightly below historical performance: If you have achieved 99.95% success rate, set the SLO at 99.9%.
- Calculate error budgets: How much unreliability does the SLO allow?
- Build dashboards: Track SLI vs SLO in real time. Show error budget remaining.
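The "slightly below historical performance" step can be sketched as picking the strictest target from a ladder of candidates that is still below what the service has achieved. The ladder of candidate values is a hypothetical convention chosen for this example, not a standard.

```python
# Hypothetical ladder of candidate targets, in percent.
CANDIDATE_SLOS = (99.0, 99.5, 99.9, 99.95, 99.99)


def suggest_slo(historical_percent: float,
                candidates: tuple[float, ...] = CANDIDATE_SLOS) -> float:
    """Return the strictest candidate that is strictly below historical performance."""
    below = [c for c in candidates if c < historical_percent]
    if not below:
        raise ValueError("historical performance is below every candidate target")
    return max(below)


print(suggest_slo(99.95))  # the example from the list above
```

A suggestion like this is only a starting point; the first step in the list (what users actually need) should override it when the two disagree.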
SLO-based alerting
Alert when you are burning through your error budget too fast, not whenever the instantaneous error rate crosses a fixed threshold.
Burn rate alerting: If you are consuming your monthly error budget at 14x the normal rate, you will exhaust it in about 2 days (30 / 14 ≈ 2.1). Alert on burn rate, not on instantaneous error rate.
Multi-window alerting: Alert when both a short window (1 hour) and a long window (6 hours) show elevated burn rate. This reduces false positives from brief spikes.
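A minimal sketch of burn rate and the two-window check follows. The function names are illustrative, the threshold of 14 is the example figure from this section, and the per-window error rates are assumed to be supplied by your monitoring system as fractions (0.02 means 2%).

```python
def burn_rate(error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    return error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """How long the full budget lasts if this burn rate is sustained."""
    return window_days / rate


def should_alert(short_window_error_rate: float,
                 long_window_error_rate: float,
                 slo_percent: float,
                 threshold: float = 14.0) -> bool:
    """Fire only when both the short (1 hour) and long (6 hour) windows burn too fast."""
    return (burn_rate(short_window_error_rate, slo_percent) >= threshold
            and burn_rate(long_window_error_rate, slo_percent) >= threshold)


print(round(days_to_exhaustion(14.0), 1))  # days until a 30-day budget is gone
print(should_alert(0.02, 0.02, 99.9))      # sustained 2% errors against a 99.9% SLO
print(should_alert(0.02, 0.001, 99.9))     # brief spike: only the short window is hot
```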
Communicating SLOs
Share SLOs with stakeholders. When a product manager wants to ship a risky feature, show them the error budget. “We have 20% of our error budget remaining this month. This deployment has a 30% chance of consuming it. Do you want to proceed?”
This makes reliability a shared responsibility, not just an engineering concern.
FAQ
Q: What is the difference between availability and reliability?
Availability is the percentage of time the system is operational. Reliability is the probability that the system performs its intended function correctly. A system can be available (responding to requests) but unreliable (returning wrong results). For most SLOs, you want to measure both: availability (is the service responding?) and correctness (are the responses correct?).
Q: Should SLOs be 100%?
No. 100% SLOs are impossible to achieve and create perverse incentives. If any failure violates the SLO, teams become risk-averse and stop shipping. The right SLO is the minimum reliability that users need. Users can tolerate some failures. The question is: how many?
Q: How do you handle planned maintenance in SLO calculations?
Planned maintenance windows can be excluded from SLO calculations if they are communicated in advance and users can plan around them. But modern systems should aim for zero-downtime deployments and maintenance, making this less relevant. If you need maintenance windows, they should be rare and short.
Interview questions
Q1: Your team’s SLO is 99.9% availability. You have consumed 80% of your error budget this month. A product manager wants to deploy a new feature. How do you handle this?
Strong answer: With 20% of the error budget remaining, you have limited room for risk, and how limited depends on how much of the month is left (assume 10 days here). Calculate the remaining budget in concrete terms: if your monthly budget is 43 minutes of downtime, you have 8.6 minutes left. Assess the deployment risk: what is the probability of an incident? How long would it take to detect and roll back? If the deployment has a 10% chance of causing a 30-minute incident, the expected error budget consumption is 3 minutes, which is within your remaining budget. Expected value is not the whole picture, though: if that incident does happen, 30 minutes is far more than the 8.6 minutes left, so also ask whether detection and rollback can be made fast enough to keep a failure within budget. If the risk is higher, delay the deployment until next month when the budget resets. Present this analysis to the product manager. The error budget framework makes the tradeoff explicit and data-driven, not a judgment call.
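The arithmetic in this answer can be sketched as follows. The function names are hypothetical, a 30-day month is assumed, and the incident probability and duration are the estimates quoted in the answer, not measured values.

```python
def remaining_budget_minutes(slo_percent: float,
                             consumed_fraction: float,
                             window_days: int = 30) -> float:
    """Minutes of downtime still available in the window."""
    total_minutes = window_days * 24 * 60 * (100.0 - slo_percent) / 100.0
    return total_minutes * (1.0 - consumed_fraction)


def expected_consumption_minutes(incident_probability: float,
                                 incident_minutes: float) -> float:
    """Expected downtime caused by the deployment."""
    return incident_probability * incident_minutes


remaining = remaining_budget_minutes(99.9, consumed_fraction=0.8)
expected = expected_consumption_minutes(0.10, 30.0)

print(round(remaining, 2))   # minutes left this month
print(round(expected, 2))    # expected minutes consumed by the deployment
print(expected <= remaining) # expected cost fits in the remaining budget
print(30.0 <= remaining)     # but a full 30-minute incident would not
```

Printing both comparisons makes the difference between the expected case and the worst case explicit.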
Q2: How do you define an SLO for a batch processing job?
Strong answer: Batch jobs have different SLIs than request-serving systems. Instead of availability and latency, measure: completion rate (percentage of jobs that complete successfully), freshness (how old is the output data?), and duration (does the job complete within the expected time window?). Example SLOs: 99.9% of daily batch jobs complete successfully, output data is less than 26 hours old (allowing for a 2-hour processing window), 95% of jobs complete within 4 hours. Alert when a job fails (completion rate SLO), when output data is stale (freshness SLO), or when a job is running longer than expected (duration SLO).
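The three batch SLIs can be sketched as small checks. The function names and input shapes are assumptions; the default thresholds (26 hours of freshness, 4 hours of duration) are the example values from the answer.

```python
from datetime import datetime, timedelta


def completion_rate(outcomes: list[bool]) -> float:
    """Percentage of job runs that completed successfully."""
    if not outcomes:
        raise ValueError("no job runs to measure")
    return 100.0 * sum(outcomes) / len(outcomes)


def is_fresh(last_output_at: datetime,
             now: datetime,
             max_age: timedelta = timedelta(hours=26)) -> bool:
    """True if the most recent output is younger than max_age."""
    return now - last_output_at < max_age


def on_time_rate(durations_hours: list[float], limit_hours: float = 4.0) -> float:
    """Percentage of job runs that finished within limit_hours."""
    if not durations_hours:
        raise ValueError("no job runs to measure")
    on_time = sum(1 for d in durations_hours if d <= limit_hours)
    return 100.0 * on_time / len(durations_hours)


# Hypothetical data for illustration.
now = datetime(2024, 1, 2, 12, 0)
print(completion_rate([True] * 9 + [False]))
print(is_fresh(datetime(2024, 1, 1, 11, 0), now))  # 25 hours old
print(on_time_rate([1.5, 2.0, 3.5, 5.0]))
```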
Q3: Explain the concept of error budget burn rate and how you use it for alerting.
Strong answer: Error budget burn rate is how fast you are consuming your error budget relative to the rate that would use it up exactly at the end of the SLO window. A burn rate of 1 means you are consuming the budget at exactly that rate. A burn rate of 14 means you are consuming it 14x faster - you will exhaust a 30-day budget in about 2 days. Alert on burn rate rather than instantaneous error rate because a brief spike in errors may consume very little budget, while a sustained moderate error rate can quietly exhaust it. Use multi-window alerting: alert when both a short window (1 hour) and a long window (6 hours) show a burn rate above a threshold. The long window shows that the burn has been sustained long enough to matter. The short window shows that it is still happening, so the alert stops firing soon after the problem is fixed. Requiring both reduces false positives while catching real problems.