What are SLI, SLO, and SLA? How do you define them in practice?
Definitions
SLI (Service Level Indicator) A specific quantitative metric that measures service quality.
Common SLI types:
- Availability: Successful requests / Total requests
- Latency: Requests served in under 200 ms / Total requests
- Error rate: 5xx errors / Total requests
- Throughput: Requests processed per second
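The four SLI types above can be sketched as plain ratios over a batch of request records. This is a minimal illustration, not a production implementation: the `Request` record, the function names, and the choice to treat any non-5xx response as "successful" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record used only for this sketch."""
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


# All functions assume a non-empty list of requests.

def availability(requests):
    # Assumption: a request counts as successful if it is not a 5xx.
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def latency_sli(requests, threshold_ms=200.0):
    # Proportion of requests served faster than the threshold.
    fast = sum(1 for r in requests if r.latency_ms < threshold_ms)
    return fast / len(requests)


def error_rate(requests):
    return sum(1 for r in requests if r.status >= 500) / len(requests)


def throughput(requests, window_seconds):
    # Requests processed per second over the observation window.
    return len(requests) / window_seconds


sample = [
    Request(200, 120.0),
    Request(200, 250.0),
    Request(503, 90.0),
    Request(200, 180.0),
]
print(availability(sample))   # 0.75
print(latency_sli(sample))    # 0.75
print(error_rate(sample))     # 0.25
print(throughput(sample, 2))  # 2.0
```

In a real system these counts would normally come from a metrics backend rather than an in-memory list, but the ratios are the same.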
SLO (Service Level Objective) The target value for an SLI over a specific time window. This is an internal target.
Examples:
- Over the past 30 days, availability SLI >= 99.9%
- Over the past 7 days, proportion of requests served in under 200 ms >= 95% (roughly equivalent to P95 latency < 200 ms)
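An SLO is just a target plus a time window attached to an SLI, so the two examples above can be written down as data and checked against a measured value. The `SLO` class and its field names below are hypothetical, chosen for this sketch only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO record: a target for one SLI over a window."""
    name: str
    target: float     # e.g. 0.999 for 99.9%
    window_days: int  # length of the evaluation window

    def is_met(self, sli_value: float) -> bool:
        # The SLO holds when the measured SLI reaches the target.
        return sli_value >= self.target


availability_slo = SLO("availability", target=0.999, window_days=30)
latency_slo = SLO("latency_under_200ms", target=0.95, window_days=7)

print(availability_slo.is_met(0.9995))  # True
print(latency_slo.is_met(0.93))         # False
```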
SLA (Service Level Agreement) A formal commitment made to customers, including compensation clauses for violations. SLA targets are typically looser than the corresponding SLOs, which leaves a buffer before a breach triggers compensation.
Relationship
SLI (measurement) → SLO (internal target) → SLA (external commitment)
Why an SLO is set below 100%: No system is perfectly reliable, and the SLI measurement itself carries error, so the target has to leave headroom.
Why an SLA is looser than the SLO: The SLO is an internal target, so missing it should still leave room to fix issues before the SLA is violated.
Error Budget
Error Budget = 1 - SLO
Example: If SLO = 99.9% over a 30-day window, then Error Budget = 0.1% × 43,200 minutes = 43.2 minutes, or about 43 minutes of downtime.
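The arithmetic in this example can be reproduced with a small helper. The function name and the 30-day default are assumptions for illustration; the formula is the one given above (1 - SLO, multiplied by the minutes in the window).

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based availability SLO."""
    window_minutes = window_days * 24 * 60  # 30 days -> 43,200 minutes
    return (1 - slo) * window_minutes


print(round(error_budget_minutes(0.999), 1))   # 43.2
print(round(error_budget_minutes(0.9999), 2))  # 4.32
```

Each additional "nine" in the SLO cuts the budget by a factor of ten, which is why tightening a target is rarely a free decision.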
Error Budget is the core SRE tool: if the budget is healthy, you can accelerate feature releases; if the budget is exhausted, freeze releases and prioritize reliability improvements.
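The release policy described above can be sketched as a check on how much of the budget is left. This is one possible request-based formulation under stated assumptions: the function names, the "ship"/"freeze" labels, and the rule that a budget at or below zero means a freeze are all illustrative choices, not a standard API.

```python
def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures


def release_decision(remaining: float) -> str:
    # Assumed policy: any budget left means releases continue;
    # an exhausted budget means a release freeze.
    return "ship" if remaining > 0 else "freeze"


# 99.9% SLO over 1,000,000 requests allows about 1,000 failures.
healthy = budget_remaining(0.999, 1_000_000, 400)
exhausted = budget_remaining(0.999, 1_000_000, 1_200)

print(release_decision(healthy))    # ship
print(release_decision(exhausted))  # freeze
```

Teams often add intermediate thresholds (for example, slowing releases once most of the budget is spent), but the binary rule is enough to show the mechanism.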
