What is Toil in SRE? How do you identify and reduce it?
What Toil Is
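Google's SRE book defines Toil as the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.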
Key characteristics:
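- Manual (requires human intervention)
- Repetitive (the same work over and over)
- Automatable (a machine could do it just as well)
- Tactical (interrupt-driven and reactive rather than planned)
- Grows linearly with service scale
- No enduring value (the system is no better afterwards)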
Toil vs Engineering Work
| Toil | Engineering Work |
|---|---|
| Manually restarting a service | Building an automatic restart mechanism |
| Manually handling capacity alerts | Building auto-scaling |
| Manually deploying code | Building a CI/CD pipeline |
| Weekly manual reports | Building an automated reporting dashboard |
The Harm of Toil
Lost engineering time: Google's guidance is to keep Toil below 50% of each SRE's time, so that at least half goes to engineering work that improves the system.
Burnout: Repetitive, low-judgment work erodes engineer satisfaction and morale.
Hinders scaling: Because Toil grows linearly with service scale, the team has to grow at the same rate or Toil becomes the bottleneck.
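To make the 50% limit and the linear-growth concern concrete, here is a minimal sketch. The 40-hour week and the 2 manual hours per service are hypothetical numbers chosen for illustration, and the function names are not from any library.

```python
def toil_fraction(toil_hours, total_hours):
    """Share of working time spent on Toil, between 0 and 1."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def projected_toil_hours(hours_per_service, service_count):
    """Linear model: every additional service adds the same manual workload."""
    return hours_per_service * service_count


WEEK_HOURS = 40          # assumed working week
HOURS_PER_SERVICE = 2    # hypothetical manual hours per service per week

for services in (5, 10, 15):
    toil = projected_toil_hours(HOURS_PER_SERVICE, services)
    frac = toil_fraction(toil, WEEK_HOURS)
    # Toil should stay below 50%, so exactly 50% already counts as over.
    status = "over budget" if frac >= 0.5 else "within budget"
    print(f"{services} services: {toil} h toil, {frac:.0%} ({status})")
```

Under these assumptions one engineer reaches the limit at 10 services. Beyond that point, only automation or more headcount keeps the Toil share down.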
How to Identify Toil
Track how on-call engineers actually spend their time, record which tasks consume the most of it, and flag the ones that are manual and keep recurring.
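The tracking step can be sketched as a small script over a work log. The `WorkEntry` fields, the sample log, and the threshold of three occurrences are all assumptions for illustration; a real log would more likely come from a ticketing or paging system.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class WorkEntry:
    task: str      # short label for the kind of work
    minutes: int   # time spent on this occurrence
    manual: bool   # True if a human had to do it by hand


def toil_candidates(entries, min_occurrences=3):
    """Return (task, occurrences, total_minutes) for manual tasks that
    repeat at least min_occurrences times, largest total time first."""
    counts = defaultdict(int)
    totals = defaultdict(int)
    for e in entries:
        if e.manual:
            counts[e.task] += 1
            totals[e.task] += e.minutes
    repeated = [(t, counts[t], totals[t]) for t in counts
                if counts[t] >= min_occurrences]
    return sorted(repeated, key=lambda row: row[2], reverse=True)


# Hypothetical on-call log for one week.
log = [
    WorkEntry("restart service", 15, True),
    WorkEntry("restart service", 10, True),
    WorkEntry("restart service", 20, True),
    WorkEntry("rotate certificate", 45, True),
    WorkEntry("design review", 60, False),
    WorkEntry("clear disk space", 30, True),
    WorkEntry("clear disk space", 30, True),
    WorkEntry("clear disk space", 25, True),
]

for task, n, minutes in toil_candidates(log):
    print(f"{task}: {n} times, {minutes} min")
```

Sorting by total time rather than by count puts the most expensive repetitive task at the top, which is usually the best first candidate for automation.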
Strategies to Reduce Toil
- Automation: Write repetitive tasks as scripts or services
- Self-healing: Design systems to detect failures and recover automatically (for example, Kubernetes liveness probes, which restart a container that fails its health check)
- Better design: Eliminate the root triggers that require human intervention
- Delegation: Give development teams self-service tools so they can handle some requests themselves instead of asking SRE
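As an illustration of the automation and self-healing strategies, the sketch below replaces "manually restarting a service" with a supervisor loop. It is a simplified stand-in, not how Kubernetes implements probes; `supervise`, `is_healthy`, and `restart` are hypothetical names, and the health check and restart action are passed in as plain callables.

```python
import time


def supervise(is_healthy, restart, checks, failure_threshold=3,
              interval_seconds=0.0):
    """Run `checks` health checks and call `restart` after
    `failure_threshold` consecutive failures.
    Returns the number of restarts triggered."""
    consecutive_failures = 0
    restarts = 0
    for _ in range(checks):
        if is_healthy():
            consecutive_failures = 0
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                restart()
                restarts += 1
                consecutive_failures = 0
        time.sleep(interval_seconds)
    return restarts


# Simulated service: healthy once, fails three checks in a row,
# then reports healthy again after the restart.
results = iter([True, False, False, False, True, True])
restart_log = []

count = supervise(
    is_healthy=lambda: next(results),
    restart=lambda: restart_log.append("restarted"),
    checks=6,
)
print(count, restart_log)
```

Requiring several consecutive failures before acting avoids restarting on a single transient error, the same idea as the `failureThreshold` setting on a Kubernetes probe.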
