What is Toil in SRE? How do you identify and reduce it?

What Toil Is

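Google's SRE book defines Toil as the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows.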

Key characteristics:

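  • Manual (requires human intervention)
  • Repetitive (the same work over and over)
  • Automatable (a machine could do it just as well as a person)
  • Tactical (interrupt-driven and reactive rather than planned)
  • Grows linearly with service scale
  • No enduring value (the system is no better afterwards)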

Toil vs Engineering Work

Toil                              | Engineering Work
Manually restarting a service     | Building an automatic restart mechanism
Manually handling capacity alerts | Building auto-scaling
Manually deploying code           | Building a CI/CD pipeline
Weekly manual reports             | Building an automated reporting dashboard

The Harm of Toil

Labor cost: Every hour spent on Toil is an hour not spent improving the system. Google's guideline is to keep Toil below 50% of each SRE's time, so that at least half goes to engineering work.

Burnout: Repetitive, mindless work lowers engineer satisfaction and morale.

Hinders scaling: Because Toil grows linearly with service scale, the team would have to grow at the same rate to keep up; otherwise Toil eventually becomes the scaling bottleneck.
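To make the linear-growth point concrete, here is a minimal arithmetic sketch. Every number in it is a made-up assumption for illustration (hours of manual work per service, team size, working hours), not a figure from Google or from any real team. It shows how a fixed amount of manual work per service eats a growing share of a fixed-size team's time as the number of services rises.

```python
# All constants are hypothetical values chosen for illustration only.
TOIL_HOURS_PER_SERVICE_PER_WEEK = 1.5   # assumed manual work per service
TEAM_SIZE = 6                           # assumed number of SREs
HOURS_PER_ENGINEER_PER_WEEK = 40        # assumed working hours
TOIL_CAP = 0.5                          # the 50% guideline

def team_capacity_hours() -> float:
    """Total hours the team has available per week."""
    return TEAM_SIZE * HOURS_PER_ENGINEER_PER_WEEK

def toil_fraction(num_services: int) -> float:
    """Fraction of team time consumed by toil at a given scale."""
    toil_hours = TOIL_HOURS_PER_SERVICE_PER_WEEK * num_services
    return toil_hours / team_capacity_hours()

def max_services_under_cap() -> int:
    """Largest number of services the team can run without exceeding the cap."""
    return int(TOIL_CAP * team_capacity_hours() / TOIL_HOURS_PER_SERVICE_PER_WEEK)

if __name__ == "__main__":
    for n in (20, 40, 80, 160):
        print(f"{n:>4} services -> {toil_fraction(n):.1%} of team time on toil")
    print("Services supportable under the cap:", max_services_under_cap())
```

Under these assumed numbers the team reaches the 50% cap at 80 services, and doubling the service count again would consume all of its time. Automation changes the picture by shrinking the per-service cost rather than by adding people.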

How to Identify Toil

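Track how on-call engineers actually spend their time (for example through tickets, time logs, or periodic surveys), record which tasks consume the most time, and flag the ones that are manual, repetitive, and automatable. Those are the best candidates for elimination.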
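The tracking step can be sketched as a small script over a work log. This is only an illustration: the `WorkEntry` record, the `summarize` helper, and the sample entries are all hypothetical, and a real team would more likely pull this data from its ticketing or on-call tooling.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class WorkEntry:
    task: str       # short label for the work item
    minutes: int    # time spent on it
    is_toil: bool   # judged manual, repetitive, automatable, no lasting value

def summarize(entries):
    """Return (toil fraction of total time, toil tasks ranked by minutes spent)."""
    total_minutes = sum(e.minutes for e in entries)
    per_task = defaultdict(lambda: {"minutes": 0, "count": 0})
    for e in entries:
        if e.is_toil:
            per_task[e.task]["minutes"] += e.minutes
            per_task[e.task]["count"] += 1
    toil_minutes = sum(stats["minutes"] for stats in per_task.values())
    ranked = sorted(per_task.items(), key=lambda kv: kv[1]["minutes"], reverse=True)
    fraction = toil_minutes / total_minutes if total_minutes else 0.0
    return fraction, ranked

# Hypothetical one-day log for a single engineer.
SAMPLE_LOG = [
    WorkEntry("restart service", 30, True),
    WorkEntry("restart service", 30, True),
    WorkEntry("handle capacity alert", 60, True),
    WorkEntry("design review", 120, False),
    WorkEntry("restart service", 30, True),
    WorkEntry("build autoscaler", 210, False),
]

if __name__ == "__main__":
    fraction, ranked = summarize(SAMPLE_LOG)
    print(f"Toil share of logged time: {fraction:.1%}")
    for task, stats in ranked:
        print(f"  {task}: {stats['minutes']} min across {stats['count']} occurrences")
```

Ranking by total minutes rather than by occurrence count is a design choice: a task that is rare but slow can cost more than one that is frequent but quick, and both numbers are worth looking at before deciding what to automate first.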

Strategies to Reduce Toil

  1. Automation: Turn repetitive tasks into scripts or services
  2. Self-healing: Design systems to detect and recover from failures automatically (for example, Kubernetes liveness probes, which restart unhealthy containers)
  3. Better design: Eliminate the root causes that trigger human intervention in the first place
  4. Delegation: Give development teams self-service tools so they can handle some Toil themselves
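As a rough illustration of the automation and self-healing strategies, the sketch below replaces "a human notices the service is down and restarts it" with a small supervisor loop. It is not how Kubernetes implements liveness probes; it only mimics the general idea of restarting after several consecutive failed checks (loosely comparable to a probe's `failureThreshold`). The `supervise` function and `FakeService` class are hypothetical names invented for this example, and a real loop would also wait between checks.

```python
from typing import Callable

def supervise(check: Callable[[], bool],
              restart: Callable[[], None],
              failure_threshold: int = 3,
              max_checks: int = 10) -> int:
    """Run up to max_checks health checks and restart the service after
    failure_threshold consecutive failures. Returns the number of restarts."""
    consecutive_failures = 0
    restarts = 0
    for _ in range(max_checks):
        if check():
            consecutive_failures = 0      # a healthy check resets the streak
        else:
            consecutive_failures += 1
            if consecutive_failures >= failure_threshold:
                restart()
                restarts += 1
                consecutive_failures = 0  # start counting again after restart
    return restarts

class FakeService:
    """Stand-in for a real service: unhealthy until it is restarted."""
    def __init__(self) -> None:
        self.healthy = False
        self.restart_count = 0

    def is_healthy(self) -> bool:
        return self.healthy

    def restart(self) -> None:
        self.healthy = True
        self.restart_count += 1

if __name__ == "__main__":
    service = FakeService()
    restarts = supervise(service.is_healthy, service.restart)
    print(f"Restarts performed: {restarts}, healthy now: {service.healthy}")
```

Requiring several consecutive failures before acting avoids restarting on a single transient blip. A restart loop like this only removes the manual step; the "better design" strategy would go further and fix whatever makes the service unhealthy.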

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub