
How do you design an effective alerting strategy? How do you avoid alert fatigue?


The Harm of Alert Fatigue

When alerts are too frequent or too low in quality, on-call engineers become desensitized to them, and real emergencies get ignored.

Good Alerting Principles

Symptom-oriented, not cause-oriented:

  • Bad: CPU utilization > 80% (a cause metric; it might just be a normal peak)
  • Good: User-visible error rate > 1% (a symptom metric; it directly impacts users)
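
To make the symptom-oriented condition concrete, here is a minimal sketch of the "error rate > 1%" check. The function name and its parameters are illustrative assumptions, not part of any monitoring tool.

```python
def error_rate_alert(error_count, request_count, threshold=0.01):
    """Return True when the share of failed requests exceeds the threshold.

    All names here are hypothetical; the 1% default mirrors the example above.
    """
    if request_count == 0:
        # No traffic means no user-visible errors to alert on.
        return False
    return error_count / request_count > threshold


print(error_rate_alert(30, 2000))  # 1.5% of requests failed
print(error_rate_alert(5, 2000))   # 0.25% of requests failed
```

The check deliberately ignores CPU or memory: it only fires when users are actually seeing failures.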

Alerts must be actionable: Every alert should have a corresponding Runbook explaining what to do when it fires. An alert with no clear action is just noise.
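
One way to enforce this is a small lint step that flags alert rules with no runbook attached. The sketch below assumes a hypothetical AlertRule record with a runbook_url field; real alerting systems store this differently.

```python
from dataclasses import dataclass


@dataclass
class AlertRule:
    name: str
    runbook_url: str = ""  # hypothetical field; empty means no runbook is linked


def rules_missing_runbook(rules):
    """Return the names of rules that have no runbook attached."""
    return [rule.name for rule in rules if not rule.runbook_url.strip()]


rules = [
    AlertRule("HighErrorRate", "https://wiki.example.com/runbooks/high-error-rate"),
    AlertRule("DiskAlmostFull"),
]
print(rules_missing_runbook(rules))  # ['DiskAlmostFull']
```

Running a check like this in CI keeps new alerts from shipping without a documented response.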

Set reasonable thresholds: Avoid being overly sensitive (alerting on every brief CPU spike). Use rolling windows (e.g., "5-minute average > threshold") to reduce false positives.
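
The rolling-window idea can be sketched as follows. The class name, window size, and sample values are assumptions chosen for illustration; the point is that a single spike does not move the window average above the threshold, while sustained load does.

```python
from collections import deque


class RollingAverageAlert:
    """Fire only when the average of the last `window_size` samples exceeds `threshold`."""

    def __init__(self, threshold, window_size):
        self.threshold = threshold
        self.samples = deque(maxlen=window_size)  # old samples drop off automatically

    def observe(self, value):
        self.samples.append(value)
        # Wait for a full window so one early sample cannot trigger the alert.
        window_full = len(self.samples) == self.samples.maxlen
        average = sum(self.samples) / len(self.samples)
        return window_full and average > self.threshold


# A brief spike to 95 is absorbed by the surrounding normal samples.
spike_alert = RollingAverageAlert(threshold=80.0, window_size=5)
spike_fired = [spike_alert.observe(v) for v in [40, 45, 95, 42, 41, 43]]

# Sustained high values push the window average over the threshold.
sustained_alert = RollingAverageAlert(threshold=80.0, window_size=5)
sustained_fired = [sustained_alert.observe(v) for v in [85, 90, 88, 92, 87]]

print(any(spike_fired))     # False
print(sustained_fired[-1])  # True
```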

Tiered severity:

  • P1 (Critical): Service is down or broken for users; respond immediately (e.g., website completely down)
  • P2 (Warning): Performance degradation, respond within 1 hour
  • P3 (Info): Needs attention but not urgent
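
The tiers above can be expressed as a routing table. In this sketch the channel names ("pager", "team-chat", "ticket-queue") are assumptions; only the severity levels and the one-hour P2 deadline come from the list above.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Severity(Enum):
    P1 = "critical"
    P2 = "warning"
    P3 = "info"


class Route(NamedTuple):
    channel: str                           # hypothetical delivery channel
    respond_within_minutes: Optional[int]  # None means no fixed deadline


ROUTES = {
    Severity.P1: Route("pager", 0),             # immediate response
    Severity.P2: Route("team-chat", 60),        # respond within 1 hour
    Severity.P3: Route("ticket-queue", None),   # needs attention, not urgent
}


def route_alert(severity):
    """Look up where an alert of the given severity should be delivered."""
    return ROUTES[severity]


print(route_alert(Severity.P2))
```

Keeping the mapping in one table makes it easy to audit that only P1 alerts can page someone.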

Noise Reduction Strategies

Alert grouping: Merge multiple alerts from the same root cause into one notification
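
A minimal sketch of grouping, assuming alerts are dictionaries of labels and that sharing the same "cluster" and "alertname" labels is an acceptable proxy for sharing a root cause. Both the label names and that proxy are assumptions.

```python
from collections import defaultdict


def group_alerts(alerts, keys=("cluster", "alertname")):
    """Collapse alerts that share the same values for `keys` into one summary line each."""
    groups = defaultdict(list)
    for alert in alerts:
        group_key = tuple(alert.get(k, "") for k in keys)
        groups[group_key].append(alert)
    return [
        f"{len(members)} alert(s) for "
        + ", ".join(f"{k}={v}" for k, v in zip(keys, group_key))
        for group_key, members in groups.items()
    ]


alerts = [
    {"cluster": "eu-1", "alertname": "HighLatency", "instance": "web-1"},
    {"cluster": "eu-1", "alertname": "HighLatency", "instance": "web-2"},
    {"cluster": "eu-1", "alertname": "HighLatency", "instance": "web-3"},
    {"cluster": "us-1", "alertname": "DiskFull", "instance": "db-1"},
]
notifications = group_alerts(alerts)
for line in notifications:
    print(line)
```

Four raw alerts become two notifications, so the on-call engineer sees one message per problem rather than one per instance.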

Alert suppression: Silence alerts during maintenance windows
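
Suppression can be sketched as a time-window check before a notification is sent. The window structure and service names below are hypothetical.

```python
from datetime import datetime, timezone


def is_silenced(service, fired_at, windows):
    """Return True if `fired_at` falls inside a maintenance window for `service`."""
    return any(
        w["service"] == service and w["start"] <= fired_at < w["end"]
        for w in windows
    )


# Hypothetical maintenance schedule; times are timezone-aware UTC datetimes.
windows = [
    {
        "service": "orders-db",
        "start": datetime(2026, 1, 10, 2, 0, tzinfo=timezone.utc),
        "end": datetime(2026, 1, 10, 4, 0, tzinfo=timezone.utc),
    }
]

during = datetime(2026, 1, 10, 3, 0, tzinfo=timezone.utc)
after = datetime(2026, 1, 10, 5, 0, tzinfo=timezone.utc)
print(is_silenced("orders-db", during, windows))  # True
print(is_silenced("orders-db", after, windows))   # False
```

Scoping each window to a service and an explicit end time avoids silences that linger after the maintenance is over.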

Dependency awareness: When the database goes down, suppress the derivative alerts from every service that depends on it, so that only the root-cause alert is delivered
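
A minimal sketch of dependency-aware filtering, assuming a hand-maintained map from each service to the services it depends on. The service names and the DEPENDS_ON map are hypothetical.

```python
# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "checkout-api": ["orders-db"],
    "orders-worker": ["orders-db"],
    "orders-db": [],
    "search-api": [],
}


def has_firing_dependency(service, firing, depends_on, seen=None):
    """Return True if any direct or transitive dependency of `service` is firing."""
    seen = set() if seen is None else seen
    for dep in depends_on.get(service, []):
        if dep in seen:
            continue  # already visited; also protects against dependency cycles
        seen.add(dep)
        if dep in firing or has_firing_dependency(dep, firing, depends_on, seen):
            return True
    return False


def filter_derivative_alerts(firing, depends_on):
    """Keep only alerts that cannot be explained by a firing dependency."""
    firing = set(firing)
    return sorted(
        s for s in firing if not has_firing_dependency(s, firing, depends_on)
    )


firing = ["checkout-api", "orders-worker", "orders-db", "search-api"]
print(filter_derivative_alerts(firing, DEPENDS_ON))  # ['orders-db', 'search-api']
```

The two services that depend on the database are dropped, while the unrelated search-api alert still gets through.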

Alert Quality Assessment

Regularly review alert history:

  • What percentage of alerts were silenced? → A high percentage means too much noise
  • How many incidents occurred without triggering an alert? → Any such incident means insufficient coverage
  • What's the average time from alert to response? → A rising average is an indicator of on-call burden
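
The three review questions can be computed from exported alert and incident records. The record fields used here ("silenced", "ack_seconds", "alerted") are assumptions about what such an export might contain.

```python
from statistics import mean


def review_alert_history(alerts, incidents):
    """Summarize noise, coverage, and response time from historical records."""
    silenced_pct = (
        100 * sum(1 for a in alerts if a["silenced"]) / len(alerts) if alerts else 0.0
    )
    missed_incidents = sum(1 for i in incidents if not i["alerted"])
    # Alerts that were never acknowledged have ack_seconds set to None.
    response_times = [a["ack_seconds"] for a in alerts if a["ack_seconds"] is not None]
    return {
        "silenced_pct": silenced_pct,
        "missed_incidents": missed_incidents,
        "mean_seconds_to_response": mean(response_times) if response_times else None,
    }


alerts = [
    {"silenced": False, "ack_seconds": 120},
    {"silenced": False, "ack_seconds": 300},
    {"silenced": True, "ack_seconds": None},
    {"silenced": False, "ack_seconds": 180},
]
incidents = [{"alerted": True}, {"alerted": False}]
report = review_alert_history(alerts, incidents)
print(report)
```

Tracking these numbers across review periods shows whether tuning is reducing noise without opening coverage gaps.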


Copyright © 2026 Wood All Rights Reserved · FE Interview Hub