What is the complete incident management process? What are the key elements of a Post-mortem?

Incident Management Lifecycle

1. Detection: Discover the problem through monitoring alerts or user reports. Minimize MTTD (Mean Time to Detect).

2. Response: The on-call engineer receives the alert, assesses the scope of the impact, and declares the incident severity level (P1/P2/P3).

3. Coordination

  • Assign an Incident Commander (IC) to coordinate the overall response
  • Create an incident communication channel (Slack channel / bridge call)
  • Provide status updates to stakeholders

4. Mitigation: Restore service first and investigate the root cause afterwards. Common mitigation actions: roll back the deployment, shift traffic to healthy instances, restart the service, or use a feature flag to disable the problematic feature.

5. Resolution: The service returns to normal, the incident is closed, and the timeline is documented.

6. Post-mortem: Hold the review within 48-72 hours of incident resolution, while the details are still fresh.
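
The six stages above can be modelled as a small state machine. The sketch below is illustrative only: the `Incident` class, the `Stage` and `Severity` enums, and the choice of 72 hours as the post-mortem deadline are assumptions made for this example, not the API of any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum, IntEnum
from typing import Optional


class Stage(IntEnum):
    """Lifecycle stages, numbered in the order they occur."""
    DETECTION = 1
    RESPONSE = 2
    COORDINATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6


class Severity(Enum):
    P1 = "P1"
    P2 = "P2"
    P3 = "P3"


@dataclass
class Incident:
    """Hypothetical incident record used only for this sketch."""
    title: str
    detected_at: datetime
    stage: Stage = Stage.DETECTION
    severity: Optional[Severity] = None
    resolved_at: Optional[datetime] = None

    def advance(self, now: datetime) -> Stage:
        """Move to the next stage; stages cannot be skipped."""
        if self.stage is Stage.POST_MORTEM:
            raise ValueError("incident lifecycle is already complete")
        # The severity level is declared during Response.
        if self.stage is Stage.RESPONSE and self.severity is None:
            raise ValueError("declare a severity before leaving Response")
        self.stage = Stage(self.stage + 1)
        if self.stage is Stage.RESOLUTION:
            self.resolved_at = now
        return self.stage

    def post_mortem_due(self) -> datetime:
        """Latest review time, assuming the 72-hour upper bound."""
        if self.resolved_at is None:
            raise ValueError("incident is not resolved yet")
        return self.resolved_at + timedelta(hours=72)
```

Refusing to leave Response without a severity mirrors step 2, and recording `resolved_at` at Resolution gives the post-mortem deadline a fixed reference point.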

Key Post-mortem Elements

Blameless culture: The goal is to improve the system, not assign blame to individuals. Avoid "who made the mistake" language.

Incident timeline: Reconstruct the complete sequence of events, from trigger to recovery.
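
In practice the timeline is usually pieced together from several sources (alerts, deploy logs, chat messages). A minimal sketch of that merge, assuming each source already yields timestamped entries; `TimelineEvent`, `build_timeline`, and `render` are hypothetical names for this example:

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime   # when the event happened
    source: str    # where the record came from, e.g. "alerting"
    note: str      # what happened, stated as a fact without blame


def build_timeline(*sources: list[TimelineEvent]) -> list[TimelineEvent]:
    """Merge events from every source into one chronological list."""
    merged = [event for events in sources for event in events]
    return sorted(merged, key=lambda event: event.at)


def render(timeline: list[TimelineEvent]) -> list[str]:
    """Format each event as 'HH:MM [source] note' for the write-up."""
    return [f"{e.at:%H:%M} [{e.source}] {e.note}" for e in timeline]
```

Sorting by timestamp after merging, rather than trusting the order within each source, keeps the sequence from trigger to recovery consistent even when records arrive out of order.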

Root Cause Analysis (RCA): Use a technique such as the "5 Whys" (repeatedly asking why each cause occurred) to reach the root cause rather than stopping at the surface symptom.

Action items: Each root cause should have a corresponding improvement measure with an assigned owner and due date.
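
The rule above is easy to check mechanically before a post-mortem is signed off. The following is a rough sketch under assumed names (`ActionItem` and `find_gaps` are hypothetical, not taken from any tracker):

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    root_cause: str               # the root cause this item addresses
    improvement: str              # the concrete improvement measure
    owner: Optional[str] = None   # person accountable for the item
    due: Optional[date] = None    # agreed due date


def find_gaps(root_causes: list[str], items: list[ActionItem]) -> list[str]:
    """Return a list of problems; an empty list means the rule holds."""
    problems = []
    covered = {item.root_cause for item in items}
    for cause in root_causes:
        if cause not in covered:
            problems.append(f"no action item for root cause: {cause}")
    for item in items:
        if not item.owner:
            problems.append(f"missing owner: {item.improvement}")
        if item.due is None:
            problems.append(f"missing due date: {item.improvement}")
    return problems
```

A review could treat any non-empty result as a reason to keep the post-mortem open.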

Key metrics:

  • MTTD (Mean Time to Detect): average time from the start of an incident until it is detected
  • MTTR (Mean Time to Recover): average time it takes to restore service after an incident begins
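
Both metrics are plain averages over past incidents. A minimal sketch, assuming each incident records when it started, when it was detected, and when service recovered; `IncidentTimes` and the helper names are hypothetical. It also assumes MTTR is measured from the start of the incident, although some teams measure it from detection instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable


@dataclass
class IncidentTimes:
    started_at: datetime
    detected_at: datetime
    recovered_at: datetime


def mean_delta(deltas: Iterable[timedelta]) -> timedelta:
    """Arithmetic mean of a non-empty collection of durations."""
    values = list(deltas)
    if not values:
        raise ValueError("at least one incident is required")
    return sum(values, timedelta()) / len(values)


def mttd(incidents: list[IncidentTimes]) -> timedelta:
    """Mean time from incident start to detection."""
    return mean_delta(i.detected_at - i.started_at for i in incidents)


def mttr(incidents: list[IncidentTimes]) -> timedelta:
    """Mean time from incident start to recovery (assumed definition)."""
    return mean_delta(i.recovered_at - i.started_at for i in incidents)
```

For example, two incidents detected after 4 and 10 minutes give an MTTD of 7 minutes.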

Copyright © 2026 Wood All Rights Reserved · FE Interview Hub