What is the complete incident management process? What are the key elements of a Post-mortem?
Incident Management Lifecycle
1. Detection: Discover the problem through monitoring alerts or user reports. Minimize MTTD (Mean Time to Detect).
2. Response: The on-call engineer receives the alert, assesses the impact scope, and declares the incident severity level (P1/P2/P3).
3. Coordination
- Assign an Incident Commander (IC) to coordinate the overall response
- Create an incident communication channel (Slack channel / bridge call)
- Provide status updates to stakeholders
4. Mitigation: Restore service first, then look for the root cause. Common mitigation actions: roll back the deployment, shift traffic, restart the service, or use a feature flag to disable the problematic feature.
5. Resolution: Service returns to normal, the incident is closed, and the timeline is documented.
6. Post-mortem: Conducted within 48-72 hours of incident resolution.
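The six stages above can be sketched as a small state machine that also records the timeline used later in the post-mortem. This is a minimal illustration, not a reference implementation: the names `Stage`, `Incident`, `declare_severity`, `advance`, and `postmortem_window` are hypothetical, and treating "48-72 hours" as an earliest/latest window is one possible reading.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List, Optional, Tuple


class Stage(Enum):
    """Lifecycle stages, numbered in the order listed above."""
    DETECTION = 1
    RESPONSE = 2
    COORDINATION = 3
    MITIGATION = 4
    RESOLUTION = 5
    POST_MORTEM = 6


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative."""
    title: str
    severity: Optional[str] = None  # declared during Response
    stage: Stage = Stage.DETECTION
    timeline: List[Tuple[datetime, str, str]] = field(default_factory=list)

    def declare_severity(self, level: str) -> None:
        # Only the three levels named in the text are accepted.
        if level not in ("P1", "P2", "P3"):
            raise ValueError(f"unknown severity: {level}")
        self.severity = level

    def advance(self, at: datetime, note: str) -> Stage:
        """Move to the next stage and append a timeline entry."""
        if self.stage is Stage.POST_MORTEM:
            raise ValueError("lifecycle already complete")
        self.stage = Stage(self.stage.value + 1)
        self.timeline.append((at, self.stage.name, note))
        return self.stage


def postmortem_window(resolved_at: datetime) -> Tuple[datetime, datetime]:
    """Return the (48 h, 72 h) marks after resolution (assumed reading)."""
    return resolved_at + timedelta(hours=48), resolved_at + timedelta(hours=72)
```

Recording a timestamped entry on every transition means the timeline needed at Resolution is already assembled by the time the incident is closed.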
Key Post-mortem Elements
Blameless culture: The goal is to improve the system, not assign blame to individuals. Avoid "who made the mistake" language.
Incident timeline: Reconstruct the complete sequence of events, from trigger to recovery.
Root Cause Analysis (RCA): Use the "5 Whys" technique to find the root cause, not just the surface symptom.
Action items: Each root cause should have a corresponding improvement measure with an assigned owner and due date.
Key metrics:
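- MTTD (Mean Time to Detect): The average time from when an incident begins until it is detected
- MTTR (Mean Time to Recover): The average time from when an incident begins (or is detected, depending on the team's definition) until service is restored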
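As a rough illustration, both metrics can be computed from the timestamps in each incident's timeline. The records below are made-up sample data, `mean_minutes` is a hypothetical helper, and measuring both metrics from the incident start is an assumption; some teams measure MTTR from detection instead.

```python
from datetime import datetime
from statistics import mean

# Made-up sample data: (started_at, detected_at, recovered_at) per incident.
incidents = [
    (datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 4), datetime(2024, 1, 5, 10, 34)),
    (datetime(2024, 2, 9, 22, 15), datetime(2024, 2, 9, 22, 25), datetime(2024, 2, 9, 23, 45)),
]


def mean_minutes(pairs):
    """Average duration in minutes over (start, end) timestamp pairs."""
    return mean([(end - start).total_seconds() / 60 for start, end in pairs])


# MTTD: incident start -> detection. MTTR: incident start -> recovery (assumed).
mttd = mean_minutes([(started, detected) for started, detected, _ in incidents])
mttr = mean_minutes([(started, recovered) for started, _, recovered in incidents])

print(f"MTTD: {mttd:.1f} min")  # MTTD: 7.0 min
print(f"MTTR: {mttr:.1f} min")  # MTTR: 62.0 min
```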
