Define the severity ladder (Sev-1 / Sev-2 / Sev-3)
2-4 hr decision + write-up
INC-001: Database connection pool exhausted
P0, ~12 min downtime. Root cause: too many idle connections. Fix: connection-pool config.
Postmortem: INC-001 (DB pool exhausted)
Timeline + 5 whys + 3 action items. All shipped within 1 sprint.
Set up the on-call rotation
1-2 days
INC-002: Stripe webhook signature drift
P1, no customer impact. Caught in staging. Stripe API rotation.
Postmortem: INC-002 (Stripe webhook signature drift)
Timeline + root cause + 2 action items. All shipped within 1 sprint.
Wire alerts → on-call → runbooks
1-2 weeks
INC-003: Stripe webhook delivery delays
P1, 30 min lag. Caused by Stripe-side rate limit. No customer impact.
Postmortem: INC-003 (Stripe rate limit)
Timeline + retry-strategy improvements. Action items shipped.
Define the Incident Commander (IC) role
1 day
Postmortem: INC-005 (auto-rebase race)
Atomic rebase lock + tracking issue retry pattern. Shipped.
INC-004: Slack notification failure
P2, internal-only. Slack token expired; renewed + added expiry monitoring.
Build the incident communication templates
1 day
INC-005: Sentinel auto-rebase race condition
P2, no customer impact. Two concurrent rebases stomped each other.
Set up the public status page
1-2 days
Run blameless postmortems on every Sev-1+
Per incident: 5 business days from resolution to published postmortem
Track action items to closure
Weekly, ~30 min
Run a fire drill quarterly + audit + improve
Quarterly, ~1 day per drill