Pick a vendor before you instrument anything
2-3 hr of research
Availability: 99.92% (target 99.9%). Within budget. p99: 187ms (target <200ms).
Switch to structured logging (JSON, with a schema)
Half a day
Availability dropped to 99.7% — 30min outage from DB pool. Investigating.
Availability: 99.95%. p99: 178ms. Error rate: 0.04%. All within budget.
Capture the four golden signals as metrics
Half a day
Availability dipped to 99.8% — 60min outage Stripe webhook delays. Investigating.
Wire up distributed tracing with OpenTelemetry
1 day
Define one or two SLOs the team actually believes in
Half a day to define, ongoing to refine
Reset SLO budget. Targets: 99.9% availability, p99 <200ms, error rate <0.1%.
Set up alerts on burn rate, not on threshold
2-3 hr
Build the four runbooks: latency, errors, saturation, third-party down
Half a day
Run a postmortem on the next real incident
2-4 hr per incident
Review the dashboard weekly; iterate on signal vs noise
30 min/week ongoing