Narrative

Eliminating Recurring SEV-2 Incidents

We were averaging more than 3 SEV-2 incidents per week. The team was good at responding, but they were essentially restarting the same services for the same reasons, week after week. Fast response without fixing root causes is not reliability; it is a treadmill.

SRE, Incident Management, Root Cause Analysis

What Was Broken

How It Was Built

I categorized the incidents from the previous 8 weeks. Most SEV-2s fell into 3-4 repeating patterns. For each pattern, I did a proper root cause analysis: not "service crashed, restart it" but "why did the service crash, and what needs to change so it does not crash again?" Some causes were configuration issues, some were resource limits set too low, and some were upstream dependencies with no circuit breaking. I fixed the actual causes and wrote runbooks for anything that could not be fully automated.
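The categorization step can be sketched roughly as follows. This is a minimal illustration, not the tooling actually used: the `Incident` record, the `pattern` label, and the `recurring_patterns` helper are all hypothetical names, and it assumes each incident has already been tagged by hand with a pattern during review.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    opened: date
    severity: str
    pattern: str  # label assigned by hand during incident review


def recurring_patterns(incidents, today, weeks=8, min_count=2):
    """Return (pattern, count) pairs for SEV-2 incidents opened within
    the lookback window, most frequent first, keeping only patterns
    that occurred at least min_count times."""
    cutoff = today - timedelta(weeks=weeks)
    counts = Counter(
        incident.pattern
        for incident in incidents
        if incident.severity == "SEV-2" and incident.opened >= cutoff
    )
    return [(p, n) for p, n in counts.most_common() if n >= min_count]
```

Sorting by frequency gives a natural order for the root cause work: the pattern at the top of the list is the one whose fix removes the most repeat incidents.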

What Changed

Recurring incidents dropped substantially. The team went from reactive firefighting to having capacity to do actual engineering work.