Making On-Call Survivable

Monitoring · Prometheus · Grafana · SRE

What Was Broken

The monitoring setup I inherited either missed real incidents or cried wolf constantly. On-call engineers were getting paged for non-issues and tuning out the real ones. Mean time to detection was essentially "whenever a user complained."

How It Was Built

I audited every existing alert. Most were set on arbitrary thresholds with no understanding of what normal looked like. So I rebuilt the alerting layer: first I defined what healthy actually looked like for each service, then set thresholds against that baseline. I silenced alerts that were pure noise, consolidated related ones, and added runbook links so whoever was on call knew exactly what to do when an alert fired.
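
To make that concrete, here is a minimal sketch of the kind of baseline-derived rule this process produced, written as a Prometheus alerting rule. It is illustrative rather than the actual config: the checkout service, the 500ms threshold (set above a hypothetical ~300ms p99 baseline), and the runbook URL are all assumptions.

```yaml
# Hypothetical example, not the production rule set. Service name,
# threshold, and runbook URL are illustrative assumptions.
groups:
  - name: checkout-service
    rules:
      - alert: CheckoutHighLatency
        # p99 latency over the last 5 minutes, alerting at a threshold set
        # above the observed baseline rather than an arbitrary round number.
        expr: |
          histogram_quantile(0.99,
            sum(rate(http_request_duration_seconds_bucket{service="checkout"}[5m])) by (le)
          ) > 0.5
        for: 10m  # require a sustained breach so brief blips don't page anyone
        labels:
          severity: page
        annotations:
          summary: "checkout p99 latency above 500ms for 10 minutes"
          runbook_url: "https://wiki.example.com/runbooks/checkout-latency"
```

Consolidating related alerts was largely a routing concern. An Alertmanager grouping along these lines (again, values are illustrative) turns a storm of per-instance pages into a single notification per service:

```yaml
# Hypothetical Alertmanager routing sketch: batch related alerts into one
# notification instead of paging once per firing instance.
route:
  receiver: on-call
  group_by: ['service', 'alertname']  # one page per service per alert type
  group_wait: 30s       # wait briefly so related alerts land in the same page
  group_interval: 5m
  repeat_interval: 4h   # re-notify on unresolved issues every 4h, not constantly

receivers:
  - name: on-call
```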

What Changed

Mean time to detection dropped to under 5 minutes. On-call engineers went from ignoring alerts to actually trusting them. That trust is what makes the whole system work.