The alerting system was useless in the worst possible way: not because it wasn't alerting, but because it was alerting too much. 200+ alerts per day, almost all noise. On-call engineers had trained themselves to ignore it, which meant real issues could go undetected for hours.
I audited every alert rule. For each one I asked: if this fires, does someone need to do something right now? If the answer was no, I either removed it, downgraded it to a warning, or rolled it into a daily digest. I also revisited thresholds that had been set on guesses: I pulled historical data, found real baselines, and set thresholds that reflected actual behavior. Anything that fired but had no runbook got a runbook before it was re-enabled.
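As a rough sketch of the baselining step, in Python: take historical samples of the metric, and derive the threshold from an upper percentile of observed behavior instead of a guess. The function name, percentile, and headroom factor here are illustrative assumptions, not the exact values we used.

```python
from statistics import quantiles

def baseline_threshold(samples: list[float], pct: int = 99, headroom: float = 1.2) -> float:
    """Derive an alert threshold from historical data instead of a guess.

    Takes the pct-th percentile of observed values and multiplies by a
    headroom factor, so the alert fires only on genuinely abnormal behavior.
    """
    if len(samples) < 2:
        raise ValueError("need at least two historical samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(samples, n=100)
    return cuts[pct - 1] * headroom

# Hypothetical usage: a week of p95 latency samples (ms), one per window.
history = [120, 135, 128, 140, 450, 132, 138, 125, 131, 129]
print(f"alert if latency > {baseline_threshold(history):.0f} ms")
```

The headroom factor is a judgment call: too small and you recreate the noise problem, too large and you miss slow degradations.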
We went from 200+ alerts per day to under 20, all actionable. On-call engineers started reading alerts again. Within a few weeks we caught a real incident from an alert, which sounds basic, but that is the whole point of monitoring.