The alerting system was useless in the worst possible way: not because it wasn't alerting, but because it was alerting too much. 200+ alerts per day, almost all noise. On-call engineers had trained themselves to ignore it, which meant real issues could go undetected for hours.
I audited every alert rule. For each one I asked: if this fires, does someone need to do something right now? If the answer was no, I either removed it, downgraded it to a warning, or rolled it into a daily digest. I also revisited thresholds that had been set on guesses: I pulled historical data, found real baselines, and set thresholds that reflected actual behavior. Anything that fired but had no runbook got a runbook before it was re-enabled.
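As a rough sketch of the baselining step, in Python: take historical samples of the metric, and derive the threshold from an upper percentile of observed behavior instead of a guess. The function name, percentile, and headroom factor here are illustrative assumptions, not the exact values we used.

```python
from statistics import quantiles

def baseline_threshold(samples: list[float], pct: int = 99, headroom: float = 1.2) -> float:
    """Derive an alert threshold from historical data instead of a guess.

    Takes the pct-th percentile of observed values and multiplies by a
    headroom factor, so the alert fires only on genuinely abnormal behavior.
    """
    if len(samples) < 2:
        raise ValueError("need at least two historical samples")
    # quantiles(n=100) returns the 99 cut points p1..p99.
    cuts = quantiles(samples, n=100)
    return cuts[pct - 1] * headroom

# Hypothetical usage: a week of p95 latency samples (ms), one per window.
history = [120, 135, 128, 140, 450, 132, 138, 125, 131, 129]
print(f"alert if latency > {baseline_threshold(history):.0f} ms")
```

The headroom factor is a judgment call: too small and you recreate the noise problem, too large and you miss slow degradations.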
We went from 200+ alerts per day to under 20, all actionable. On-call engineers started reading alerts again. Within a few weeks we caught a real incident from an alert, which sounds basic, but that is the whole point of monitoring.