HCL Technologies

Analyst (Site Reliability Engineer)

📅 Mar 2023 – Sep 2023 · 📍 On-site

I took ownership of a massive legacy system that was averaging more than three SEV-2 incidents every week.

Python · Bash · Prometheus · Linux
Recurring SEV-2 incidents: significantly reduced
Operational toil: automated
Alert noise: <20/day actionable

What I Did Here

I took ownership of a massive legacy system that was averaging more than three SEV-2 incidents every week. The existing operational culture prioritized rapid response over root cause analysis: the team was incredibly fast at restarting crashed services, but it was restarting the exact same services for the exact same reasons week after week.

I changed this approach by refusing to simply 'tune the restart scripts' and instead diving deep into the system logs to identify the core architectural failures. Profiling the application uncovered severe connection pool exhaustion and undetected memory leaks, which I patched directly.

At the same time, I observed that nearly 30% of the team's weekly capacity was being drained by monotonous operational toil like manual log rotation and certificate renewals, so I wrote resilient bash automation to handle these tasks autonomously.

Finally, the alerting system was broken in a way that actively hurt the team, spamming it with over 200 false positives a day. I rebuilt the entire ruleset using Prometheus composite alerts, dropping the noise to under 20 truly actionable notifications. The result was a dramatic stabilization of the platform and a major reduction in team burnout.

What I Was Accountable For

01

Reduced alert noise from 200+/day → <20 actionable signals using calibrated detection rules

02

Decreased SEV-2 incidents by fixing root causes instead of reactive restarts

03

Automated operational tasks (cert renewals, log rotation, health checks), reclaiming ~30% of team capacity

04

Improved system reliability through performance tuning (connection pools, GC, queries)

Key Wins

Reduced alert noise from 200+/day → <20 actionable signals using calibrated detection rules
<20/day actionable
Decreased SEV-2 incidents by fixing root causes instead of reactive restarts
Significantly reduced
Automated operational tasks (cert renewals, log rotation, health checks), reclaiming ~30% of team capacity
Automated

How It Was Built

01

Root Cause Analysis — Pattern First

The first step was stopping the bleeding. Instead of responding to the next incident, I pulled the post-mortem data for the last 8 weeks and categorized every outage. I discovered that 80% of the SEV-2s fell into exactly three recurring patterns: database connection pool exhaustion under sudden load, out-of-memory (OOM) kills on a specific microservice with a memory leak, and aggressive health checks failing due to misconfigured timeout thresholds.

I tackled these root causes directly. I tuned the connection pool sizes to match the actual calculated peak throughput rather than the default framework settings, worked with the dev team to patch the memory leak, and increased the health check timeouts to allow for expected network latency during heavy batches. The critical edge case was ensuring the new connection pool limits didn't inadvertently starve other smaller services sharing the same database; I mitigated this by implementing strict resource quotas on the database side.
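A minimal sketch of how the two leading indicators behind these patterns (connection counts approaching the pool ceiling, kernel OOM kills) could be watched between incidents. The filename, the database port (5432), the pool ceiling of 100, and the thresholds are illustrative assumptions rather than the production values.

ops/pattern_check.sh
bash
#!/bin/bash
# Sketch: flag the two dominant failure patterns before they become SEV-2s.
# Port, pool ceiling, and thresholds below are illustrative assumptions.
DB_PORT=5432
POOL_LIMIT=100

# Count established client connections to the database port from this host.
ACTIVE=$(ss -tn state established "( dport = :$DB_PORT )" | tail -n +2 | wc -l)
if [ "$ACTIVE" -ge $(( POOL_LIMIT * 90 / 100 )) ]; then
  echo "WARN: $ACTIVE/$POOL_LIMIT DB connections in use; pool near exhaustion"
fi

# Look for OOM kills in the kernel log over the last hour.
OOM_KILLS=$(journalctl -k --since "1 hour ago" 2>/dev/null | grep -ci "out of memory" || true)
if [ "$OOM_KILLS" -gt 0 ]; then
  echo "WARN: $OOM_KILLS OOM kill(s) in the last hour; check the leaking service"
fi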

02

Toil Automation

The engineering team was spending hours every week manually logging into servers to rotate bloated log files, check SSL certificate expiration dates, and restart secondary services that had hung. This was an unacceptable waste of highly paid engineering talent, so I engineered a suite of robust bash scripts to handle these tasks autonomously.

For example, the certificate checking script uses `openssl` to verify the expiration dates of all domains in the portfolio; if any certificate is within 14 days of expiration, it automatically fires a warning email to the ops team. The scripts were hardened with strict error handling and deployed centrally via cron. Once automated, these tasks became zero-touch, instantly reclaiming approximately 30% of the team's weekly capacity.

ops/cert_check.sh
bash
#!/bin/bash
set -u

# Domains to monitor; placeholder values here, the real list is loaded from config.
DOMAINS=("example.com" "api.example.com")

for DOMAIN in "${DOMAINS[@]}"; do
  # Pull the certificate's notAfter date from the live endpoint.
  EXPIRY=$(echo | openssl s_client -servername "$DOMAIN" \
    -connect "$DOMAIN:443" 2>/dev/null \
    | openssl x509 -noout -enddate \
    | cut -d= -f2)
  # Skip unreachable domains instead of aborting the whole run.
  [ -z "$EXPIRY" ] && { echo "WARN: could not check $DOMAIN" >&2; continue; }
  # Convert the notAfter timestamp into whole days remaining.
  DAYS_LEFT=$(( ($(date -d "$EXPIRY" +%s) - $(date +%s)) / 86400 ))
  if [ "$DAYS_LEFT" -lt 14 ]; then
    echo "WARN: $DOMAIN expires in $DAYS_LEFT days" | mail -s "Cert Alert" ops@team.com
  fi
done
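
A minimal sketch of how the central cron deployment could look; the paths, schedule times, and the `rotate_logs.sh` helper are illustrative assumptions, not the actual production crontab.

ops/install_cron.sh
bash
#!/bin/bash
# Sketch: install the automation schedule on the central ops host.
# Note: `crontab -` replaces the current user's entire crontab.
# Paths, times, and rotate_logs.sh are illustrative assumptions.
crontab - <<'EOF'
# Check certificate expiry every morning at 06:00.
0 6 * * * /opt/ops/cert_check.sh >> /var/log/ops/cert_check.log 2>&1
# Rotate oversized application logs nightly at 01:30.
30 1 * * * /opt/ops/rotate_logs.sh >> /var/log/ops/rotate_logs.log 2>&1
EOF
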
03

Alert Rebuild — Composite Rules

The existing monitoring system was actively harmful because it trained the team to ignore warnings. I audited every single existing alert rule against a ruthless standard: 'If this alert fires at 3 AM, does an engineer need to wake up immediately to fix it?' If the answer was no, the alert was either deleted entirely, downgraded to a silent warning, or aggregated into a daily email digest.

The remaining critical alerts were rebuilt as composite rules in Prometheus, configured to trigger only when multiple correlated symptoms occurred simultaneously: for instance, high CPU utilization is only an alert if it is accompanied by elevated HTTP 5xx error rates. This cleanup dropped the daily alert volume from over 200 down to fewer than 20. When the pager goes off now, the team knows the system is genuinely degraded.
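As a concrete illustration of the composite pattern, here is a minimal sketch that writes one such rule and validates it with `promtool` before Prometheus picks it up. The metric names (`node_cpu_seconds_total` from node_exporter and an `http_requests_total` counter with a `status` label), the thresholds, and the file paths are assumptions for illustration; the actual ruleset contained far more rules than this single example.

ops/deploy_composite_rule.sh
bash
#!/bin/bash
# Sketch: one composite alert that only pages when two correlated symptoms
# co-occur. Metric names, thresholds, and paths are illustrative assumptions.
RULES_FILE=/etc/prometheus/rules/composite.yml

cat > "$RULES_FILE" <<'EOF'
groups:
  - name: composite-alerts
    rules:
      - alert: ServiceDegraded
        # High CPU alone is not page-worthy; require an elevated 5xx rate too.
        # (Assumes both metrics expose a matching instance label.)
        expr: |
          (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 0.9
          and on (instance)
          sum by (instance) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (instance) (rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "High CPU with elevated 5xx rate on {{ $labels.instance }}"
EOF

# Validate the rule syntax before reloading Prometheus.
promtool check rules "$RULES_FILE"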

What Changed

The transition from reactive ops to proactive SRE was highly successful. By patching the root causes, the frequency of recurring SEV-2 incidents plummeted, drastically improving overall platform uptime. The automation suite reclaimed 30% of the team's weekly capacity, allowing engineers to focus on architectural improvements rather than monotonous maintenance. Slashing the alert noise by 90% restored absolute trust in the monitoring stack, significantly reducing burnout and improving the team's morale.

Recurring SEV-2 incidents
3+/week (same patterns) → root causes fixed, recurrence eliminated

A team that is constantly firefighting cannot innovate. By fixing the root causes and eliminating the recurring SEV-2s, the team finally had the breathing room to step back and actually engineer solutions, fundamentally shifting the culture from ops to SRE.

Operational toil
30% of team capacity → automated, capacity reclaimed

Engineering time is the most expensive resource in the company. Reclaiming 30% of the team's capacity by automating manual toil was the equivalent of hiring an additional senior engineer, massively accelerating the delivery of new infrastructure projects.

Alert noise
200+/day → <20/day (↓ 90%)

Alert fatigue is a massive operational risk because it guarantees that a real incident will eventually be ignored. Reducing the noise by 90% ensured that the on-call rotation was no longer a dreaded punishment, and engineers could trust that a page meant a real problem.

"The team stopped firefighting the same fires. Fixing root causes instead of tuning restart scripts is the difference between SRE and ops."