When I joined, the deployment pipeline was so fragile that every other release required an engineer to manually hotfix the environment, causing immense friction between development and operations. The obvious solution was to add more pre-deployment testing, but testing cannot catch environment drift. Instead, I rebuilt the entire delivery mechanism around staged rollouts with strict, automated rollback triggers.

Because cloud costs were spiraling, I audited the infrastructure and discovered that nearly half the instances were drastically oversized for their actual load. I right-sized the compute layer and implemented scheduled scaling rules, significantly cutting costs without touching application availability.

Furthermore, the alerting system was generating hundreds of noisy notifications daily, teaching engineers to ignore them. I overhauled the monitoring stack to use composite alerting rules, ensuring that when a pager went off, it was a genuine incident. The end result was an infrastructure that essentially ran itself, transforming the ops culture from reactive firefighting to proactive engineering.
Designed secure cloud infrastructure (IAM, RBAC, environment isolation) across dev/staging/prod; a brief RBAC sketch follows this list
Reduced MTTD from hours to <5 minutes by rebuilding monitoring and alerting (Prometheus, Grafana)
Drove deploy failures to near-zero using staged rollouts and automated rollback triggers
Automated 12+ runbooks, reducing manual ops time from hours to <3 minutes
Cut cloud costs ~40% via right-sizing and scheduled scaling without impacting availability
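To make the first bullet concrete, here is a minimal sketch of the Kubernetes RBAC side of the environment isolation; the group name and resource names are placeholders, and the matching cloud IAM policies are omitted:

# Sketch: engineers deploy freely in dev/staging but are read-only in prod.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: developers-read-only     # placeholder name
  namespace: prod
subjects:
  - kind: Group
    name: developers             # hypothetical identity-provider group
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                     # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io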
The core problem with the existing deployment process was that a bad release instantly hit 100% of the user base, forcing engineers to scramble to revert the changes manually at 2 AM. The obvious solution was better testing, but tests cannot perfectly simulate production traffic. Instead, I rebuilt the deployment pipelines around a strict canary phase: the new pipeline automatically routes 10% of live traffic to the new version for a 15-minute observation window, continuously polling health metrics. If the error rate spikes or p99 latency exceeds a defined threshold, the pipeline aborts and rolls back to the previous version, removing the human from the critical path of a failure entirely. The primary edge case I had to handle was keeping database migrations backward-compatible so that a rollback didn't corrupt the data layer. This pipeline transformed deployments from terrifying events into boring, routine operations. The simplified pipeline steps (GitHub Actions):
- name: Canary deploy (10%)
  run: kubectl set image deployment/app container=$IMAGE
  env:
    TRAFFIC_WEIGHT: 10   # read by the traffic-routing layer to shift 10% of live traffic
- name: Health gate (15 min)
  uses: ./.github/actions/health-check
  with:
    error_threshold: 0.5%
    duration: 15m
- name: Full promotion or rollback
  run: |
    if [ "$HEALTH_STATUS" = "pass" ]; then
      kubectl rollout resume deployment/app   # promote to 100%
    else
      kubectl rollout undo deployment/app     # automatic rollback
    fi

Cloud costs were spiraling not because of traffic, but because developers had habitually provisioned the largest available instances 'just to be safe'. The problem within the problem was that no one knew exactly which instances were actually doing work. I ran a comprehensive two-week utilization audit using Prometheus metrics, analyzing CPU and memory high-water marks across the entire fleet, and discovered that 40% of the instances were consistently running at under 15% CPU utilization. Instead of just downsizing them, which might risk performance during sudden spikes, I implemented aggressive scheduled scaling. Staging and development environments were configured to scale down to zero automatically during nights and weekends, while production environments scaled up 30 minutes before known historical peak-traffic windows. This approach required careful tuning of application startup times to ensure new instances were ready to accept traffic immediately, and it drastically cut the monthly AWS bill without a single user noticing any degradation in service.
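For illustration, here is a minimal sketch of such rules expressed as AWS Auto Scaling scheduled actions; the group names, cron schedules (UTC), and capacities are placeholders, not the real values:

# Staging scales to zero every weeknight at 20:00 UTC (placeholder values).
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name staging-asg \
  --scheduled-action-name nightly-scale-down \
  --recurrence "0 20 * * 1-5" \
  --min-size 0 --max-size 0 --desired-capacity 0

# Production warms up 30 minutes before the historical morning peak.
aws autoscaling put-scheduled-update-group-action \
  --auto-scaling-group-name prod-api-asg \
  --scheduled-action-name pre-peak-warm-up \
  --recurrence "30 8 * * *" \
  --min-size 12 --max-size 20 --desired-capacity 12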
The monitoring dashboard was a sea of red. Over 200 alerts fired every single day, mostly single-threshold CPU spikes that resolved themselves seconds later. The psychological result was absolute alert fatigue; when a real database failure occurred, the on-call engineer assumed it was just another noisy spike and ignored it for an hour. I decided to delete every single existing alert and start from scratch. I replaced the fragile single-metric thresholds with sophisticated composite rules. For example, instead of alerting when CPU hit 80%, the new rule only fired if CPU hit 80% AND the application's request latency simultaneously breached the SLA AND the HTTP 500 error rate was elevated. If the CPU spiked but the application was still serving requests quickly, it was explicitly deemed a non-issue. This fundamental shift from alerting on 'cause' to alerting on 'symptom' dropped the alert volume by 90%. The edge case was ensuring the composite evaluation windows were perfectly aligned so they didn't miss sharp, transient failures. Today, when the pager goes off, the team knows the system is actually broken.
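Here is a hedged sketch of one such composite rule in Prometheus alerting syntax; the metric names (node_cpu_seconds_total, http_request_duration_seconds_bucket, http_requests_total) follow common exporter conventions rather than our exact schema, and the thresholds are illustrative:

groups:
  - name: composite-alerts
    rules:
      # Page only when the cause (CPU saturation) coincides with symptoms:
      # SLA-breaching p99 latency AND an elevated 5xx rate. All three clauses
      # share the same 5m window so misaligned evaluations can't mask a sharp,
      # transient failure.
      - alert: ServiceDegraded
        expr: |
          (avg(rate(node_cpu_seconds_total{mode!="idle"}[5m])) > 0.8)
          and
          (histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 0.5)
          and
          (sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01)
        for: 3m
        labels:
          severity: page
        annotations:
          summary: "CPU saturated while p99 latency and 5xx error rate breach SLA"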
Within the first month, the delivery pipeline stabilized, dropping deployment failure rates to near-zero and eliminating the need for manual midnight rollbacks. The infrastructure audit and subsequent right-sizing reduced the total monthly cloud expenditure by a massive 40%. The alerting overhaul restored trust in the monitoring stack; on-call engineers began reacting instantly, pulling the Mean Time To Detection (MTTD) down from hours (when users reported it) to under 5 minutes. Ultimately, the system achieved true environment parity, making operations predictable and vastly accelerating the entire engineering lifecycle.
Every single deploy failure meant a senior engineer staying late, manually reverting database changes, and digging through logs to work out what broke. Achieving a near-zero failure rate meant release day was no longer a major event; it became indistinguishable from any other day of the week, which is exactly how a mature pipeline should feel.
Burning capital on idle infrastructure is the most frustrating form of technical debt. Reducing the monthly cloud cost by 40% immediately freed up significant operational budget, which leadership was able to reallocate toward hiring an additional engineer rather than paying AWS for empty CPU cycles.
When a system relies on users to report that it is broken, the engineering team has fundamentally failed. Pulling the MTTD down to under 5 minutes meant the team was usually already diagnosing the issue before the first customer support ticket was even filed, drastically improving the perceived reliability of the product.
An on-call rotation that pages an engineer 200 times a day is a fast track to severe burnout and high turnover. By reducing the noise by 90%, the alerts regained their authority; engineers started sleeping through the night and only woke up when their skills were actually required.
"The system runs itself. No 3am calls, no manual deploys, no configuration mysteries."