IIT Kanpur

Research Intern

📅 May 2019 – Jul 2019 · 📍 On-site

Automated a manual verification step that consumed over 2 hours of a researcher's time in every test cycle, cutting it to a 10-minute run.

Python · Pandas

Verification time per cycle: 10 min (automated)
Manual data cleaning: Automated

What I Did Here

I joined an active research pipeline that was bottlenecked by a manual verification step consuming over 2 hours of a researcher's time in every test cycle. This toil was slowing the pace of academic iteration, so rather than accepting it as normal, I architected a test automation utility in Python that codified the verification steps into rigorous programmatic assertions. The tool replaced the human element entirely, turning the 2-hour manual slog into a 10-minute automated run.

The raw data feeding the models was also inconsistently formatted across sources, forcing researchers to hand-clean CSV files constantly. I engineered robust Pandas preprocessing scripts to normalize all observed format variations, from missing values to encoding discrepancies. The output was a clean, standardized dataset that fed directly into the research pipeline, eliminating manual data wrangling entirely and letting the team focus on their core models.

What I Was Accountable For

01

The research team was losing 2 hours per test cycle to a highly repetitive, manual verification process.

Engineered a custom Python test automation utility that executed these verifications programmatically, shrinking the task to a 10-minute automated run.

02

Raw sensor data was wildly inconsistent, requiring researchers to spend hours manually cleaning formatting errors before they could run their models.

Built robust data preprocessing scripts in Pandas that autonomously standardized output regardless of input variations.

03

The academic environment lacked standard software engineering safeguards, leading to frequent silent errors in data processing.

Introduced rigorous programmatic assertions and explicit error logging to ensure data integrity before it reached the core models.

04

Eliminated the manual data-wrangling bottleneck entirely, creating an automated pipeline from raw data ingestion to normalized output and measurably accelerating research iteration.

Key Wins

Cut the verification cycle from a 2-hour manual process to a 10-minute automated run (12× faster) by replacing repetitive visual checks with a custom Python test utility.

Eliminated manual data cleaning entirely: robust Pandas preprocessing scripts standardized output regardless of input variations.

Hardened the pipeline against silent errors with rigorous programmatic assertions and explicit error logging before data reached the core models.

How It Was Built

01

Test Automation Utility

The manual verification process required researchers to visually compare hundreds of rows of output data against an expected baseline, a task that was both slow and highly susceptible to human fatigue. I built a Python utility that replaced this step entirely. The script programmatically loads the output and the baseline, then runs strict mathematical assertions across the entire dataset. If a value deviates beyond the acceptable research tolerance, the script logs the discrepancy, the exact row, and the magnitude of the error to a generated summary report. This turned a 2-hour visual inspection into a 10-minute scripted run. One critical edge case was the floating-point inaccuracy inherent in the models; I used `math.isclose()` for the numeric assertions so the tests would not fail on insignificant precision differences.
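The core comparison can be sketched as below. This is a minimal illustration, not the original utility: the tolerance value, function names, and CSV shape are all assumptions for the example.

```python
import csv
import logging
import math

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

# Illustrative tolerance; the real threshold came from the research requirements.
REL_TOL = 1e-6


def load_rows(path):
    """Load a CSV into a list of row dicts (the header row becomes the keys)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def verify(output_rows, baseline_rows, rel_tol=REL_TOL):
    """Compare output rows against baseline rows; log every out-of-tolerance cell."""
    discrepancies = []
    for i, (out, base) in enumerate(zip(output_rows, baseline_rows)):
        for col, expected in base.items():
            actual = float(out[col])
            expected = float(expected)
            # isclose avoids spurious failures from floating-point noise.
            if not math.isclose(actual, expected, rel_tol=rel_tol):
                delta = actual - expected
                discrepancies.append((i, col, delta))
                logging.error("row %d, col %s: deviation %.3g", i, col, delta)
    return discrepancies
```

A run over two loaded CSVs is then just `verify(load_rows("output.csv"), load_rows("baseline.csv"))`, with the returned list feeding the summary report.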

02

Data Preprocessing Pipeline

The researchers were losing hours in Excel aligning date formats and fixing null values before their code could run. I engineered an automated preprocessing pipeline in Pandas to handle this dynamically. The pipeline ingests the raw, messy CSVs and applies a strict series of normalization functions: it standardizes all timestamps to UTC, imputes missing values via column means or forward-fill, and converts disparate text encodings to UTF-8. The final output is a clean, standardized CSV that feeds directly into the research pipeline. The key architectural decision was making the pipeline fail loud: if it encounters an unknown data anomaly, it halts and flags the file rather than guessing and silently corrupting research data.
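The shape of that pipeline can be sketched as follows. This is a simplified illustration under assumptions: the `timestamp` column name, the `UnknownAnomalyError` class, and the encoding fallback are hypothetical stand-ins for the real normalization steps.

```python
import pandas as pd


class UnknownAnomalyError(ValueError):
    """Raised to fail loud instead of silently guessing at bad data."""


def preprocess(path: str) -> pd.DataFrame:
    """Normalize one raw CSV into a clean, standardized DataFrame."""
    # Encoding fallback: try UTF-8 first, then Latin-1 (a common CSV culprit).
    try:
        df = pd.read_csv(path, encoding="utf-8")
    except UnicodeDecodeError:
        df = pd.read_csv(path, encoding="latin-1")

    # Standardize timestamps to UTC; unparseable values become NaT.
    if "timestamp" in df.columns:
        df["timestamp"] = pd.to_datetime(df["timestamp"], errors="coerce", utc=True)

    # Impute numeric gaps with column means; forward-fill remaining gaps.
    numeric = df.select_dtypes("number").columns
    df[numeric] = df[numeric].fillna(df[numeric].mean())
    df = df.ffill()

    # Fail loud: halt on anything still unresolved rather than guess.
    if df.isna().any().any():
        raise UnknownAnomalyError(f"unresolved anomalies in {path}")
    return df
```

The fail-loud check at the end is the point of the design: a halted run with a flagged file is cheap, while a silently corrupted dataset can invalidate weeks of experiments.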

What Changed

These utilities fundamentally accelerated the research team's velocity. Automating the verification step permanently reduced the 2-hour manual bottleneck to a 10-minute programmatic run, allowing far more test iterations per day. The Pandas preprocessing pipeline handled every observed data format variation, eliminating manual data cleaning entirely. These tools demonstrated that applying rigorous software engineering practices to academic research yields substantial efficiency gains.

Verification time per cycle: 2 hours (manual) → 10 minutes (automated), a 12× speedup

In a research environment, the speed of iteration is the speed of discovery. Reducing the verification cycle by 12× meant the team could run dozens of experiments in a day instead of just two, fundamentally altering the pace of their academic output.

Manual data cleaning: required per batch → eliminated

A researcher's time should be spent designing algorithms, not formatting cells in Excel. Automating the data cleaning process entirely removed a massive source of frustration and ensured the data feeding the models was mathematically consistent every single time.

"Learned early that production-grade code is about reliability, not cleverness. A script that runs correctly every time is worth more than a clever one that fails mysteriously."