Narrative

Enterprise Web Scraper — 10,000+ Products Tracked

Python · Scrapy · Selenium · Pandas · Automation

What Was Broken

The team was making pricing decisions based on manually collected competitor data that was inconsistent, outdated, and incomplete. You cannot spot trends or make good pricing decisions on data collected sporadically by hand.

How It Was Built

Three deliberate design decisions defined this project.

  • Scrapy + Selenium — deliberate dual-engine approach
  • Fail loudly — not silently
  • Structured output — DataFrames, not raw dumps

Scrapy + Selenium — deliberate dual-engine approach

Most scrapers pick one tool. I used both intentionally. Scrapy handles high-volume static content fast. Selenium handles JavaScript-rendered pages where content loads dynamically after page load — common on modern e-commerce sites. Forcing Scrapy to handle dynamic pages, or using Selenium for everything, would have been slower and more fragile.

pipeline.py
python
class ScrapingPipeline:
    def __init__(self):
        self.scrapy_spider = StaticSpider()
        self.selenium_scraper = DynamicScraper()
    
    def scrape_product(self, url: str) -> Product:
        # Detect if page requires JS rendering
        if self.requires_js_rendering(url):
            raw = self.selenium_scraper.fetch(url)
        else:
            raw = self.scrapy_spider.fetch(url)
        
        return self.normalize(raw)
    
    def requires_js_rendering(self, url: str) -> bool:
        # Check against known JS-heavy domains
        return any(
            domain in url 
            for domain in JS_RENDERED_DOMAINS
        )

Fail loudly — not silently

A scraper that breaks silently is worse than no scraper — you think you have data, but you do not. I built explicit failure detection: if a scrape fails or returns unexpected structure, it logs loudly and triggers an alert rather than writing bad data downstream.

validation.py
python
def validate_and_store(product: dict) -> None:
    required_fields = [
        'name', 'price', 'availability', 'url'
    ]
    
    missing = [
        f for f in required_fields 
        if not product.get(f)
    ]
    
    if missing:
        # Fail loudly — never silently
        alert_ops(
            f"Scrape validation failed: "
            f"missing {missing} for {product.get('url')}"
        )
        log_to_review_queue(product)
        return  # Do NOT write bad data
    
    store_to_database(product)

Structured output — DataFrames, not raw dumps

Output was structured into clean Pandas DataFrames before storage — not raw HTML that someone else has to parse. The pipeline runs on schedule, handles site structure changes through configurable field mappings, and can extend to new sites without rewriting core logic.
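
To make the configurable field mappings concrete, here is a minimal sketch of how a per-site mapping could drive normalization into a DataFrame. The file name, the FIELD_MAPPINGS structure, and the to_dataframe helper are illustrative assumptions, not the project's actual code.

field_mapping.py
python
from datetime import datetime, timezone

import pandas as pd

# Illustrative per-site field mapping (assumption): maps canonical column
# names to whatever keys each site's scraper returns. Adding a new site
# means adding an entry here, not touching the core pipeline.
FIELD_MAPPINGS = {
    "example-shop.com": {
        "name": "product_title",
        "price": "sale_price",
        "availability": "stock_status",
        "url": "page_url",
    },
}


def to_dataframe(site: str, raw_items: list[dict]) -> pd.DataFrame:
    """Normalize raw scraped dicts into a clean, typed DataFrame."""
    mapping = FIELD_MAPPINGS[site]
    rows = [
        {canonical: item.get(source) for canonical, source in mapping.items()}
        for item in raw_items
    ]
    df = pd.DataFrame(rows, columns=list(mapping))
    df["price"] = pd.to_numeric(df["price"], errors="coerce")
    df["scraped_at"] = datetime.now(timezone.utc)  # timestamp every run
    return df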

What Changed

Teams went from reacting to competitor pricing changes after the fact to having continuous structured data updated on a regular cadence.

  • Products Tracked: manual, inconsistent → 10,000+ tracked continuously
  • Data Freshness: days old → always current
  • Manual Collection: significant staff time → fully automated
"The manual process it replaced was replaced permanently. Teams shifted from reacting late to tracking trends continuously."

Common Questions

How did you deal with sites that block scrapers?

Request throttling, randomized delays, and rotating user-agent strings for basic protection. For sites with more aggressive measures, I used headless browser sessions with realistic browsing patterns through Selenium. I also built retry logic with exponential backoff so a temporary block does not kill a full run.
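
As a rough illustration of the throttling and backoff described above, a minimal sketch follows; the USER_AGENTS pool, the delay values, and the fetch_with_backoff helper are assumptions for illustration, not the production code.

fetch_with_backoff.py
python
import random
import time

import requests

# Illustrative pool of user-agent strings to rotate through (assumption).
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
]


def fetch_with_backoff(url: str, max_retries: int = 5) -> requests.Response:
    """Fetch a URL with randomized delays and exponential backoff on failure."""
    for attempt in range(max_retries):
        # Randomized delay so requests do not arrive on a fixed cadence.
        time.sleep(random.uniform(1.0, 4.0))
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        try:
            response = requests.get(url, headers=headers, timeout=30)
            if response.status_code == 200:
                return response
        except requests.RequestException:
            pass  # fall through to backoff and retry
        # Exponential backoff: wait 2s, 4s, 8s, ... before the next attempt.
        time.sleep(2 ** (attempt + 1))
    raise RuntimeError(f"Failed to fetch {url} after {max_retries} attempts")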
What happened when a site changed its layout?

CSS selectors and XPath expressions were externalized into config files, not hardcoded. When a site changed layout, you updated the config, not the scraper code. I also added a validation layer that checks whether expected fields are present; if a field goes missing, it flags a config review rather than writing empty data.
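
One way to picture the externalized selectors: a sketch of a per-site selector mapping and a small extractor that reads it. The SELECTORS layout and the extract_fields helper are illustrative assumptions; only the idea of config-driven CSS/XPath comes from the project.

selectors_config.py
python
# Illustrative externalized selector config (assumption): CSS lives in data,
# not in the spider, so a layout change means editing this mapping (or the
# config file it is loaded from), not the scraper code.
SELECTORS = {
    "example-shop.com": {
        "name": "h1.product-title::text",
        "price": "span.price-current::text",
        "availability": "div.stock-status::text",
    },
}


def extract_fields(site: str, response) -> dict:
    """Pull configured fields from a Scrapy response using the site's selectors."""
    config = SELECTORS[site]
    product = {field: response.css(css).get() for field, css in config.items()}
    product["url"] = response.url
    return product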
How was the scraped data stored?

Structured into DataFrames, then written to a database with timestamps. This meant historical data was preserved — you could compare today's price to last week's, not just see the current snapshot.
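
To show what timestamped storage enables, a small pandas sketch comparing the latest price per product against the price roughly a week earlier; the column names (url, price, scraped_at) and the week_over_week_change helper are illustrative assumptions.

price_history.py
python
from datetime import datetime, timedelta, timezone

import pandas as pd


def week_over_week_change(history: pd.DataFrame) -> pd.DataFrame:
    """Compare the latest price per product URL to the price about 7 days earlier.

    Expects columns: url, price, scraped_at (illustrative schema, tz-aware timestamps).
    """
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    latest = (
        history.sort_values("scraped_at")
        .groupby("url", as_index=False)
        .last()[["url", "price"]]
        .rename(columns={"price": "price_now"})
    )
    last_week = (
        history[history["scraped_at"] <= cutoff]
        .sort_values("scraped_at")
        .groupby("url", as_index=False)
        .last()[["url", "price"]]
        .rename(columns={"price": "price_last_week"})
    )
    merged = latest.merge(last_week, on="url", how="left")
    merged["change"] = merged["price_now"] - merged["price_last_week"]
    return merged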