The team was making pricing decisions based on manually collected competitor data that was inconsistent, outdated, and incomplete. You cannot spot trends or make good pricing decisions on data collected sporadically by hand.
Two deliberate design decisions defined this project.
Most scrapers pick one tool. I used both intentionally. Scrapy handles high-volume static content fast. Selenium handles JavaScript-rendered pages where content loads dynamically after page load — common on modern e-commerce sites. Forcing Scrapy to handle dynamic pages, or using Selenium for everything, would have been slower and more fragile.
```python
class ScrapingPipeline:
    def __init__(self):
        self.scrapy_spider = StaticSpider()
        self.selenium_scraper = DynamicScraper()

    def scrape_product(self, url: str) -> Product:
        # Detect if page requires JS rendering
        if self.requires_js_rendering(url):
            raw = self.selenium_scraper.fetch(url)
        else:
            raw = self.scrapy_spider.fetch(url)
        return self.normalize(raw)

    def requires_js_rendering(self, url: str) -> bool:
        # Check against known JS-heavy domains
        return any(
            domain in url
            for domain in JS_RENDERED_DOMAINS
        )
```

A scraper that breaks silently is worse than no scraper: you think you have data, but you do not. I built explicit failure detection. If a scrape fails or returns an unexpected structure, it logs loudly and triggers an alert rather than writing bad data downstream.
```python
def validate_and_store(product: dict) -> None:
    required_fields = [
        'name', 'price', 'availability', 'url'
    ]
    missing = [
        f for f in required_fields
        if not product.get(f)
    ]
    if missing:
        # Fail loudly, never silently
        alert_ops(
            f"Scrape validation failed: "
            f"missing {missing} for {product.get('url')}"
        )
        log_to_review_queue(product)
        return  # Do NOT write bad data
    store_to_database(product)
```

Output was structured into clean Pandas DataFrames before storage, not raw HTML that someone else has to parse. The pipeline runs on a schedule, handles site structure changes through configurable field mappings, and can extend to new sites without rewriting core logic.
Teams went from reacting to competitor pricing changes after the fact to having continuous structured data updated on a regular cadence.
"The manual process was retired permanently. Teams shifted from reacting late to tracking trends continuously."