E-commerce Web Scraping Pipeline

Overview

This project involves the creation of a persistent, cloud-based web-scraping data pipeline that generates a TB-scale data source for research. In particular, the data covers at least two years of vehicle sales on large public e-commerce platforms. The generated data supports research groups at both the School of Information and the School for Environment and Sustainability at the University of Michigan.

I served as the primary data engineer on this project, with responsibilities spanning cloud infrastructure development, web scraping development, and site reliability engineering. I also served as the primary consultant for the research groups, leading stakeholder meetings to discover data requirements, configure data access patterns, and explain performance, cost, and reliability trade-offs.

The pipeline went through two key changes: scaling throughput by 100x and cutting costs by half. Both are described under Highlighted tasks.

Technical Components

Systems design diagram of data pipeline.

Technologies and Tools

Requirements

Data requirements:

Scale:

Fault tolerance, data backup, and safety:

Services

Search job (AWS Batch job)
Scrape job (AWS Batch job)
Data fetcher (UM server cron job)
Telemetry (AWS Lambda function)
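As a rough illustration of how a search job and a scrape job might hand work to each other, the sketch below splits a list of discovered listing URLs into fixed-size batches, one batch per scrape job. The function name `batch_listings`, the batch size, and the example URLs are hypothetical; this write-up does not describe the actual hand-off, so treat this as one possible design rather than the pipeline's implementation.

```python
from typing import Iterator, List


def batch_listings(urls: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of listing URLs (hypothetical hand-off step)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # Drop duplicate URLs while keeping the order in which they were found,
    # so the same listing is not scraped twice in one run.
    unique = list(dict.fromkeys(urls))
    for start in range(0, len(unique), batch_size):
        yield unique[start:start + batch_size]


if __name__ == "__main__":
    # Placeholder URLs, not real listings.
    found = [f"https://example.com/listing/{i}" for i in range(5)]
    for index, batch in enumerate(batch_listings(found, batch_size=2)):
        print(index, batch)
```

Batching by a fixed size keeps each scrape job's workload bounded, which makes run time and retry cost per job easier to reason about.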

Highlighted tasks

Scaling pipeline 100x

Before Fall 2023, the stakeholders were interested in researching only electric vehicles (EVs). Later, we learned that they wanted to capture data on all vehicles, not just EVs. In practice, this meant a 100x increase in the throughput of our pipeline.
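To make the effect of a 100x increase concrete, here is a back-of-envelope sketch. Only the 100x factor comes from the text above; the baseline page count and per-worker capacity are placeholder values chosen for illustration, not measurements from this project, and the function names are hypothetical.

```python
def project_daily_load(baseline_pages_per_day: int, scale_factor: int) -> int:
    """Scale a baseline daily page count by a constant factor."""
    return baseline_pages_per_day * scale_factor


def required_workers(pages_per_day: int, pages_per_worker_per_day: int) -> int:
    """Return the number of workers needed, rounding up to a whole worker."""
    if pages_per_worker_per_day <= 0:
        raise ValueError("pages_per_worker_per_day must be positive")
    # Ceiling division using integers only.
    return -(-pages_per_day // pages_per_worker_per_day)


if __name__ == "__main__":
    baseline = 1_000           # hypothetical pages/day before the change
    per_worker = 8_000         # hypothetical pages/day one worker can handle
    scaled = project_daily_load(baseline, scale_factor=100)
    print(scaled, required_workers(scaled, per_worker))
```

With these placeholder inputs, a workload that one worker handled comfortably would need more than a dozen workers after scaling, which is why every component comes under pressure at once.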

From a project management standpoint, I took the lead in explaining the engineering challenges to both the stakeholders and the engineering team.

From a technical standpoint, I quickly realized that every component of the pipeline would be stressed at the 100x level. A quick summary of changes I implemented:

Cutting costs by half

After adjusting the pipeline to accommodate the 100x increase in throughput, we quickly realized that its costs would be very high and would keep increasing month over month. So I redesigned the pipeline, including its components and services, to decrease costs while still meeting the requirements.
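One way a pipeline's bill can grow month over month is that stored data accumulates while compute stays roughly flat. The sketch below models that pattern under stated assumptions: the function name `monthly_costs`, the cost split, and all prices and volumes are hypothetical, and this write-up does not say which cost drivers actually applied here.

```python
from typing import List


def monthly_costs(
    months: int,
    compute_cost: float,
    new_data_gb_per_month: float,
    storage_price_per_gb: float,
) -> List[float]:
    """Return the total cost for each month under a simple accumulation model."""
    costs = []
    stored_gb = 0.0
    for _ in range(months):
        # Storage is billed on everything kept so far, so it grows every month.
        stored_gb += new_data_gb_per_month
        costs.append(compute_cost + stored_gb * storage_price_per_gb)
    return costs


if __name__ == "__main__":
    # Placeholder inputs: 100 per month of compute, 1000 GB of new data per
    # month, and a storage price of 0.02 per GB-month.
    for month, cost in enumerate(monthly_costs(6, 100.0, 1000.0, 0.02), start=1):
        print(month, round(cost, 2))
```

Under this model the monthly bill rises linearly even when throughput is constant, so reducing the stored volume or the storage price lowers both the current cost and its growth rate.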

My first step after redesigning was to perform a cost/benefit analysis:

From a technical standpoint, I implemented the following changes: