Data pipelines that seeded a marketplace from scratch

A car marketplace with an empty catalog is useless. Before launch, the platform needed thousands of real listings — new car configurations with accurate specs and photos, and used car listings scraped from existing platforms in the region. Manually entering this data was out of the question; the catalog required coverage across dozens of brands, hundreds of models, and thousands of configurations. We built two separate ingestion pipelines: a Python scraper for new car data from a major automotive listing site, and a TypeScript-based browser parser for used car listings from two different platforms.

New car pipeline

Figure: data flow from source sites through scrapers to normalized database entries.

The new car pipeline is the simpler of the two. The target site publishes structured new car listings with consistent HTML markup: brand pages, model pages, and configuration pages, each with a predictable layout. We wrote the scraper in Python using requests and BeautifulSoup, navigating the site's hierarchy from the brand index down to individual configurations. For each configuration, the scraper extracts the full record: engine specs, transmission type, body style, dimensions, feature list, pricing, and photo gallery URLs. Photos are downloaded in parallel, resized to the platform's standard dimensions, and stored with predictable filenames keyed to the configuration ID. The output is one structured JSON file per brand that maps directly to the CarConfiguration model in the Prisma schema, ready for database seeding.
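To make the extraction step concrete, here is a minimal sketch of turning one configuration page into a JSON-ready record. It is not the production scraper: it uses the standard library's HTMLParser in place of BeautifulSoup so it runs without dependencies, and the table markup, the field names (configId, specs, photos), and the filename pattern are all assumptions made for illustration.

```python
import json
from html.parser import HTMLParser


class SpecTableParser(HTMLParser):
    """Collects <th>label</th><td>value</td> pairs from a spec table.

    The th/td layout is an assumed page structure, not the real site's markup.
    """

    def __init__(self):
        super().__init__()
        self.specs = {}
        self._cell = None    # cell tag currently open: "th", "td", or None
        self._label = None   # text of the most recent <th>

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._cell = tag

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._cell = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._cell == "th":
            self._label = text
        elif self._cell == "td" and self._label is not None:
            self.specs[self._label] = text
            self._label = None


def photo_filename(config_id, index):
    """Predictable filename keyed to the configuration ID (hypothetical pattern)."""
    return f"{config_id}_{index:02d}.jpg"


def build_record(config_id, html, photo_urls):
    """Builds one JSON-serializable record for a configuration page."""
    parser = SpecTableParser()
    parser.feed(html)
    return {
        "configId": config_id,
        "specs": parser.specs,
        "photos": [photo_filename(config_id, i)
                   for i, _ in enumerate(photo_urls, start=1)],
    }


SAMPLE_HTML = """
<table class="specs">
  <tr><th>Engine</th><td>2.0 L turbo</td></tr>
  <tr><th>Transmission</th><td>8-speed automatic</td></tr>
</table>
"""

record = build_record(
    "cfg-1042",
    SAMPLE_HTML,
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
)
print(json.dumps(record, indent=2))
```

Keeping the filename a pure function of the configuration ID and photo index means a re-run overwrites the same files instead of accumulating duplicates.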

Used cars and Playwright

The used car pipeline is a different beast entirely. Two source platforms, two completely different page structures, and both actively trying to prevent automated access. We chose TypeScript with Playwright (specifically rebrowser-playwright, a fork optimized for stealth) because both sites rely heavily on JavaScript rendering — traditional HTTP scraping would miss most of the content. The parser launches headless browser instances, navigates to listing pages, waits for dynamic content to load, and extracts data from the rendered DOM. Pagination handling differs between sources: one uses infinite scroll (requiring scroll simulation and mutation observation), the other uses numbered pages with URL parameters.
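The numbered-page case can be sketched independently of the browser. The example below is written in Python for illustration, although the parser itself is TypeScript. The query parameter name ("page"), the stop condition (the first empty page), and the fetch_page callback are assumptions; in the real parser, the callback would be the step that renders a page in the browser and returns the listing URLs found on it.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url, page, param="page"):
    """Returns base_url with the page-number query parameter set.

    The parameter name is hypothetical; each source uses its own.
    """
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    query[param] = str(page)
    return urlunsplit(parts._replace(query=urlencode(query)))


def iter_listing_urls(base_url, fetch_page, max_pages=500):
    """Yields listing URLs page by page until a page comes back empty.

    fetch_page(url) is a caller-supplied function returning a list of
    listing URLs for one results page. max_pages is a safety cap so a
    site that never returns an empty page cannot loop forever.
    """
    for page in range(1, max_pages + 1):
        urls = fetch_page(page_url(base_url, page))
        if not urls:
            return
        yield from urls
```

The infinite-scroll source has no equivalent URL scheme, so its loop has to scroll, wait for new nodes to appear in the DOM, and stop when a scroll adds nothing.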

Anti-bot and maintenance

Figure: terminal output showing scraper progress with success and retry counts across different source platforms.

Anti-bot detection was the dominant challenge. Both used car platforms employ fingerprinting, request-rate monitoring, and behavioral analysis, and a naive Playwright script gets blocked within minutes. We implemented browser fingerprint rotation: viewport sizes, user agents, language headers, and WebGL parameters are randomized per session. Request throttling inserts human-like delays between page loads, randomized within a configurable range. When a block is detected (a CAPTCHA page, an HTTP 403, or an empty response where content is expected), the parser rotates to a fresh browser context with a new fingerprint. This cat-and-mouse dynamic means the scrapers require periodic maintenance as detection methods evolve, but the rotation strategy has kept block rates below 5% on average.
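The control flow of rotation, throttling, and block detection can be sketched without a browser. The example below is in Python for illustration; the real parser is TypeScript on Playwright. The value pools, the delay range, the "captcha" substring check, and the fetch(url, fingerprint) callback are all assumptions, and WebGL parameters are left out for brevity.

```python
import random
from dataclasses import dataclass

# Placeholder pools; real values would come from configuration.
VIEWPORTS = [(1366, 768), (1440, 900), (1536, 864), (1920, 1080)]
USER_AGENTS = ["ExampleBrowser/1.0 (placeholder-A)",
               "ExampleBrowser/1.0 (placeholder-B)"]
LOCALES = ["en-US", "en-GB", "de-DE"]


@dataclass(frozen=True)
class Fingerprint:
    viewport: tuple
    user_agent: str
    locale: str


def new_fingerprint(rng):
    """Draws one random fingerprint for a fresh browser context."""
    return Fingerprint(rng.choice(VIEWPORTS),
                       rng.choice(USER_AGENTS),
                       rng.choice(LOCALES))


def human_delay(rng, min_s=2.0, max_s=7.0):
    """Random pause in seconds within a configurable range (assumed bounds)."""
    return rng.uniform(min_s, max_s)


def is_blocked(status, body):
    """Applies the three block signals: HTTP 403, CAPTCHA page, empty body."""
    if status == 403:
        return True
    if "captcha" in body.lower():
        return True
    return not body.strip()


def crawl(urls, fetch, rng, sleep):
    """Fetches each URL, rotating the fingerprint once when a block is seen.

    fetch(url, fingerprint) -> (status, body) stands in for loading the
    page in a browser context configured with that fingerprint.
    """
    fingerprint = new_fingerprint(rng)
    pages = {}
    for url in urls:
        sleep(human_delay(rng))
        status, body = fetch(url, fingerprint)
        if is_blocked(status, body):
            fingerprint = new_fingerprint(rng)  # fresh context, new identity
            status, body = fetch(url, fingerprint)
        if not is_blocked(status, body):
            pages[url] = body
    return pages
```

Passing rng and sleep in as arguments keeps the loop deterministic under test and lets the delay range be tuned per source.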

Normalization and photos

Normalizing data from two different sources into a single schema was the second major challenge. Source A structures used car data with the make and model as flat text fields, generation inferred from the year, and specs scattered across a description blob. Source B has structured dropdowns for make and model but uses a completely different naming convention — abbreviations, regional model names, transliterated brand names. We built a normalization layer that maps source-specific fields into the platform's brand/model/generation/configuration hierarchy, using a combination of exact matching, fuzzy matching, and a manually curated alias table for the most common mismatches. When a listing can't be confidently mapped, it's flagged for manual review rather than silently dropped.
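The matching order described above (exact, then alias table, then fuzzy, then manual review) might look like the following sketch for the brand level. The brand list, the alias entries, and the 0.85 similarity cutoff are illustrative assumptions, and difflib from the Python standard library stands in for whatever fuzzy matcher the normalization layer actually uses.

```python
import difflib

# Illustrative data; the real tables are curated per source.
CANONICAL_BRANDS = ["Volkswagen", "Mercedes-Benz", "Toyota", "Hyundai"]
ALIASES = {"vw": "Volkswagen", "merc": "Mercedes-Benz", "mb": "Mercedes-Benz"}


def normalize_brand(raw, cutoff=0.85):
    """Maps a source brand string to a canonical brand.

    Returns (brand, method), where method is "exact", "alias", "fuzzy",
    or "review". A "review" result carries brand=None so the listing is
    flagged for manual review instead of being dropped.
    """
    key = raw.strip().lower()
    by_lower = {brand.lower(): brand for brand in CANONICAL_BRANDS}

    if key in by_lower:
        return by_lower[key], "exact"
    if key in ALIASES:
        return ALIASES[key], "alias"

    # Fuzzy match only accepts candidates at or above the cutoff.
    close = difflib.get_close_matches(key, list(by_lower), n=1, cutoff=cutoff)
    if close:
        return by_lower[close[0]], "fuzzy"
    return None, "review"


for raw in [" toyota ", "VW", "Volkswagon", "Zzz"]:
    print(repr(raw), "->", normalize_brand(raw))
```

Returning the method alongside the match makes it easy to audit how many listings relied on fuzzy matching and to promote recurring fuzzy hits into the alias table.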

Photo downloads turned out to be the most operationally fragile part of the pipeline. Used car listings typically have 10-30 photos each, and scraping thousands of listings means downloading tens of thousands of images. Downloads fail for all the usual reasons — timeouts, rate limits, broken URLs, servers returning HTML error pages instead of image data. We implemented retry logic with exponential backoff and a resume capability: the pipeline tracks which photos for which listings have been successfully downloaded, and on restart, it picks up where it left off rather than starting from scratch. A single full ingestion run for one source platform takes about six hours; without resume capability, any failure in hour five would mean starting over.
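A minimal sketch of the retry-and-resume idea follows. The JSON manifest file, the attempt count, the delay schedule, and the magic-byte check for JPEG or PNG data are assumptions chosen for illustration; fetch(url) is a caller-supplied function that returns the response bytes.

```python
import json
import time
from pathlib import Path


def download_with_retry(fetch, url, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Calls fetch(url) with exponential backoff between failed attempts.

    Waits base_delay, then 2x, then 4x, and re-raises after the last try.
    """
    for attempt in range(attempts):
        try:
            data = fetch(url)
            # Reject HTML error pages served in place of image data.
            if not data.startswith((b"\xff\xd8", b"\x89PNG")):
                raise ValueError("response is not a JPEG or PNG image")
            return data
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def download_all(photos, fetch, out_dir, manifest_path, sleep=time.sleep):
    """Downloads (listing_id, index, url) tuples, skipping completed ones.

    The manifest records every finished photo, so a restarted run
    resumes where the previous one stopped.
    """
    out_dir = Path(out_dir)
    manifest_path = Path(manifest_path)
    done = set(json.loads(manifest_path.read_text())) if manifest_path.exists() else set()

    for listing_id, index, url in photos:
        key = f"{listing_id}/{index}"
        if key in done:
            continue
        try:
            data = download_with_retry(fetch, url, sleep=sleep)
        except Exception:
            continue  # not recorded, so the next run tries it again
        (out_dir / f"{listing_id}_{index:02d}.jpg").write_bytes(data)
        done.add(key)
        # Persist after every photo so a crash loses at most one download.
        manifest_path.write_text(json.dumps(sorted(done)))
    return done
```

Writing the manifest after each photo costs some I/O, but it is what makes a failure late in a multi-hour run cheap to recover from.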

Idempotent seeding and ops

The pipelines seed the database through Prisma, running in a dedicated seeding mode that uses upsert operations. If a car configuration already exists (matched by a composite key of brand, model, generation, and source ID), the record is updated rather than duplicated. This idempotency means we can re-run the pipelines safely — useful when source data gets corrected or when we add new fields to extract. The initial seeding populated the marketplace with over 4,000 new car configurations and 12,000 used car listings, giving the platform a credible catalog from day one.
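The effect of upserting on a composite key can be demonstrated with a small stand-in. The sketch below uses Python's built-in sqlite3 and a raw INSERT ... ON CONFLICT statement rather than Prisma and PostgreSQL, purely so it runs anywhere; the table name, the columns, and the single price field are assumptions.

```python
import sqlite3

# Hypothetical table; the composite primary key mirrors the match key
# of brand, model, generation, and source ID.
SCHEMA = """
CREATE TABLE car_configuration (
    brand      TEXT NOT NULL,
    model      TEXT NOT NULL,
    generation TEXT NOT NULL,
    source_id  TEXT NOT NULL,
    price      INTEGER,
    PRIMARY KEY (brand, model, generation, source_id)
)
"""

# Insert a new row, or update the existing row with the same key.
UPSERT = """
INSERT INTO car_configuration (brand, model, generation, source_id, price)
VALUES (:brand, :model, :generation, :source_id, :price)
ON CONFLICT (brand, model, generation, source_id)
DO UPDATE SET price = excluded.price
"""


def seed(conn, records):
    """Upserts all records in one transaction; safe to run repeatedly."""
    with conn:
        conn.executemany(UPSERT, records)


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)

row = {"brand": "Toyota", "model": "Corolla", "generation": "E210",
       "source_id": "src-1", "price": 21000}
seed(conn, [row])
seed(conn, [{**row, "price": 20500}])  # re-run with corrected source data

print(conn.execute("SELECT COUNT(*), MAX(price) FROM car_configuration").fetchone())
```

The second run changes the price in place and leaves a single row, which is the property that makes re-running a pipeline after a source correction safe.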

The bigger takeaway: data ingestion is infrastructure, not a one-time script. These pipelines run on a schedule, need monitoring, break when source sites change, and require ongoing maintenance. Treating them as throwaway scripts would have been a mistake — the normalization layer, the resume logic, the fingerprint rotation, and the idempotent seeding are all investments that pay off every time the pipeline runs. The Python scraper hasn't needed changes in months because its source is stable. The Playwright parsers need attention every few weeks. That asymmetry is just the nature of scraping, and planning for it from the start saved us from the panic of a broken pipeline the night before a launch.

Stack

New car pipeline: Python, requests, BeautifulSoup

Used car pipeline: TypeScript, Playwright / rebrowser-playwright

Data layer: Prisma 6 (seeding), PostgreSQL

Normalization: Fuzzy matching, curated alias tables

Infra: Retry with resume, fingerprint rotation, scheduled runs