
Idealo Price Optimization Platform

6 min read · Kingsley Onoh


Idealo.de is a comparison shopping platform where every seller's price sits next to every other seller's price. A product listed €2 above the cheapest offer gets almost no traffic. The sellers who win on Idealo aren't the ones with the best products. They're the ones who reprice fastest. This system was designed so that a seller can submit a list of product IDs, get back live competitor data scraped from Idealo's offer pages, and receive a recommended price that accounts for competitor positioning, shop reputation, delivery speed, and margin floors. The entire chain (scrape, preprocess, optimize) runs as an async pipeline behind a REST API.

System Topology

Infrastructure Decisions

  • Compute: Django served by Gunicorn, orchestrated with Docker Compose (5 services: web, celery, redis, postgres, ngrok). Chosen over a serverless approach because scraping jobs can run for 2+ minutes per product, the Gunicorn timeout is set to 1,300 seconds to accommodate long-running report polling, and the Celery worker needs persistent access to the PostgreSQL database for upserts during scraping. Deployed on Render with a separate worker process for Celery.
  • Data Layer: PostgreSQL 14 with 8 Django models: Product, Seller, Offer, PriceHistory, PricePoint, OptimizationTask, ScrapeTask, and IdealoOfferReport. Chosen over SQLite because Celery workers need concurrent write access during parallel scrape tasks. The Offer model has a composite unique constraint on (product, seller, product_link) to prevent duplicate entries after re-scraping. Seller is stored as a first-class entity (not just a string field) with its own ManyToMany relationship to products.
  • Task Queue: Celery with Redis 6 (Alpine image) as the broker. Chose Celery over a simple threading approach because scraping multiple products needs parallel execution with failure isolation. Tasks use group and chord patterns: a group of scrape_product_task calls runs in parallel, then a chord callback assembles results and updates the ScrapeTask record. Memory safety: a 450MB threshold in check_memory_usage() kills the worker process if memory consumption exceeds it. The Render deployment caps Celery at --max-tasks-per-child=3 and --max-memory-per-child=120000 to limit leak accumulation.
  • Scraping Layer: Requests + BeautifulSoup, not Playwright or Selenium. Chose requests because Idealo's offer list pages return server-rendered HTML. No JavaScript rendering required. The scraper paginates through Idealo's offer list structure, deduplicates offers by a composite key (Product Name, Price, Seller), and detects duplicate pages to break infinite pagination loops after 2 consecutive matches.
  • Optimization Algorithm: Decimal-precision arithmetic (Python Decimal with precision 10). Chose Decimal over float because pricing involves €0.05 increments and percentage-based adjustments where floating-point rounding errors compound. The algorithm uses two strategies depending on current rank: if rank 1, try to raise price within the gap below the second-cheapest competitor; if not rank 1, undercut the cheapest by 5% but never below the margin floor.
  • External Integration: Idealo Business API via OAuth2 (client_credentials flow). The handler polls report status every 10 seconds, downloads the completed report as a ZIP containing a CSV, parses it in memory, and upserts to the IdealoOfferReport model. No intermediate file I/O. The offer report feed is the authoritative source for the seller's own product catalog on Idealo.
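The group-and-chord pipeline itself is Celery-specific, but the memory kill switch can be sketched with only the standard library. A minimal version of check_memory_usage might look like this; it assumes Linux's KB-denominated ru_maxrss, and the real implementation may read RSS differently (e.g. via psutil):

```python
import os
import resource
import signal

MEMORY_LIMIT_MB = 450  # threshold from the brief


def rss_mb() -> float:
    """Peak resident set size of this process in MB.

    Assumes Linux, where ru_maxrss is reported in kilobytes
    (macOS reports bytes, so the unit conversion would differ).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def check_memory_usage(limit_mb: float = MEMORY_LIMIT_MB) -> bool:
    """Return True while under the limit; otherwise hard-kill the worker.

    Celery restarts killed workers, and --max-tasks-per-child bounds
    how much leak any single worker can accumulate before recycling.
    """
    if rss_mb() < limit_mb:
        return True
    os.kill(os.getpid(), signal.SIGKILL)  # hard kill; no cleanup runs
    return False  # unreachable; kept for type completeness
```

SIGKILL is deliberate here: a leaking worker mid-parse may not respond to graceful shutdown, and the supervisor-restart path is the recovery mechanism.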
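The composite-key dedup and the infinite-pagination guard from the scraping layer can be sketched independently of the HTML parsing. The dict keys below are assumptions, and this sketch treats a page that yields nothing new as a "duplicate page":

```python
from typing import Iterable, Iterator


def dedupe_offers(pages: Iterable[list[dict]],
                  stop_after_dupes: int = 2) -> Iterator[dict]:
    """Walk paginated offer lists, yielding each offer once.

    Offers are keyed by (name, price, seller), matching the composite
    key described above. If `stop_after_dupes` consecutive pages
    contribute nothing new, pagination is assumed to be looping and
    the walk stops.
    """
    seen: set[tuple] = set()
    consecutive_dupes = 0
    for page in pages:
        new_on_page = 0
        for offer in page:
            key = (offer["name"], offer["price"], offer["seller"])
            if key in seen:
                continue  # already scraped on an earlier page
            seen.add(key)
            new_on_page += 1
            yield offer
        if new_on_page == 0:
            consecutive_dupes += 1
            if consecutive_dupes >= stop_after_dupes:
                break  # pagination loop detected
        else:
            consecutive_dupes = 0
```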
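The two-strategy repricer could be sketched as follows. Only the 5% undercut, the 15% margin floor, and the Decimal precision come from the brief; the function signature and variable names are illustrative:

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 10           # Decimal precision noted in the brief

STEP = Decimal("0.05")           # repricing increment
UNDERCUT_PCT = Decimal("0.05")   # 5% undercut when not rank 1
MARGIN_PCT = Decimal("0.15")     # 15% minimum margin over cost


def _to_cents(p: Decimal) -> Decimal:
    return p.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def recommend_price(current: Decimal, cost: Decimal,
                    competitors: list[Decimal]) -> Decimal:
    """Sketch of the two-strategy repricer described above.

    Rank 1: raise into the gap just below the cheapest competitor.
    Otherwise: undercut the cheapest competitor by 5%.
    Either way, never drop below the 15% margin floor.
    """
    floor = cost * (1 + MARGIN_PCT)
    cheapest = min(competitors)
    if current <= cheapest:
        # Already rank 1: recover margin without losing the rank.
        candidate = max(cheapest - STEP, current)
    else:
        # Not rank 1: undercut the cheapest offer.
        candidate = cheapest * (1 - UNDERCUT_PCT)
    return _to_cents(max(candidate, floor))
```

Decimal avoids the compounding float errors the brief calls out: `cheapest * Decimal("0.95")` is exact to the configured precision, where the float equivalent drifts by fractions of a cent per adjustment.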
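The in-memory ZIP-to-CSV path can be reproduced with zipfile and io alone. The semicolon delimiter and column names here are assumptions about the report format; the point is the mechanism, with no intermediate file I/O:

```python
import csv
import io
import zipfile


def parse_report_zip(zip_bytes: bytes) -> list[dict]:
    """Unpack a report ZIP entirely in memory and parse its first CSV
    member into row dicts, ready for upserting into the database.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Assumes the archive contains exactly one CSV member.
        csv_name = next(n for n in zf.namelist() if n.endswith(".csv"))
        with zf.open(csv_name) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            # Delimiter is an assumption; adjust to the actual report.
            return list(csv.DictReader(text, delimiter=";"))
```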

Constraints That Shaped the Design

  • Input: Idealo product IDs (either submitted manually via the API or extracted from the Idealo Business API offer report). HTML offer pages scraped with a 1-second sleep between requests. Offer data includes: product name, price, shipping price, seller name, shop rating, number of ratings, delivery date.
  • Output: Optimized price recommendations via JSON API, including initial rank, new rank, initial price, optimized price, cost price, and adjusted shipping cost. Aggregated offer data (store rank, total listings, cheapest competitor, second cheapest) also available as a separate endpoint.
  • Scale Handled: The system handles products one at a time or in batches via Celery groups. The Celery chord timeout is 120 seconds per product. At 500+ products per batch, the current architecture would need task chunking to avoid Redis memory pressure and chord callback accumulation.
  • Hard Constraints: Idealo returns 409 if a report is already in progress for the same shop. The scraper uses a 1-second sleep between page fetches (configurable via settings.json). The optimization algorithm enforces a 15% minimum margin floor. Delivery adjustments cap at +3% for 3-day delivery and -4% for 14+ day delivery. Shop rating normalization uses a 5,000-review ceiling for the ratings weight.
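The delivery and rating constraints above can be expressed as small normalization helpers. Only the endpoint values (+3% at 3-day delivery, -4% at 14+ days, the 5,000-review ceiling) come from the brief; the linear interpolation between the delivery endpoints is an assumption:

```python
from decimal import Decimal

RATINGS_CEILING = 5000  # review-count ceiling for the ratings weight


def delivery_adjustment(days: int) -> Decimal:
    """Price adjustment factor for delivery speed.

    Capped at +3% for 3-day delivery and -4% for 14+ days; the linear
    ramp between those endpoints is an assumption of this sketch.
    """
    if days <= 3:
        return Decimal("0.03")
    if days >= 14:
        return Decimal("-0.04")
    span = Decimal(days - 3) / Decimal(14 - 3)
    return Decimal("0.03") + span * Decimal("-0.07")


def ratings_weight(review_count: int) -> Decimal:
    """Normalize review volume against the 5,000-review ceiling,
    so a shop with 5,000+ reviews gets the full weight of 1."""
    return Decimal(min(review_count, RATINGS_CEILING)) / RATINGS_CEILING
```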

Decision Log

  • Django REST Framework with Swagger (drf-yasg) over FastAPI or Flask: The system needs ORM-managed database models with migrations, an admin panel for data inspection during development, and a Celery integration that benefits from Django's settings infrastructure. FastAPI would have required SQLAlchemy setup and manual migration tooling.
  • Celery chords for parallel scraping over sequential per-product scraping: Scraping 20 products sequentially at 1 second of sleep plus response time per page would take 10+ minutes. Celery groups parallelize across worker slots, and the chord callback aggregates results and updates the ScrapeTask status atomically.
  • BeautifulSoup over Playwright/Selenium: Idealo's offer list pages are server-rendered HTML, so no JavaScript execution is needed. BeautifulSoup parses the response in milliseconds; Playwright would have added browser overhead and a heavier Docker image for no benefit.
  • Composite offer key (product, seller, product_link) over simple product_id + seller_name uniqueness: Multiple sellers can list the same product at different URLs with different shipping terms, and product_link distinguishes a seller's direct storefront from their marketplace listing. Without it, re-scraping would create duplicate offers.
  • 450MB memory kill switch in the Celery worker over letting the OS handle OOM: Render's free tier has limited memory, and a worker that leaks through BeautifulSoup parsing and DataFrame conversion can grow past the allocation and trigger an OOM kill with no cleanup. The 450MB threshold force-kills the worker process, which Celery restarts via --max-tasks-per-child.
  • Ngrok tunnel in Docker Compose over a VPN or public deployment for the Excel add-in: The Excel add-in needs HTTPS to connect to the API during local development, and ngrok provides a public HTTPS tunnel to the local Docker web service without DNS configuration. The 4040 inspection port allows debugging webhook traffic.
#django #celery #postgresql #idealo-api #price-optimization
