
Idealo Price Optimization Platform

6 min read · Kingsley Onoh


Idealo.de is a comparison shopping platform where every seller's price sits next to every other seller's price. A product listed €2 above the cheapest offer gets almost no traffic. The sellers who win on Idealo aren't the ones with the best products. They're the ones who reprice fastest. This system was designed so that a seller can submit a list of product IDs, get back live competitor data scraped from Idealo's offer pages, and receive a recommended price that accounts for competitor positioning, shop reputation, delivery speed, and margin floors. The entire chain (scrape, preprocess, optimize) runs as an async pipeline behind a REST API.

System Topology

Infrastructure Decisions

  • Compute: Django served by Gunicorn, orchestrated with Docker Compose (5 services: web, celery, redis, postgres, ngrok). Chosen over a serverless approach because scraping jobs can run for 2+ minutes per product, the Gunicorn timeout is set to 1,300 seconds to accommodate long-running report polling, and the Celery worker needs persistent access to the PostgreSQL database for upserts during scraping. Deployed on Render with a separate worker process for Celery.
  • Data Layer: PostgreSQL 14 with 8 Django models: Product, Seller, Offer, PriceHistory, PricePoint, OptimizationTask, ScrapeTask, and IdealoOfferReport. Chosen over SQLite because Celery workers need concurrent write access during parallel scrape tasks. The Offer model has a composite unique constraint on (product, seller, product_link) to prevent duplicate entries after re-scraping. Seller is stored as a first-class entity (not just a string field) with its own ManyToMany relationship to products.
  • Task Queue: Celery with Redis 6 (Alpine image) as the broker. Chose Celery over a simple threading approach because scraping multiple products needs parallel execution with failure isolation. Tasks use group and chord patterns: a group of scrape_product_task calls runs in parallel, then a chord callback assembles results and updates the ScrapeTask record. Memory safety: a 450MB threshold in check_memory_usage() kills the worker process if memory consumption exceeds it. The Render deployment caps Celery at --max-tasks-per-child=3 and --max-memory-per-child=120000 to limit leak accumulation.
  • Scraping Layer: Requests + BeautifulSoup, not Playwright or Selenium. Chose requests because Idealo's offer list pages return server-rendered HTML. No JavaScript rendering required. The scraper paginates through Idealo's offer list structure, deduplicates offers by a composite key (Product Name, Price, Seller), and detects duplicate pages to break infinite pagination loops after 2 consecutive matches.
  • Optimization Algorithm: Decimal-precision arithmetic (Python Decimal with precision 10). Chose Decimal over float because pricing involves €0.05 increments and percentage-based adjustments where floating-point rounding errors compound. The algorithm uses two strategies depending on current rank: if rank 1, try to raise price within the gap below the second-cheapest competitor; if not rank 1, undercut the cheapest by 5% but never below the margin floor.
  • External Integration: Idealo Business API via OAuth2 (client_credentials flow). The handler polls report status every 10 seconds, downloads the completed report as a ZIP containing a CSV, parses it in memory, and upserts to the IdealoOfferReport model. No intermediate file I/O. The offer report feed is the authoritative source for the seller's own product catalog on Idealo.
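The group-and-chord pipeline itself is Celery-specific, but the memory kill switch can be sketched with only the standard library. A minimal version of check_memory_usage might look like this; it assumes Linux's KB-denominated ru_maxrss, and the real implementation may read RSS differently (e.g. via psutil):

```python
import os
import resource
import signal

MEMORY_LIMIT_MB = 450  # threshold from the brief


def rss_mb() -> float:
    """Peak resident set size of this process in MB.

    Assumes Linux, where ru_maxrss is reported in kilobytes
    (macOS reports bytes, so the unit conversion would differ).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def check_memory_usage(limit_mb: float = MEMORY_LIMIT_MB) -> bool:
    """Return True while under the limit; otherwise hard-kill the worker.

    Celery restarts killed workers, and --max-tasks-per-child bounds
    how much leak any single worker can accumulate before recycling.
    """
    if rss_mb() < limit_mb:
        return True
    os.kill(os.getpid(), signal.SIGKILL)  # hard kill; no cleanup runs
    return False  # unreachable; kept for type completeness
```

SIGKILL is deliberate here: a leaking worker mid-parse may not respond to graceful shutdown, and the supervisor-restart path is the recovery mechanism.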
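The composite-key dedup and the infinite-pagination guard from the scraping layer can be sketched independently of the HTML parsing. The dict keys below are assumptions, and this sketch treats a page that yields nothing new as a "duplicate page":

```python
from typing import Iterable, Iterator


def dedupe_offers(pages: Iterable[list[dict]],
                  stop_after_dupes: int = 2) -> Iterator[dict]:
    """Walk paginated offer lists, yielding each offer once.

    Offers are keyed by (name, price, seller), matching the composite
    key described above. If `stop_after_dupes` consecutive pages
    contribute nothing new, pagination is assumed to be looping and
    the walk stops.
    """
    seen: set[tuple] = set()
    consecutive_dupes = 0
    for page in pages:
        new_on_page = 0
        for offer in page:
            key = (offer["name"], offer["price"], offer["seller"])
            if key in seen:
                continue  # already scraped on an earlier page
            seen.add(key)
            new_on_page += 1
            yield offer
        if new_on_page == 0:
            consecutive_dupes += 1
            if consecutive_dupes >= stop_after_dupes:
                break  # pagination loop detected
        else:
            consecutive_dupes = 0
```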
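The two-strategy repricer could be sketched as follows. Only the 5% undercut, the 15% margin floor, and the Decimal precision come from the brief; the function signature and variable names are illustrative:

```python
from decimal import Decimal, ROUND_HALF_UP, getcontext

getcontext().prec = 10           # Decimal precision noted in the brief

STEP = Decimal("0.05")           # repricing increment
UNDERCUT_PCT = Decimal("0.05")   # 5% undercut when not rank 1
MARGIN_PCT = Decimal("0.15")     # 15% minimum margin over cost


def _to_cents(p: Decimal) -> Decimal:
    return p.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def recommend_price(current: Decimal, cost: Decimal,
                    competitors: list[Decimal]) -> Decimal:
    """Sketch of the two-strategy repricer described above.

    Rank 1: raise into the gap just below the cheapest competitor.
    Otherwise: undercut the cheapest competitor by 5%.
    Either way, never drop below the 15% margin floor.
    """
    floor = cost * (1 + MARGIN_PCT)
    cheapest = min(competitors)
    if current <= cheapest:
        # Already rank 1: recover margin without losing the rank.
        candidate = max(cheapest - STEP, current)
    else:
        # Not rank 1: undercut the cheapest offer.
        candidate = cheapest * (1 - UNDERCUT_PCT)
    return _to_cents(max(candidate, floor))
```

Decimal avoids the compounding float errors the brief calls out: `cheapest * Decimal("0.95")` is exact to the configured precision, where the float equivalent drifts by fractions of a cent per adjustment.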
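The in-memory ZIP-to-CSV path can be reproduced with zipfile and io alone. The semicolon delimiter and column names here are assumptions about the report format; the point is the mechanism, with no intermediate file I/O:

```python
import csv
import io
import zipfile


def parse_report_zip(zip_bytes: bytes) -> list[dict]:
    """Unpack a report ZIP entirely in memory and parse its first CSV
    member into row dicts, ready for upserting into the database.
    """
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        # Assumes the archive contains exactly one CSV member.
        csv_name = next(n for n in zf.namelist() if n.endswith(".csv"))
        with zf.open(csv_name) as fh:
            text = io.TextIOWrapper(fh, encoding="utf-8")
            # Delimiter is an assumption; adjust to the actual report.
            return list(csv.DictReader(text, delimiter=";"))
```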

Constraints That Shaped the Design

  • Input: Idealo product IDs (either submitted manually via the API or extracted from the Idealo Business API offer report). HTML offer pages scraped with a 1-second sleep between requests. Offer data includes: product name, price, shipping price, seller name, shop rating, number of ratings, delivery date.
  • Output: Optimized price recommendations via JSON API, including initial rank, new rank, initial price, optimized price, cost price, and adjusted shipping cost. Aggregated offer data (store rank, total listings, cheapest competitor, second cheapest) also available as a separate endpoint.
  • Scale Handled: The system handles products one at a time or in batches via Celery groups. The Celery chord timeout is 120 seconds per product. At 500+ products per batch, the current architecture would need task chunking to avoid Redis memory pressure and chord callback accumulation.
  • Hard Constraints: Idealo returns 409 if a report is already in progress for the same shop. The scraper uses a 1-second sleep between page fetches (configurable via settings.json). The optimization algorithm enforces a 15% minimum margin floor. Delivery adjustments cap at +3% for 3-day delivery and -4% for 14+ day delivery. Shop rating normalization uses a 5,000-review ceiling for the ratings weight.
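The delivery and rating constraints above can be expressed as small normalization helpers. Only the endpoint values (+3% at 3-day delivery, -4% at 14+ days, the 5,000-review ceiling) come from the brief; the linear interpolation between the delivery endpoints is an assumption:

```python
from decimal import Decimal

RATINGS_CEILING = 5000  # review-count ceiling for the ratings weight


def delivery_adjustment(days: int) -> Decimal:
    """Price adjustment factor for delivery speed.

    Capped at +3% for 3-day delivery and -4% for 14+ days; the linear
    ramp between those endpoints is an assumption of this sketch.
    """
    if days <= 3:
        return Decimal("0.03")
    if days >= 14:
        return Decimal("-0.04")
    span = Decimal(days - 3) / Decimal(14 - 3)
    return Decimal("0.03") + span * Decimal("-0.07")


def ratings_weight(review_count: int) -> Decimal:
    """Normalize review volume against the 5,000-review ceiling,
    so a shop with 5,000+ reviews gets the full weight of 1."""
    return Decimal(min(review_count, RATINGS_CEILING)) / RATINGS_CEILING
```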

Decision Log

  • Django REST Framework with Swagger (drf-yasg) over FastAPI or Flask: The system needs ORM-managed database models with migrations, an admin panel for data inspection during development, and a Celery integration that benefits from Django's settings infrastructure. FastAPI would have required SQLAlchemy setup and manual migration tooling.
  • Celery chords for parallel scraping over sequential per-product scraping: Scraping 20 products sequentially at 1 second of sleep plus response time per page would take 10+ minutes. Celery groups parallelize across worker slots, and the chord callback aggregates results and updates the ScrapeTask status atomically.
  • BeautifulSoup over Playwright/Selenium: Idealo's offer list pages are server-rendered HTML, so no JavaScript execution is needed. BeautifulSoup parses the response in milliseconds; Playwright would have added browser overhead and a heavier Docker image for no benefit.
  • Composite offer key (product, seller, product_link) over simple product_id + seller_name uniqueness: Multiple sellers can list the same product at different URLs with different shipping terms, and product_link distinguishes a seller's direct storefront from their marketplace listing. Without it, re-scraping would create duplicate offers.
  • 450MB memory kill switch in the Celery worker over letting the OS handle OOM: Render's free tier has limited memory, and a worker that leaks through BeautifulSoup parsing and DataFrame conversion can grow past the allocation and trigger an OOM kill with no cleanup. The 450MB threshold force-kills the worker process, which Celery restarts via --max-tasks-per-child.
  • Ngrok tunnel in Docker Compose over a VPN or public deployment for the Excel add-in: The Excel add-in needs HTTPS to connect to the API during local development, and ngrok provides a public HTTPS tunnel to the local Docker web service without DNS configuration. The 4040 inspection port allows debugging webhook traffic.
#django #celery #postgresql #idealo-api #price-optimization
