
Why I Run the Entire Pipeline Twice to Match Products

From the Inselvolt Pricing Engine system

7 min read · Kingsley Onoh

The pricing spreadsheet hit 4,162 rows and the retailer stopped updating it. Not because it was hard. Because it took 90 minutes every morning and the prices were wrong by the time he finished. The system I built to replace that process worked on the first deploy. Pulled supplier data, applied markup rules, synced to Shopify. Clean run. But 600 products had no supplier price attached. The logs showed them as "unmatched." Nearly fifteen percent of the catalog, sitting in the store at stale prices.

The first-pass matching was straightforward: take the product's store SKU, find the same SKU in the supplier's price feed, pull the price. For 3,500 products, this worked. For the rest, the store's SKU and the supplier's SKU for the same physical product were completely different strings. One supplier uses its own numbering scheme; another uses something else entirely. The same inverter might be STP8.0-3AV-40 in the store and a nine-digit catalog number in the supplier's feed.

The Obvious Fix (and Why It Failed)

The first approach I considered was fuzzy string matching. Run a similarity algorithm across all unmatched SKUs and their possible supplier SKU counterparts, pick the closest match above some threshold.

I abandoned this within a few hours. Solar equipment part numbers are dense alphanumeric strings. STP8.0-3AV-40 and STP10.0-3AV-40 are different products at different price points, one an 8kW inverter and one a 10kW. A fuzzy matcher scores them as 90%+ similar. Attaching the wrong supplier price to a product and pushing it to the live store would be worse than no match at all. One bad price on a €3,000 inverter and the retailer is either losing money or losing the customer.
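The failure mode is easy to reproduce with the standard library. Here `difflib.SequenceMatcher` stands in for whichever similarity algorithm you'd reach for; the threshold is illustrative:

```python
from difflib import SequenceMatcher

a = "STP8.0-3AV-40"   # 8 kW inverter
b = "STP10.0-3AV-40"  # 10 kW inverter: a different product at a different price

# The two part numbers differ by a single model digit, so the similarity
# score lands well above any plausible auto-accept threshold.
ratio = SequenceMatcher(None, a, b).ratio()
assert ratio > 0.85   # a fuzzy matcher would happily merge these two products
```

That assertion passing is exactly the problem: the closer two part numbers are as strings, the more likely they are adjacent models, not the same product.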

The Secondary Key

The Shopify catalog already had a metafield for the supplier's own part number. The operations team had been manually entering supplier part numbers for products where the store SKU didn't match. Not for all products, and not consistently, but enough to cover most of the gap.

I restructured the pipeline to run twice. Pass 1 matches on the store's product SKU. Pass 2 takes the unmatched products, swaps in the supplier part number metafield as the matching key, and runs the entire pipeline again: supplier data pull, normalization, merge, pricing rules, competition analysis.

The orchestrator function iterates over the two SKU fields. Before Pass 2 starts, it copies the original product SKU into a separate column so that exclusion rules (which reference the product SKU, not the supplier SKU) still work correctly during the second pass.
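A minimal sketch of the two passes, with hypothetical column names (`product_sku` and `supplier_pn` in the catalog; `sku` and `price` in the feed) and the whole pipeline reduced to a merge. In the real system each pass runs the full pull/normalize/price sequence:

```python
import pandas as pd

def two_pass_match(products: pd.DataFrame, feed: pd.DataFrame) -> pd.DataFrame:
    # Pass 1: join on the store's own SKU.
    p1 = products.merge(feed, left_on="product_sku", right_on="sku", how="left")
    matched = p1[p1["price"].notna()].copy()
    matched["original_sku"] = matched["product_sku"]

    # Pass 2: retry the leftovers on the supplier part number metafield,
    # preserving the original SKU in its own column for exclusion lookups.
    rest = p1[p1["price"].isna()].drop(columns=["sku", "price"]).copy()
    rest["original_sku"] = rest["product_sku"]
    rest = rest[rest["supplier_pn"].notna()]  # rows with no second key stay unmatched
    p2 = rest.merge(feed, left_on="supplier_pn", right_on="sku", how="left")

    return pd.concat([matched, p2], ignore_index=True)
```

Products with neither key resolvable are simply dropped here; the real pipeline logs them as unmatched instead.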

Before the second pass runs, there's a filter that strips out rows where the supplier part number is empty or NaN. Without this filter, empty values would match empty values in the supplier data, creating false positives. I learned that one the hard way: the initial version of Pass 2 created ghost matches where NaN matched NaN and products got assigned arbitrary supplier prices.
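The ghost-match behavior falls straight out of how pandas handles missing join keys: unlike SQL NULLs, NaN is treated as equal to NaN in a merge, so an empty metafield happily matches an empty supplier SKU. A contrived reproduction, with illustrative column names:

```python
import numpy as np
import pandas as pd

store = pd.DataFrame({
    "product_sku": ["STP8.0-3AV-40", "SB3.0-1AV-41"],
    "supplier_pn": ["123456789", np.nan],   # second product: metafield never filled in
})
feed = pd.DataFrame({
    "supplier_pn": ["123456789", np.nan],   # supplier feed also contains a blank-key row
    "price": [1595.68, 42.00],
})

# NaN join keys match each other, so the blank metafield picks up
# the blank feed row's price: a ghost match.
ghost = store.merge(feed, on="supplier_pn", how="left")

# The fix: strip keyless rows before the second pass ever runs.
clean = store.dropna(subset=["supplier_pn"]).merge(feed, on="supplier_pn", how="left")
```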

The Deduplication Problem

Running the pipeline twice means some products get matched in both passes. A product whose store SKU and supplier SKU both exist in the feed would appear twice in the output, potentially with two different prices. The deduplication logic sits in a supplier selection function that resolves conflicts.

The selection priority isn't just "pick the lowest price." It's layered:

  1. If the product has a pricing exclusion rule (fixed price or fixed margin set by the operations team), that takes priority over any calculated price. This prevents the automation from overriding manual pricing decisions on specific SKUs.
  2. If the lowest-priced row has zero inventory, the system picks the cheapest option that actually has stock, regardless of price difference. Showing a price the retailer can't fulfill is worse than showing a higher price.
  3. Otherwise, lowest price wins.
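The layered priority reads naturally as a small selection function. The record shapes here are hypothetical stand-ins for the pipeline's DataFrame rows:

```python
def select_supplier_row(rows, exclusions):
    """Resolve duplicate matches for one product.

    rows: candidate supplier offers, each {'product_sku', 'supplier_price', 'stock'}.
    exclusions: {product_sku: fixed-price row set manually by operations}.
    """
    sku = rows[0]["product_sku"]
    if sku in exclusions:                       # 1. manual pricing decision wins outright
        return exclusions[sku]
    in_stock = [r for r in rows if r["stock"] > 0]
    if in_stock:                                # 2. cheapest offer that actually has stock
        return min(in_stock, key=lambda r: r["supplier_price"])
    return min(rows, key=lambda r: r["supplier_price"])  # 3. lowest price overall
```

Note that rule 2 ignores the price gap entirely: any in-stock offer beats any out-of-stock one.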

The exclusion check was a bug source. During Pass 2, the SKU column gets swapped with the supplier part number for matching. But exclusion rules reference the original product SKU. The early version checked exclusions against the swapped SKU and missed every exclusion. I had to persist the original product SKU in a separate column through the second pass specifically for exclusion lookups.

What Actually Made This Hard

The pricing rules themselves are bracket-based. A Google Sheet defines price ranges with corresponding standard and minimum markup percentages. Products between €0-100 get a different markup than products between €1,000-5,000. A pricing lookup function matches each product's supplier price against the bracket boundaries and returns the applicable markup rate.
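A sketch of that lookup, with invented bracket boundaries and rates (the real ones live in the Google Sheet):

```python
# (low, high, standard markup, minimum markup) -- hypothetical values
BRACKETS = [
    (0, 100, 0.35, 0.20),
    (100, 1000, 0.25, 0.15),
    (1000, 5000, 0.12, 0.08),
]

def markup_for(supplier_price):
    """Return (standard, minimum) markup rates for a supplier price."""
    for low, high, standard, minimum in BRACKETS:
        if low <= supplier_price < high:
            return standard, minimum
    # Prices beyond the last boundary fall back to the top bracket's rates.
    return BRACKETS[-1][2], BRACKETS[-1][3]
```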

But before these rules fire, one product category gets a category-specific surcharge applied to the supplier price itself. This runs first, adjusting the base cost. Then the pricing rules engine treats the uplifted number as the "supplier price" for markup calculation. Getting the order right matters: if you apply the surcharge after the markup, you're compounding. If you apply it before, the markup percentages stay clean. I got this wrong in the first version and it took three hours to find because the prices looked plausible. They were just consistently a few percent too high on every product in that category.
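One concrete way the ordering bites, assuming the bracket lookup keys off whatever cost it's handed: a surcharged cost can land in a different bracket than the raw cost, so the two orders produce different prices near a boundary. A self-contained toy with invented rates:

```python
BRACKETS = [(0, 100, 0.35), (100, 5000, 0.25)]  # hypothetical (low, high, markup)
SURCHARGE = 0.05                                 # hypothetical category surcharge

def markup_for(price):
    for low, high, rate in BRACKETS:
        if low <= price < high:
            return rate
    return BRACKETS[-1][2]

def surcharge_then_markup(cost):
    adjusted = cost * (1 + SURCHARGE)             # uplift the base cost first
    return adjusted * (1 + markup_for(adjusted))  # bracket sees the adjusted cost

def markup_then_surcharge(cost):
    priced = cost * (1 + markup_for(cost))        # bracket sees the raw cost
    return priced * (1 + SURCHARGE)

# Near a bracket boundary the two orders disagree: a 98 euro cost becomes
# 102.90 after the surcharge and crosses into the next bracket.
```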

The competition pricing layer adds another dimension. After markup rules, the system downloads live Idealo offers, finds the cheapest competitor for each product, and undercuts by a configurable euro amount. But some competitors are gray-market sellers the retailer doesn't want to compete with. The system maintains both a global ignore list and per-EAN ignore lists. If the cheapest seller is on the ignore list, the system targets the second-cheapest instead. If the second-cheapest is also ignored, no competition adjustment happens.
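The target-selection rule translates directly; seller names and the offer shape are illustrative:

```python
def competition_target(offers, ignored):
    """Pick the competitor price to undercut.

    offers: (seller, price) pairs from the live Idealo download.
    ignored: union of the global ignore list and this EAN's ignore list.
    Returns the price to undercut, or None for no adjustment.
    """
    ranked = sorted(offers, key=lambda o: o[1])
    for seller, price in ranked[:2]:   # only the two cheapest are ever considered
        if seller not in ignored:
            return price
    return None                        # cheapest and second-cheapest both ignored

def undercut(own_price, offers, ignored, step=1.0):
    target = competition_target(offers, ignored)
    return own_price if target is None else min(own_price, target - step)
```

The `min` keeps the adjustment one-directional in this sketch: competition can pull a price down, never push it up. (The real system also has the minimum-markup floor from the bracket rules to respect.)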

The Supplier Data Problem

Eleven suppliers, eleven formats. Each cleaner function handles one supplier's quirks:

One supplier sends prices as strings with German decimal notation. The cleaner has to handle both 1595,68 (comma decimal) and values that are already formatted with periods. Another sends stock quantities as text. Their feed might say "Menge: 5 Stk" where I need the integer 5. A third uses a different column name for their SKU field. Each of these is a separate function because a generic normalizer would need so many conditional branches that it would be harder to debug than eleven small, purpose-built cleaners.
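Two of those cleaning steps as standalone helpers; the function names and exact feed strings are illustrative:

```python
import re

def parse_price(raw):
    """Handle German decimal commas ('1595,68') and already-clean values ('1595.68').
    Assumes this feed uses no thousands separators."""
    if isinstance(raw, (int, float)):
        return float(raw)
    text = raw.strip()
    return float(text.replace(",", ".") if "," in text else text)

def parse_stock(raw):
    """Pull the integer out of free-text quantities like 'Menge: 5 Stk'."""
    match = re.search(r"\d+", str(raw))
    return int(match.group()) if match else 0
```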

I was wrong about the cleaner architecture initially. I started with a single generic cleaner function that used a dictionary of format rules. After the third supplier, the dictionary was incomprehensible. Every new supplier added edge cases that broke the abstraction. Splitting into per-supplier functions felt like admitting defeat, but the codebase became dramatically easier to maintain. When one supplier changed their export format in January, I fixed one function in isolation. No risk of breaking any other supplier's pipeline.

The Result

Coverage went from 85% to 99.5%. The remaining 0.5% are products with no supplier match on either key, typically discontinued items or custom bundles. Processing time is under 7 minutes for the full 4,162-product catalog, down from the 90-minute manual process. The Shopify sync alone dropped from over 30 minutes (individual REST API calls) to about 6 minutes (bulk GraphQL mutations).

The two-pass approach costs almost nothing in compute. The pipeline runs twice, but the data is already in memory from the first pass. The second pass adds roughly 90 seconds to the total run time. For a system that runs on a schedule, not in response to user requests, 90 extra seconds is noise.

If I were starting over, I'd push the secondary SKU matching upstream. Instead of running the whole pipeline twice, I'd build a pre-processing step that creates a unified SKU mapping table before any pricing logic runs. The two-pass approach works, but it's a brute-force solution to what's really a data modeling problem.
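That pre-processing step would look something like this: resolve every product to a single feed SKU once, before any pricing logic runs. Field names are hypothetical:

```python
def build_sku_map(products, feed_skus):
    """products: dicts with 'product_sku' and an optional 'supplier_pn'.
    Returns {store SKU: feed SKU} for every resolvable product."""
    feed = set(feed_skus)
    mapping = {}
    for p in products:
        if p["product_sku"] in feed:          # direct hit: store SKU is the feed SKU
            mapping[p["product_sku"]] = p["product_sku"]
        elif p.get("supplier_pn") in feed:    # fall back to the metafield
            mapping[p["product_sku"]] = p["supplier_pn"]
    return mapping                            # unresolved products simply stay absent
```

With that table in hand, the pipeline runs once against a single, already-resolved key, and "unmatched" becomes a property of the mapping step rather than a residue of two passes.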

#sku-matching #data-processing #python #pricing-pipeline
