
Vendor Performance Intelligence Engine

8 min read · Kingsley Onoh

Architectural Brief: Vendor Performance Intelligence Engine

A Rails 8 multi-tenant scoring engine that turns behavioral signals from procurement systems into a single composite risk score per vendor per tenant. The scoring math is the easy part. The hard constraint is reproducibility: every alert email and every regenerated PDF report must be byte-identical six months after first delivery, even if the tenant renames itself, the vendor gets merged into another, or the scoring rule is retuned in between. The architecture exists to make that constraint physically impossible to violate.

System Topology

Infrastructure Decisions

Each bullet names the choice, the alternative considered, and the reason. No bullets without defense.

  • Compute: Docker on a single Hetzner VPS, Puma + Sidekiq. Chose over a managed PaaS because the engine runs on one node for a single tenant cluster and the operator is the same person doing the deploys; a managed PaaS adds vendor lock-in and per-dyno billing without solving any actual problem at this scale. A second region or active-passive HA isn't on the roadmap.
  • Data layer: PostgreSQL 16 with native declarative partitioning. Chose over pg_partman (which the PRD originally specified) because pg_partman requires a custom Postgres image and a Postgres extension to harden, while native PARTITION BY RANGE (recorded_at) plus a daily PartitionManagerJob that issues CREATE TABLE ... PARTITION OF covers the same operational ground without an extra dependency (a sketch of that job follows this list).
  • Background jobs: Sidekiq 7 on Redis 7. Chose over Solid Queue because Sidekiq's mature failure semantics (separate retry queue, dead set, sidekiq-cron scheduled jobs, and a stable web UI) matter for the long-running NATS consumer and the daily rescore. Solid Queue is fine for low-throughput jobs; the band-crossing path runs on every signal insert and needs the throughput floor.
  • UI: Hotwire (Turbo + Stimulus) + ViewComponent + Tailwind. Chose over a SPA because the operator surface is forms, tables, and turbo-frame replacements; a React-Router app would be 10x the JavaScript for the same operator workflow. ViewComponent gives per-component test isolation that ERB partials don't.
  • Authorization: ActionPolicy. Chose over Pundit because tenant scoping is enforced at the policy layer, not in controllers, and ActionPolicy's pre-checks integrate with Hotwire's frame rendering more cleanly than Pundit's resource-loading model. CanCanCan was rejected up front because of the global-scope ergonomics.
  • HTTP clients: Faraday 2 singletons in lib/ecosystem/. Chose over per-call client instantiation because each adapter hits a long-lived ecosystem service with retry middleware and a circuit breaker, and re-creating connections per call wastes the connection pool. Initialized in config/initializers/ecosystem_clients.rb with an at_exit hook for clean shutdown on SIGTERM (an adapter sketch follows this list).
  • Ingress validation: dry-validation 1.x. Chose over ActiveRecord validations for non-model inputs because ingestion payloads have a strict schema with sentinel rejection reasons (FUTURE, STALE, WINDOW_INVERTED, UNKNOWN_SIGNAL_CODE, VALUE_OUT_OF_RANGE, MISSING_VENDOR_REF) that need to round-trip into vendor_signals.rejection_reason. ActiveRecord validation errors don't translate to that contract cleanly (a contract sketch follows this list).
  • PDF rendering: WickedPDF (wkhtmltopdf). Chose over Prawn because the legal footer for compliance-grade reports requires standard browser CSS, not a Ruby DSL, and operators want to author templates in HTML. The known cost: wkhtmltopdf embeds a creation timestamp + random object IDs in every render, so byte-equality across re-renders is impossible (CSV outputs are bytewise equal; PDFs use pdf-reader text extraction for content equality, sketched after this list).
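
A minimal sketch of the daily partition job named in the data-layer bullet. The job and table names come from the brief; the partition naming scheme, the one-month look-ahead, and the exact SQL shape are illustrative assumptions.

```ruby
# app/jobs/partition_manager_job.rb -- illustrative sketch, not the production code.
# Creates next month's vendor_signals partition ahead of time so inserts never
# land on a missing partition.
class PartitionManagerJob
  include Sidekiq::Job

  def perform
    from = Date.current.next_month.beginning_of_month
    to   = from.next_month
    name = "vendor_signals_y#{from.year}m#{format('%02d', from.month)}"

    ActiveRecord::Base.connection.execute(<<~SQL)
      CREATE TABLE IF NOT EXISTS #{name}
        PARTITION OF vendor_signals
        FOR VALUES FROM ('#{from}') TO ('#{to}');
    SQL
  end
end
```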
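One of the lib/ecosystem/ adapters could look roughly like this. The module name, env var, and retry options are assumptions; the circuit-breaker middleware is elided because the brief doesn't name the gem.

```ruby
# lib/ecosystem/notification_hub_client.rb -- illustrative sketch.
require "faraday"
require "faraday/retry" # faraday-retry gem registers the :retry request middleware

module Ecosystem
  module NotificationHubClient
    # Memoized, long-lived connection shared by all callers.
    def self.connection
      @connection ||= Faraday.new(
        url: ENV.fetch("NOTIFICATION_HUB_URL"),
        request: { timeout: 5, open_timeout: 2 }
      ) do |f|
        f.request :json
        f.request :retry, max: 3, interval: 0.25, backoff_factor: 2,
                          retry_statuses: [429, 502, 503]
        # circuit-breaker middleware would be registered here
        f.response :json
      end
    end
  end
end
```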
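The ingestion contract might be shaped like this under dry-validation 1.x. The sentinel failure codes are the ones listed above; the field names, the 90-day staleness window, the value range, and the SignalCatalog helper are illustrative assumptions.

```ruby
# app/contracts/signal_ingestion_contract.rb -- illustrative sketch.
class SignalIngestionContract < Dry::Validation::Contract
  params do
    required(:event_id).filled(:string)
    required(:signal_code).filled(:string)
    required(:recorded_at).filled(:time)
    required(:value).filled(:float)
    required(:vendor_ref).hash
  end

  rule(:recorded_at) do
    key.failure("FUTURE") if value > Time.current
    key.failure("STALE")  if value < 90.days.ago   # assumed staleness window
  end

  rule(:signal_code) do
    key.failure("UNKNOWN_SIGNAL_CODE") unless SignalCatalog.known?(value) # assumed helper
  end

  rule(:value) do
    key.failure("VALUE_OUT_OF_RANGE") unless (0.0..100.0).cover?(value)   # assumed range
  end
end
```

The ingestion path can then copy the first failure text straight into vendor_signals.rejection_reason, which is what lets the sentinel codes round-trip.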
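And the content-equality check for PDF re-renders could be specced roughly like this; ReportRenderer and report_run are assumed names, while pdf-reader text extraction is the mechanism the bullet describes.

```ruby
# spec/reports/render_stability_spec.rb -- illustrative sketch.
require "pdf-reader"

RSpec.describe "report re-render stability" do
  def pdf_text(bytes)
    PDF::Reader.new(StringIO.new(bytes)).pages.map(&:text).join("\n")
  end

  it "re-renders identical content from the frozen render_context" do
    # report_run: an assumed persisted fixture for an already-delivered report.
    first  = ReportRenderer.render(report_run) # assumed API, returns PDF bytes
    second = ReportRenderer.render(report_run)

    # Bytes differ (creation timestamp, object IDs); extracted text must not.
    expect(pdf_text(second)).to eq(pdf_text(first))
  end
end
```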

Constraints That Shaped the Design

  • Input: five upstream signal sources (Invoice Reconciliation, Contract Lifecycle, Webhook Engine, Transaction Reconciliation, manual REST), one inbound HMAC-verified Hub fanout endpoint, and a CSV/JSON manual upload path. Every source produces an event_id for dedup, a vendor_ref block for resolution, and a signal_code from the 20-row catalog.
  • Output: a 0-100 composite score per vendor per tenant, a four-band classification (low/medium/high/critical), a top-five contributors block, a band-crossing event published to the notification hub, and four scheduled report types with byte-stable re-render guarantees.
  • Scale handled: monthly partitions comfortably hold ~10M signal rows each. Composite scoring p95 sits at 132ms against a 500ms budget; band-to-Hub p95 at 511ms against a 2,000ms budget; reports finish under 2.6 seconds against a 30-second budget. Above ~10M signals per month per tenant, the partition cadence drops to weekly and the daily rescore needs work-stealing across Sidekiq queues.
  • Hard constraints: alert history and report renders must be reproducible at any point in the future even if the underlying tenant or vendor rows have been mutated. The Hub ingress endpoint must reject any payload that fails HMAC verification, using a constant-time comparison (a verification sketch follows this list). Tenant isolation must be enforced at every query and every job, and every cross-tenant request must return 404 (not 403, which leaks existence).
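
A minimal sketch of how that constant-time HMAC check could sit in front of the Hub ingress controller. The header name, the shared-secret env var, and the concern name are assumptions; OpenSSL::HMAC and ActiveSupport::SecurityUtils.secure_compare are the standard Ruby/Rails primitives for this.

```ruby
# app/controllers/concerns/hub_hmac_verification.rb -- illustrative sketch.
module HubHmacVerification
  extend ActiveSupport::Concern

  included do
    before_action :verify_hub_signature!
  end

  private

  def verify_hub_signature!
    expected  = OpenSSL::HMAC.hexdigest("SHA256", ENV.fetch("HUB_SHARED_SECRET"), request.raw_post)
    presented = request.headers["X-Hub-Signature"].to_s

    # Constant-time comparison; reject without revealing why.
    head :unauthorized unless ActiveSupport::SecurityUtils.secure_compare(expected, presented)
  end
end
```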

Decision Log

Decisions not already covered above. Operational, testing, scaling, data modeling, and deployment choices that an architect evaluating the system will want to interrogate.

  • Decision: Append-only vendor_signals enforced by a Postgres trigger. Rejected alternative: an application-layer guard only. Why: the model layer can be bypassed by a rake task or a one-off psql session. The DB trigger blocks UPDATE on every column except status along legal transitions (raw to normalized, normalized to scored, normalized to superseded) and rejects all DELETEs; a trigger sketch follows this log. Defense in depth, not redundancy.
  • Decision: Frozen JSONB delivery_payload and render_context columns. Rejected alternative: re-querying tenants/vendors at dispatch and render time. Why: a Sidekiq retry can fire days after the alert was created. If the tenant renames itself or the vendor gets merged in between, re-querying produces a different email body. The dispatcher reads the column and only the column.
  • Decision: Recursively frozen Ruby Hashes in memory. Rejected alternative: top-level .freeze only. Why: a Sidekiq job that loads delivery_payload from JSONB into a Hash can mutate any nested Hash or String even if the outer object is frozen. The capture path walks the structure and freezes Hashes, Arrays, and Strings at every level so the in-memory copy is also immutable; a deep-freeze sketch follows this log.
  • Decision: Liquid templates rendered with strict_variables: true. Rejected alternative: lenient rendering with empty-string fallbacks. Why: a typo in a template token ({{ tenant.legal_nme }}) silently emits an empty string under lenient rendering. Strict mode raises during the test, not in production. CI runs every Hub template against ≥2 fixture tenants and asserts no cross-tenant leakage; a spec sketch follows this log.
  • Decision: Two-tenant fixture mandate (acme-gmbh-de + globex-inc-us). Rejected alternative: single-tenant test fixtures. Why: every template binding test runs against both tenants with intentionally different legal_name, address.country_code, locale (de-DE vs en-US), timezone (Europe/Berlin vs America/New_York), and brand colors. A template that hardcodes "Acme GmbH" passes with a single fixture and ships to production. The two-tenant gate catches cross-tenant leakage at RED.
  • Decision: Native Postgres declarative partitioning with monthly partitions. Rejected alternative: a time-series database (TimescaleDB). Why: TimescaleDB would solve the partitioning problem at the cost of an entire DB extension to harden and a separate set of operational primitives. The signal volume is well within native Postgres territory; the daily PartitionManagerJob is 30 lines.
  • Decision: Standalone-first feature flags on every ecosystem connection. Rejected alternative: required ecosystem dependencies. Why: a customer cloning the repo and running docker compose up gets a fully functional product with NOTIFICATION_HUB_ENABLED=false, WORKFLOW_ENGINE_ENABLED=false, and so on. The core scoring loop accepts manual POST /api/signals and runs without any sibling service.
  • Decision: Tenant resolution via 12-char api_key_prefix lookup + constant-time SHA-256 compare. Rejected alternative: full-hash lookup per request. Why: a constant-time comparison over the full hash is fine, but it requires a full table scan or a hash index, neither of which scales gracefully under concurrent registration. The 12-char prefix gives a partial-index seek to the candidate row, then a constant-time SHA-256 compare confirms or rejects. Cached through Cache::TenantCache; a resolution sketch follows this log.
  • Decision: Per-tenant scoring_rules rows with one is_active=true at a time. Rejected alternative: global scoring config. Why: one CPO weights financial signals at 50 percent; another weights contractual signals higher because they operate in regulated industries. The active rule is a tenant-scoped row with category weights, signal weight overrides, band thresholds, window days, and time-decay half-life. Activating a clone atomically deactivates the prior rule via ScoringRule#deactivate_sibling_if_activating.
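
A sketch of what the append-only guard could look like as a Rails migration. The legal status transitions are the ones named in the log; the function and trigger names, and the jsonb diff used to detect non-status changes, are illustrative assumptions.

```ruby
# db/migrate/..._add_vendor_signals_append_only_guard.rb -- illustrative sketch.
class AddVendorSignalsAppendOnlyGuard < ActiveRecord::Migration[8.0]
  def up
    execute <<~SQL
      CREATE OR REPLACE FUNCTION vendor_signals_append_only() RETURNS trigger AS $$
      BEGIN
        IF TG_OP = 'DELETE' THEN
          RAISE EXCEPTION 'vendor_signals is append-only';
        END IF;

        -- Only status may change.
        IF to_jsonb(NEW) - 'status' IS DISTINCT FROM to_jsonb(OLD) - 'status' THEN
          RAISE EXCEPTION 'vendor_signals rows are immutable except for status';
        END IF;

        -- And only along the legal transitions.
        IF NOT (
          (OLD.status = 'raw'        AND NEW.status = 'normalized') OR
          (OLD.status = 'normalized' AND NEW.status IN ('scored', 'superseded')) OR
          (OLD.status = NEW.status)
        ) THEN
          RAISE EXCEPTION 'illegal status transition % -> %', OLD.status, NEW.status;
        END IF;

        RETURN NEW;
      END;
      $$ LANGUAGE plpgsql;

      CREATE TRIGGER vendor_signals_append_only_guard
        BEFORE UPDATE OR DELETE ON vendor_signals
        FOR EACH ROW EXECUTE FUNCTION vendor_signals_append_only();
    SQL
  end

  def down
    execute "DROP TRIGGER vendor_signals_append_only_guard ON vendor_signals;"
    execute "DROP FUNCTION vendor_signals_append_only();"
  end
end
```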
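The recursive freeze is a few lines of Ruby; this sketch assumes a small standalone module (the module name is an assumption).

```ruby
# lib/deep_freeze.rb -- illustrative sketch.
module DeepFreeze
  module_function

  # Recursively freezes Hashes, Arrays, Strings, and anything else it touches.
  def call(object)
    case object
    when Hash  then object.each { |k, v| call(k); call(v) }
    when Array then object.each { |v| call(v) }
    end
    object.freeze
  end
end

payload = DeepFreeze.call({ "tenant" => { "legal_name" => "Acme GmbH" } })
payload["tenant"]["legal_name"] << " (renamed)" # => raises FrozenError
```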
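The template gate could be specced along these lines. The fixture slugs and strict_variables come from the log; the hub_template_sources and template_assigns_for helpers are assumptions.

```ruby
# spec/templates/hub_template_binding_spec.rb -- illustrative sketch.
require "liquid"

RSpec.describe "Hub template bindings" do
  it "renders strictly for both fixture tenants without cross-tenant leakage" do
    hub_template_sources.each do |source|                     # assumed helper
      acme, globex = %w[acme-gmbh-de globex-inc-us].map do |slug|
        template = Liquid::Template.parse(source)
        output   = template.render(template_assigns_for(slug), strict_variables: true) # assumed helper

        # A typo like {{ tenant.legal_nme }} shows up here instead of rendering "".
        expect(template.errors).to be_empty
        output
      end

      expect(acme).not_to include("Globex")
      expect(globex).not_to include("Acme")
    end
  end
end
```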
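And the prefix-then-digest resolution could be shaped like this. Cache::TenantCache and the 12-character prefix come from the log; the fetch signature and the api_key_prefix/api_key_digest column names are assumptions.

```ruby
# app/services/tenant_resolver.rb -- illustrative sketch.
require "digest"

class TenantResolver
  PREFIX_LENGTH = 12

  def self.call(raw_api_key)
    prefix    = raw_api_key.to_s[0, PREFIX_LENGTH]
    candidate = Cache::TenantCache.fetch(prefix) do
      Tenant.find_by(api_key_prefix: prefix) # partial-index seek to one candidate row
    end
    return nil unless candidate

    digest = Digest::SHA256.hexdigest(raw_api_key)
    candidate if ActiveSupport::SecurityUtils.secure_compare(digest, candidate.api_key_digest)
  end
end
```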
#ruby#rails#postgresql#sidekiq#hotwire#multi-tenant#partitioning
