
Invoice Reconciliation Engine

8 min read · Kingsley Onoh

Architectural Brief: Invoice Reconciliation Engine

Three documents must reconcile: the purchase order, the goods receipt, and the vendor invoice. Each is written at a different time, each has its own format, and none of them agree exactly. The matching engine's entire job is deciding which differences are acceptable and which ones are not. Everything in the architecture below serves that decision.

System Topology

Infrastructure Decisions

  • Language: Kotlin 2.0. Chose over Java because coroutines + data classes + null safety model financial entities precisely, and over Go because the JVM ecosystem (Exposed, Flyway, Testcontainers) is still the mature path for schema-heavy transactional workloads. The portfolio also had no JVM language on it before this project, which matters for answering "what do you pick when the job wants JVM."
  • Framework: Ktor 2.x. Chose over Spring Boot because Ktor's plugin phase model maps 1:1 to feature-flagged ecosystem integrations. Each external client (Notification Hub, Workflow Engine, Recon Engine, RAG, NATS) turns on at runtime via an *_ENABLED env var. Spring's convention overhead costs more than it saves for a focused service where every dependency is already explicit.
  • ORM: Exposed DSL mode, not DAO. Chose over Hibernate/JPA because the matching queries need direct SQL control: joins across invoice lines, PO lines, and receipt lines, every query scoped by tenant_id, with trigram-backed fuzzy lookups. ORMs obscure the layer where performance work actually happens. Exposed stays close to SQL while keeping Kotlin's type-safe query builder.
  • Database: PostgreSQL 16 with pg_trgm. Chose over MongoDB because the reconciliation domain is inherently relational: vendors own POs, POs own lines, receipts reference PO lines, invoice lines match to PO lines. Every interesting query is a join. Trigram indexes give the fuzzy vendor-name lookups for free, no separate search service.
  • Cache / lock: Redis 7. Chose over in-process locks because matching runs must be serialized per tenant. Two concurrent runs overwriting the same invoice's status would corrupt the audit trail. Redis gives a distributed lock that works whether the service runs on one node or five.
  • Broker: NATS JetStream. Chose over Kafka because the only cross-service event emitted is invoice.discrepancy.detected, consumed by the Compliance Ledger. Kafka's operational weight (Zookeeper or KRaft, topic partitioning, consumer groups) is designed for volumes this engine will not produce. NATS fits the one-subject-one-consumer shape exactly.
  • Money representation: BIGINT integer cents everywhere, never NUMERIC. Chose over DECIMAL/BigDecimal storage because cents remove the "which rounding mode?" question from every operation. €199.99 is 19999L. Addition, percentage checks, and tolerance math are pure integer arithmetic. BigDecimal is used only at the API boundary (the AmountUtils.toCents/fromCents conversions) and for fractional quantities (NUMERIC(12,4) on PO lines, since kg and liters exist).
  • Webhook auth: per-tenant HMAC secrets, not one shared secret. Chose over a global WEBHOOK_HANDLER_SECRET because one secret is one blast radius. POST /api/webhook-handler/{tenantId} reads the secret from that tenant's settings.webhook_secret JSONB, verifies with MessageDigest.isEqual for constant-time comparison, and falls back to the global secret only if a tenant has not configured their own. Rotating one tenant's key does not affect any other tenant.
  • Deployment: single Docker container on DigitalOcean VPS behind Traefik. Chose over Kubernetes because this service does not need orchestration yet. One container, one database, one Redis, one NATS, all on one box, pulled automatically by Watchtower when a new image lands in GHCR. The same VPS hosts several other portfolio services; the margin is sharing infrastructure, not running a cluster.
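The integer-cents decision above can be made concrete with a small sketch. The `AmountUtils.toCents`/`fromCents` names come from the brief, but their signatures, the `withinTolerance` helper, and the 2% tolerance figure are assumptions for illustration:

```kotlin
import java.math.BigDecimal
import java.math.RoundingMode
import kotlin.math.abs

// Sketch of the boundary conversions the brief attributes to AmountUtils.
// Signatures are assumptions; the actual class may differ.
object AmountUtils {
    // Parse a decimal amount at the API boundary into integer cents.
    fun toCents(amount: BigDecimal): Long =
        amount.setScale(2, RoundingMode.HALF_UP).movePointRight(2).longValueExact()

    // Format integer cents back into a two-decimal amount for API responses.
    fun fromCents(cents: Long): BigDecimal =
        BigDecimal(cents).movePointLeft(2).setScale(2)
}

// Tolerance math stays pure integer arithmetic: no rounding-mode question.
// toleranceBps is basis points, so 200 = 2% (a hypothetical default).
fun withinTolerance(invoiceCents: Long, poCents: Long, toleranceBps: Long = 200): Boolean {
    val diff = abs(invoiceCents - poCents)
    // diff / poCents <= toleranceBps / 10_000, rearranged to avoid division.
    return diff * 10_000 <= poCents * toleranceBps
}
```

Because everything after the boundary is a `Long`, equality checks and percentage tolerances never depend on `BigDecimal` scale or rounding configuration.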
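The per-tenant webhook verification can be sketched as follows. Only `MessageDigest.isEqual` and HMAC verification are from the brief; the function names, hex encoding, and the use of HMAC-SHA256 specifically are assumptions:

```kotlin
import java.security.MessageDigest
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

// Hypothetical helper: compute the hex-encoded HMAC-SHA256 of a request body.
fun hmacSha256Hex(body: ByteArray, secret: String): String {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(secret.toByteArray(Charsets.UTF_8), "HmacSHA256"))
    return mac.doFinal(body).joinToString("") { "%02x".format(it) }
}

// Verify a webhook signature against the tenant's secret.
fun verifySignature(body: ByteArray, signatureHex: String, tenantSecret: String): Boolean {
    val expected = hmacSha256Hex(body, tenantSecret).toByteArray()
    // MessageDigest.isEqual runs in constant time, so response latency does
    // not leak how many leading bytes of the signature matched.
    return MessageDigest.isEqual(expected, signatureHex.toByteArray())
}
```

The constant-time comparison is the load-bearing detail: a naive `==` on the hex strings short-circuits at the first mismatched character, which an attacker can measure.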

Constraints That Shaped the Design

  • Input: invoices arrive three ways: REST API (POST /api/invoices), CSV import (POST /api/invoices/import/csv), or accounting-platform webhooks (QuickBooks, Xero) via the Webhook Ingestion Engine's fan-out. Each source hits the same InvoiceService.create() after normalization, so matching logic never branches on source.
  • Output: a decision per invoice: approved (auto-paid), pending approval in standard or escalated queue, or rejected. Plus zero or more discrepancies with type + severity, plus an immutable audit_log entry for the decision, plus an event on NATS if any discrepancy is high or critical.
  • Scale handled: a batch of 100 invoices × 10 lines each completes matching in under 15 seconds on a 2-vCPU VPS. The MatchingEngine caps each run at 100 invoices by default (DEFAULT_MAX_INVOICES); runs above that are split by the scheduler, which fires every 30 minutes by default.
  • Hard constraints:
    • Every query must include WHERE tenant_id = :tid. No exceptions, enforced at the repository layer. A cross-tenant leak is a trust-destroying bug, not a correctness bug.
    • audit_log is append-only. No UPDATE or DELETE is ever issued against it. This is the row the auditor reads.
    • Ecosystem HTTP calls go through a circuit breaker (CircuitBreaker.kt): five consecutive failures trip the breaker, which rejects calls for 60 seconds before a single half-open probe. A downstream outage cannot take down the matching pipeline.
    • Discrepancy events fire on the NATS consumer side, not inside the request. A supervisorScope in MatchingPostProcessor isolates one bad invoice from the rest of a batch.
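The breaker behavior described above (five consecutive failures, 60-second open window, then a half-open probe) can be sketched in a few lines. Class and method names are assumptions; the real `CircuitBreaker.kt` may differ, and in particular this sketch allows repeated probes until a result is recorded rather than strictly one:

```kotlin
import java.time.Clock
import java.time.Duration
import java.time.Instant

// Minimal sketch of a count-based circuit breaker.
class CircuitBreaker(
    private val failureThreshold: Int = 5,
    private val openDuration: Duration = Duration.ofSeconds(60),
    private val clock: Clock = Clock.systemUTC(),
) {
    private var consecutiveFailures = 0
    private var openedAt: Instant? = null

    // Closed: allow. Open: reject until openDuration elapses, then permit a probe.
    fun allowRequest(): Boolean {
        val opened = openedAt ?: return true
        return Duration.between(opened, clock.instant()) >= openDuration
    }

    // Any success closes the breaker and clears the failure streak.
    fun recordSuccess() { consecutiveFailures = 0; openedAt = null }

    // Five failures in a row trip the breaker open.
    fun recordFailure() {
        consecutiveFailures++
        if (consecutiveFailures >= failureThreshold) openedAt = clock.instant()
    }
}
```

Wrapping every ecosystem HTTP call in `allowRequest()` is what keeps a downstream outage from turning into a pile of threads waiting on timeouts inside the matching pipeline.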

Decision Log

  • Weighted confidence score (PO·30 + line·30 + receipt·20 + price·20) with breakdown stored in JSONB — rejected alternative: a binary matched/unmatched flag. A finance reviewer looking at a 0.73 confidence needs to know which component dragged the score down. The JSONB breakdown lets them see {po_match: 0.65, line_match: 0.92, receipt_match: 1.0, price_match: 0.35} and know instantly that the price is the problem, not the match. The operator tunes remediation on the right axis.
  • Flyway migrations, not Exposed schema sync — rejected alternative: Exposed SchemaUtils.create() at startup. Financial data migrations have to be reviewed, versioned, and reversible. Flyway gives V001 through V010 as auditable SQL files in db/migration/. Exposed's schema sync is fine for prototypes and dangerous for a table with an immutable audit log.
  • Two auto-approve paths: low amount (< €100 default) OR high confidence (≥ 0.95) with zero discrepancies — rejected alternative: a single confidence threshold. Small-dollar invoices should not route through a human queue even if the match is noisy; the review cost exceeds the discrepancy cost. High-confidence invoices should not wait for review even if the amount is big. Two independent paths capture both cases without letting one override the other.
  • Self-registration disabled by default (SELF_REGISTRATION_ENABLED=false) — rejected alternative: an open POST /api/tenants/register. The endpoint that creates a new tenant with a fresh API key is a self-service account factory; in production it is an attack surface. Dev environments flip the flag to true for quickstart, and it must be back at false before deployment.
  • Ecosystem integrations feature-flagged, circuit-broken, and silently skipped when disabled — rejected alternative: hard dependencies on the ecosystem services. The engine must run standalone. If the Notification Hub is down, invoices still match. If the RAG Platform is disabled, no documents get pushed but everything else works. Every external call checks the flag first, then the breaker, and swallows the failure at the edge.
  • Ktor's status pages, content negotiation, and request ID correlation installed globally — rejected alternative: per-route error handling. Every response should carry the same error envelope ({error: {code, message, details}}) and the same X-Request-Id for trace correlation. Global plugins enforce the shape; per-route error handling drifts across modules within a week.
  • Matching caps at 100 invoices per run (MAX_INVOICES_PER_RUN=100) — rejected alternative: processing everything queued. A run that tries to match 10,000 invoices holds a distributed lock, spins up thousands of PO queries, and fails halfway through. Batching lets the scheduler come back in 30 minutes and pick up what's left. Partial progress is better than a long lock.
  • Rate limiter moved from the Call phase to the Plugins phase (post-auth) — rejected alternative: rate limiting at the Call phase (pre-auth). Rate limiting anonymous traffic burns tenant budget on requests that never reach the tenant. Installing the limiter after auth means each tenant's bucket counts only authenticated requests; anonymous callers hit a separate IP-based bucket with a much lower ceiling.
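The weighted score from the first decision can be sketched as a data class. The 30/30/20/20 weights and the JSONB field names are from the decision log; the class name, `weakestComponent` helper, and the 0.0–1.0 component range are assumptions (note that the example breakdown above works out to roughly 0.74):

```kotlin
// Sketch of the confidence breakdown stored in JSONB, assuming each
// component score is normalized to 0.0..1.0.
data class MatchBreakdown(
    val poMatch: Double,
    val lineMatch: Double,
    val receiptMatch: Double,
    val priceMatch: Double,
) {
    // Weighted confidence: PO 30%, line 30%, receipt 20%, price 20%.
    val confidence: Double
        get() = poMatch * 0.30 + lineMatch * 0.30 + receiptMatch * 0.20 + priceMatch * 0.20

    // The component dragging the score down — what the reviewer acts on.
    fun weakestComponent(): String = listOf(
        "po_match" to poMatch,
        "line_match" to lineMatch,
        "receipt_match" to receiptMatch,
        "price_match" to priceMatch,
    ).minByOrNull { it.second }!!.first
}
```

Storing the four components rather than only the final number is what makes the score actionable: the reviewer sees which axis failed, not just that something did.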
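The two independent auto-approve paths reduce to one boolean expression. The €100 (10,000 cents) and 0.95 thresholds are the defaults named in the decision log; the function name and parameters are hypothetical:

```kotlin
// Sketch of the two auto-approve paths: either path alone is sufficient,
// and neither overrides the other.
fun autoApprove(totalCents: Long, confidence: Double, discrepancyCount: Int): Boolean {
    // Small-dollar path: review cost exceeds discrepancy cost, even on a noisy match.
    val lowAmount = totalCents < 10_000L
    // Clean-match path: high confidence with zero discrepancies skips the queue.
    val highConfidence = confidence >= 0.95 && discrepancyCount == 0
    return lowAmount || highConfidence
}
```

A single combined threshold would have to trade these cases off against each other; keeping them as separate predicates means tightening one path never loosens the other.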
#kotlin #ktor #postgresql #nats #invoice-matching #fintech
