
Why I Refused to Re-Query the Tenant Row at Alert Dispatch Time

From the Vendor Performance Intelligence Engine system

9 min read · Kingsley Onoh

What does an audit-grade risk alert look like six months after the band crossing it describes, when the tenant has renamed itself twice in the interim, when the vendor has been merged into another, when the scoring rule has been retuned three times since? In a system that re-queries the source rows at dispatch time, the answer is whatever is true now. In this system, the answer is whatever was true at 14:03 on the Tuesday the alert was written.

The concrete shape of the problem: a risk alert is created at 14:03 Tuesday for a vendor crossing from medium to high. The Notification Hub is down for emergency Postgres maintenance. The dispatcher falls back to the failed-alert retry queue and tries again every 30 minutes. By Friday afternoon, when the Hub is back, the email body must still read Acme GmbH, the legal name that was current Tuesday, not the post-Wednesday rebrand to Acme Industrial GmbH that the operator entered through the settings UI on Wednesday morning. A live re-query of the tenants row at send time would emit the new name and reference an event at a company that didn't exist when it happened.

That's the constraint this whole architecture is designed around: not the failure mode I worried about most, but the one I knew would happen and didn't have a clean answer to.

The Real Problem

Most engineers I've talked to assume the issue is performance. They look at a table with a JSONB column called delivery_payload holding a fully-rendered tenant snapshot, and they ask why I'd denormalize. The tenants row has 11 identity columns. Why serialize them into JSON when I could just SELECT them again at dispatch time?

Performance has nothing to do with it. The dispatcher is fast either way. What matters is that an alert is a fact about what was true at a moment in time, and a re-query produces a fact about what is true now. Those two facts are not interchangeable. An auditor reading the alert ledger six months later wants to know what the operator saw on Tuesday afternoon, not what the system shows today.

Same problem on the report side. A vendor scorecard PDF generated in March needs to reprint byte-for-byte (or close enough to it) when an auditor downloads it again in September. If the underlying tenant or vendor row has been mutated in between, a live re-query produces a different document. The PDF that was approved by procurement leadership in March no longer exists on disk. It exists in the operator's email and nowhere else.

The mistake is one of category. The system was treating mutable rows as the source of truth for an immutable record. SQL was doing exactly what SQL does: returning the current value of the column. The error was in the architecture asking the question that way at all.

The Constraints

Three things made this hard.

Tenant identity is mutable by design. A CPO can update their legal name, registration number, address, and brand colors at any time through the settings UI. We don't lock fields after creation; that would be operationally hostile to the business reality of mergers and rebrands. So the row is the live identity, full stop.

Vendors get merged. When the operator confirms that two vendor_aliases rows point at the same legal entity, the system collapses them and the surviving vendor inherits the signals from the merged one. If an alert fires referring to "Vendor A" and that vendor gets merged into "Vendor B" three weeks later, a re-query at dispatch time would emit "Vendor B" in the email body. That isn't just confusing; it's wrong. The alert was about A.

Retries can run for days. The notification hub has a circuit breaker (5-failure rolling window, 60-second cooldown). If the Hub is genuinely unhealthy for a sustained period, the failed-alert retry job picks alerts back up every 30 minutes and tries again. There is no upper bound. I have seen alerts retry across 72-hour outages. Anything the dispatcher reads at send time can have moved arbitrarily far from where it was at create time.
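The breaker's failure-window behavior can be sketched in a few lines. This is a toy model, not the production Hub client: the class name, the injectable clock, and the exact bookkeeping are all illustrative; only the 5-failure rolling window and 60-second cooldown come from the system described above.

```ruby
# Toy sketch of a circuit breaker with a 5-failure rolling window and a
# 60-second cooldown. Illustrative only; not the production dispatcher code.
class CircuitBreaker
  def initialize(threshold: 5, window: 60, cooldown: 60, clock: -> { Time.now })
    @threshold = threshold
    @window = window
    @cooldown = cooldown
    @clock = clock
    @failures = []   # timestamps of recent failures
    @opened_at = nil
  end

  # Open means "don't call the Hub"; it closes again after the cooldown.
  def open?
    now = @clock.call
    return true if @opened_at && now - @opened_at < @cooldown
    @opened_at = nil
    false
  end

  def record_failure
    now = @clock.call
    @failures << now
    @failures.reject! { |ts| now - ts > @window }  # keep the rolling window
    @opened_at = now if @failures.size >= @threshold
  end

  def record_success
    @failures.clear
  end
end
```

The clock is injected so the cooldown can be tested without sleeping, which is also roughly how the 30-minute retry job can be exercised in CI.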

The obvious moves were all bad. Locking the tenant row after first alert? Operationally hostile. Versioning every field on the tenant table with effective-date ranges? A second source of truth and a permanent maintenance burden. Re-creating the alert if the tenant changed? You can't, because the alert is a record of an event that already happened.

The Design

The architecture has two halves: storage immutability and rendering immutability. Both are necessary. Either alone is insufficient.

On the storage side, every alert carries a delivery_payload JSONB column populated at insertion. The capture path lives in lib/alerts/capture_payload.rb and is called from the alert dispatcher exactly once, at the moment the risk_alerts row is INSERTed:

# lib/alerts/capture_payload.rb
def call(vendor_score:)
  tenant = vendor_score.tenant
  vendor = vendor_score.vendor
  tenant_snapshot = Tenants::CaptureSnapshot.call(tenant.id)

  # event_type, previous_composite, previous_band, and direction are
  # computed earlier in the capture flow (elided here for brevity).
  payload = {
    event_type: event_type,
    tenant: tenant_snapshot,
    vendor: vendor_block(vendor, tenant_snapshot),
    score: score_block(vendor_score, previous_composite, previous_band, direction),
    top_contributors: top_contributors_block(vendor_score),
    deep_links: deep_links_block(vendor),
    created_at: Time.now.utc.iso8601
  }

  deep_freeze(payload)
end

Tenants::CaptureSnapshot is the single canonical builder for the tenant identity block. Eleven columns from the tenants row plus a snapshot_at timestamp, returned as a frozen Hash. The shape is locked. Adding a column to tenants does not automatically add it to the snapshot. Every addition is a deliberate change to the snapshot, the templates that bind to it, and the test fixtures.
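The shape of that builder, reduced to a standalone sketch. The column list below is hypothetical (the real tenants table has 11 identity columns, not all shown in this essay); what matters is the pattern: an explicit allowlist of fields, a snapshot_at timestamp, and a frozen return value. For the sketch it takes the record directly rather than an id.

```ruby
# Illustrative sketch in the spirit of Tenants::CaptureSnapshot.
# FIELDS is a hypothetical subset of the 11 identity columns; the real
# builder looks the record up by id. The locked allowlist is the point:
# adding a tenants column does NOT automatically widen the snapshot.
module Tenants
  module CaptureSnapshot
    FIELDS = %i[legal_name registration_number tax_id
                address_line1 city country contact_email].freeze

    def self.call(tenant)
      snapshot = FIELDS.each_with_object({}) do |field, hash|
        # dup before freezing so the caller's live strings stay mutable
        hash[field] = tenant.public_send(field).to_s.dup.freeze
      end
      snapshot[:snapshot_at] = Time.now.utc.iso8601.freeze
      snapshot.freeze
    end
  end
end
```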

That's the first half. The dispatcher (Alerts::HubDispatchJob) then has exactly one rule: read alert.delivery_payload and never query tenants, vendors, or vendor_scores again. The job's class doc is a paragraph explaining this. The integration test fires an alert, mutates the underlying tenant row, then runs the dispatcher and asserts the emitted Hub event still contains the original literal values. Without that test, a casual refactor could re-introduce a Tenant.find(alert.tenant_id) call in the dispatcher, and the regression would only show up in production after the first sustained Hub outage.
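The essence of that integration test, boiled down to plain Ruby (the real test uses the Rails models and the Hub client, none of which are shown here; the hash-as-table is a stand-in):

```ruby
# Minimal model of the dispatcher contract: once captured, the payload is
# the only thing the dispatcher reads. Mutating the "live" tenant row
# after capture must not change what gets emitted.
tenants = { 1 => { legal_name: "Acme GmbH" } }   # the mutable "table"

payload = { tenant: tenants[1].dup }.freeze       # captured at alert time

tenants[1][:legal_name] = "Acme Industrial GmbH"  # Wednesday's rebrand

emitted = payload[:tenant][:legal_name]           # dispatcher reads payload only
```

A dispatcher that instead did the equivalent of tenants[1][:legal_name] at send time would emit the rebranded name, which is exactly the regression the test guards against.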

The second half is in-memory immutability. A frozen top-level Hash doesn't prevent a careless caller from mutating a nested Hash:

# On a Hash frozen only at the top level, this still works:
payload[:tenant][:legal_name] = "Sneaky GmbH"
# FrozenError is raised only if every nested level is also frozen.

So the capture path walks the structure recursively:

def deep_freeze(obj)
  case obj
  when Hash
    obj.each_value { |v| deep_freeze(v) }  # freeze nested values first
    obj.freeze
  when Array
    obj.each { |v| deep_freeze(v) }
    obj.freeze
  when String
    obj.freeze
  else
    obj  # symbols, numbers, nil, and booleans are already immutable
  end
end

A Sidekiq worker that loads the JSONB column gets a fresh Ruby Hash from JSON.parse, which is mutable, but the in-memory structure built by CapturePayload.call is physically immutable at every level the moment it leaves the method. That removes a class of bugs where a logging middleware or a serializer transformation accidentally mutates the snapshot in flight before it's emitted to the Hub.
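Both halves of that claim can be demonstrated in a few lines, using the deep_freeze above (repeated here so the snippet runs standalone):

```ruby
require "json"

# Same recursive freeze as above, repeated so this snippet is standalone.
def deep_freeze(obj)
  case obj
  when Hash   then obj.each_value { |v| deep_freeze(v) }; obj.freeze
  when Array  then obj.each { |v| deep_freeze(v) }; obj.freeze
  when String then obj.freeze
  else obj
  end
end

payload = deep_freeze({ tenant: { legal_name: "Acme GmbH" } })

# Nested mutation now raises, unlike with a top-level-only freeze.
nested_mutation_raised =
  begin
    payload[:tenant][:legal_name] = "Sneaky GmbH"
    false
  rescue FrozenError
    true
  end

# A JSON round-trip (as when a worker reloads the JSONB column) yields a
# fresh, fully mutable structure with the same values.
reloaded = JSON.parse(JSON.generate(payload))
```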

The rendering side is the same idea applied to templates. Every Hub Liquid template registers with strict_variables: true. Every report ERB template uses a small helper called Reports::StrictFetch that walks dotted paths against the captured render context and raises StrictFetchError on any unresolved segment. A template that references {{ tenant.legal_nme }} (typo) doesn't silently emit an empty string. It raises during the test, before anything ships.
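The strict-fetch idea fits in a dozen lines. This is a sketch in the spirit of Reports::StrictFetch, not its actual implementation; the names and the error message are illustrative, and the real helper runs against the captured render context inside ERB.

```ruby
# Sketch of a strict dotted-path fetch: any unresolved segment raises
# instead of silently returning nil or an empty string.
module Reports
  StrictFetchError = Class.new(StandardError)

  module StrictFetch
    def self.call(context, path)
      path.split(".").reduce(context) do |node, segment|
        unless node.is_a?(Hash) && node.key?(segment)
          raise StrictFetchError,
                "unresolved segment #{segment.inspect} in #{path.inspect}"
        end
        node[segment]
      end
    end
  end
end
```

With this in place, a typo like tenant.legal_nme fails the render in the test suite instead of shipping a blank field.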

The CI gate that locks all of this is in test/integration/report_template_lint_test.rb. It renders every report template against captured render contexts for two distinct tenants, then parses each template for f("…") calls without a default: argument to enumerate the mandatory token set. The test will fail loudly if anyone adds a new template token without extending the snapshot, or adds a field to the snapshot but forgets to update a downstream template.

This turned into the cleverest part of the suite by accident. After the first version of the test passed, I noticed it would also pass if I weakened StrictFetch to silently return nil for missing paths. So I added a deliberate-failure regression test: a synthetic broken template that references {{ tenant.this_field_does_not_exist_anywhere }}. If anyone weakens the strict-fetch contract, that meta-test fails. Without that regression test for the regression test, the gate could rot.

What Surprised Me

The byte-identical re-render test for the four report types did not work the first time, and the failure mode taught me something I would not have figured out from documentation.

I picked the legal-footer line as my canary literal: Reg: HRB-123456 · Tax: DE987654321 · Berlin, DE. The PDF rendered correctly on screen. But pdf-reader text extraction returned nil for that line. After two hours of debugging, I figured out that wkhtmltopdf encodes the · separator (the &middot; HTML entity) as a non-standard glyph escape that pdf-reader's content-stream decoder doesn't recognize, and the entire PDF text object containing that escape gets dropped from the extraction.

The fix was to pick canary literals from lines that don't contain · separators (header legal_name, address line1, contact email are all safe). But the lesson was the deeper one: byte-identical PDF re-rendering is impossible because wkhtmltopdf embeds creation timestamps and random object IDs in every render. CSV outputs are bytewise equal across renders; PDFs are content-equal via text extraction. I had to pick the right level of the immutability claim.
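The two-level claim reduces to a small predicate. This is a toy illustration: the helper name, the canary list, and the string arguments standing in for pdf-reader output are all made up for the sketch; the real comparison runs against actual rendered files.

```ruby
# Toy sketch of the two-level immutability check: CSVs must be byte-equal,
# PDFs only content-equal (wkhtmltopdf embeds timestamps and object IDs,
# so PDF bytes differ on every render). The pdf_text_* arguments stand in
# for pdf-reader text extraction, which is not shown here.
CANARY_LITERALS = ["Acme GmbH", "Hauptstraße 10"].freeze

def reprint_ok?(csv_a, csv_b, pdf_text_a, pdf_text_b)
  return false unless csv_a == csv_b  # bytewise equality for CSV
  CANARY_LITERALS.all? do |literal|   # content equality for PDF
    pdf_text_a.include?(literal) && pdf_text_b.include?(literal)
  end
end
```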

The Result

A 30-day audit-reprint test runs on every CI build. It captures a render context, generates the PDF and CSV, runs Timecop.travel(30.days), mutates the underlying tenants row, and re-renders. CSV output is exactly byte-equal. PDF text-extracted output contains the original tenant literals (Acme GmbH, Hauptstraße 10, procurement@acme-gmbh.example) and not the new ones. The same gate runs separately for the alert side: the dispatcher integration test fires an alert, mutates the tenant, runs the dispatcher, and asserts the emitted Hub event contains the pre-mutation legal name.

The architecture is denormalized. Every alert duplicates the tenant identity. Every report context duplicates the entire scoring snapshot. Storage cost is real but small (typical alert payload is ~3KB; a year of band-crossing alerts at 50 vendors per tenant is well under 100MB per tenant). The cost is paid once at write time and never paid again at read time.

The takeaway, if there is one: a snapshot has a different job from a normalized table. The normalized rows are operational state, useful for the live dashboard and the next score recompute, and entirely correct to mutate when the business changes. The frozen JSONB is the legal record. It's what you point at six months later when an auditor asks you to prove what the operator saw on a Tuesday afternoon in April.

#ruby#rails#audit-trails#jsonb#immutability#multi-tenant
