Tenant A blocked correctly at request 11. Tenant B also blocked at request 11.
Phase 7 added a rate limit on POST /api/events that reads from tenants.config.rate_limits.events_per_minute. The default cap is 200 requests per minute; a tenant who needs more can be raised as high as 1000 via an admin PATCH. The integration test seeded two tenants (tenant A with a cap of 10, tenant B bumped to 100) and asserted that tenant A got blocked at request 11 while tenant B kept going: tenant B's eleventh request was supposed to sail through.
The resolver function was fine. The unit tests on resolveTenantEventsRateLimit() all passed. Pass a tenant config with no override, get 200. Pass a tenant with events_per_minute: 100, get 100. Pass null, get 200. Pass anything over 1000, get clamped to 1000. Five tests, all green. The function did exactly what it was supposed to do.
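For reference, that contract in sketch form. The shape of the tenant object here is an assumption (the project keeps the override under tenants.config.rate_limits.events_per_minute); the behavior is the one the unit tests describe.

// Sketch of the resolver contract: default 200, per-tenant override, clamp at 1000.
// TenantLike is an assumed shape, not the project's real type.
const DEFAULT_EVENTS_PER_MINUTE = 200;
const MAX_EVENTS_PER_MINUTE = 1000;

type TenantLike = {
  config?: {
    rate_limits?: { events_per_minute?: number };
    // ...channel credentials, dedup window, sandbox flag
  } | null;
} | null;

export function resolveTenantEventsRateLimit(tenant: TenantLike): number {
  const override = tenant?.config?.rate_limits?.events_per_minute;
  if (typeof override !== 'number' || !Number.isFinite(override) || override <= 0) {
    return DEFAULT_EVENTS_PER_MINUTE;                // null tenant or no override
  }
  return Math.min(override, MAX_EVENTS_PER_MINUTE);  // clamp anything over 1000
}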
The function was being called with null.
Where the Resolver Was Called With Null
@fastify/rate-limit accepts a max option that can be a function. The function receives the request and returns the cap to apply. Putting the resolver behind that callback was the design: each request asks "what is this tenant's limit?" at the moment the rate check runs, so changing a tenant's config takes effect immediately without a process restart.
For the resolver to do its job, it needs request.tenant already populated. That happens in authPlugin, which reads the X-API-Key header, looks up the tenant row, and attaches the tenant config to the request. The auth plugin runs in the onRequest lifecycle hook.
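Roughly, that plugin has the shape sketched below. The lookup helper and the tenant type are stand-ins; only the X-API-Key header and the request.tenant / request.tenantId decorations come from the actual setup.

// Rough shape of authPlugin: one onRequest hook that resolves the tenant from
// X-API-Key and attaches it to the request. lookupTenantByApiKey is a stand-in
// for the real database lookup.
import fp from 'fastify-plugin';

type Tenant = { id: string; config: Record<string, unknown> };

declare module 'fastify' {
  interface FastifyRequest {
    tenant: Tenant | null;
    tenantId: string | null;
  }
}

declare function lookupTenantByApiKey(key: string): Promise<Tenant | null>;

export default fp(async (app) => {
  app.decorateRequest('tenant', null);
  app.decorateRequest('tenantId', null);

  app.addHook('onRequest', async (request, reply) => {
    const apiKey = request.headers['x-api-key'];
    const tenant = typeof apiKey === 'string' ? await lookupTenantByApiKey(apiKey) : null;
    if (!tenant) {
      return reply.code(401).send({ error: 'invalid API key' });
    }
    request.tenant = tenant;       // what the rate-limit resolver reads
    request.tenantId = tenant.id;  // what the keyGenerator reads
  });
});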
By default, @fastify/rate-limit also runs in onRequest. Fastify fires hooks in registration order within the same lifecycle stage. The rate-limit plugin was registered before the route was registered, and the route's auth runs as part of the route's plugin chain. The two onRequest hooks fired in an order that meant rate-limit's max callback ran first, with request.tenant still undefined.
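A simplified picture of how the pieces were composed, with the plugin layout and import paths assumed rather than copied from the codebase:

// Registration order, simplified. The rate-limit plugin is registered first at the
// top level (per-route limits only); auth and the events routes come later, inside
// their own plugin scope.
import Fastify from 'fastify';
import rateLimit from '@fastify/rate-limit';
import authPlugin from './plugins/auth';
import eventsRoutes from './routes/events.routes';

const app = Fastify();

await app.register(rateLimit, { global: false }); // its per-route check defaults to onRequest

await app.register(async (scope) => {
  await scope.register(authPlugin);   // onRequest hook: populates request.tenant
  await scope.register(eventsRoutes); // POST /api/events carries config.rateLimit
});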
The resolver, faithful to its contract, returned the default 200 for every request, because every request looked like a tenant with no config override. Tenant A with its cap of 10 and tenant B with its cap of 100 both got the same global cap. The endpoint behaved exactly as if the per-tenant feature didn't exist, with no error and no warning, because both halves of the system were doing what they were told.
Why Moving Auth Wasn't an Option
The fix could not move auth. authPlugin runs in onRequest for every protected route in the entire API, and most of those routes do not need rate limiting. Pushing auth to preHandler to accommodate one route changes the lifecycle for every route, rewriting a documented contract for the sake of one accidental dependency. That is the sort of fix that creates three new bugs to make one go away.
The fix also could not run rate-limit globally with a different hook. The rate-limit plugin can be registered globally with a custom hook stage, but that overrides the stage for all rate-limited routes. Every other rate-limited route in the codebase, including the admin endpoints with their own static caps, would be affected by a change made for events.routes.ts.
What was needed was a per-route override that moved only this rate check to a later hook stage, leaving the rest of the system on defaults.
hook: 'preHandler' Plus a Real keyGenerator
@fastify/rate-limit exposes a per-route option called hook inside the route's config.rateLimit object. Setting it to 'preHandler' means this specific route's rate check fires in the preHandler stage instead of onRequest. Auth has already run by then. request.tenant is populated. The resolver gets a real tenant config and returns the right cap.
config: {
  rateLimit: {
    hook: 'preHandler' as const,
    max: (request) => resolveTenantEventsRateLimit(request.tenant ?? null),
    keyGenerator: (request) => request.tenantId ?? request.ip ?? 'anonymous',
    timeWindow: '1 minute',
  },
},
Three things in that block matter, and the hook line is only one of them.
The keyGenerator is the second multi-tenant fix. The default key generator is the request IP. In a single-tenant deployment that is fine. In a multi-tenant deployment behind a shared NAT (two tenants whose servers happen to live in the same cloud region, two CI pipelines running tests behind the same egress address) the bucket gets shared. Tenant A burns through tenant B's allowance. The cap stops being per-tenant in a different way: the resolver returns the right number, but the bucket it counts against is wrong.
Setting keyGenerator to request.tenantId partitions the bucket per authenticated tenant. The fallback to request.ip covers the unauthenticated case (which the route does not allow, but defensive code is cheap and explicit defaults beat surprising ones).
The third thing is the resolver clamp at 1000. The admin route validates input at 1-1000, but a tenant whose config gets corrupted, or an old tenant from before the validation existed, could in principle have a number out of range stored in their JSONB config. The resolver caps the return value defensively. A misconfigured tenant cannot DoS the platform by setting events_per_minute: 999999 directly in the database.
Why a Single-Tenant Test Would Have Shipped This
The hook-ordering issue was not in the documentation for @fastify/rate-limit because it is not a bug in the plugin. The plugin defaults to onRequest because that is the right stage for rate limiting in 99% of cases. You want to reject over-limit requests as early as possible, before any work happens. The Hub's case is the 1% where the rate check depends on data that is only available after another hook has run. The plugin gives you the escape hatch, but you only know to use it once you have hit the problem.
The integration test that caught this was not a sophisticated test. It seeded two tenants, fired eleven requests for each, and asserted on the response codes. An earlier draft only seeded tenant A with a cap of 10 and asserted it got rate-limited at request 11. Tenant A was the only tenant. The test passed. I added tenant B with cap 100 specifically to verify that the per-tenant bucket worked, and the test failed with both tenants getting blocked.
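In outline, the check looks something like the sketch below; buildApp, seedTenant, and the event payload are stand-ins for the project's real test helpers.

// Sketch of the two-tenant check: eleven requests per tenant, assert on status codes.
import { test } from 'node:test';
import assert from 'node:assert/strict';
import type { FastifyInstance } from 'fastify';

declare function buildApp(): Promise<FastifyInstance>;
declare function seedTenant(overrides: { events_per_minute: number }): Promise<{ id: string; apiKey: string }>;

test('events rate limit is scoped per tenant', async () => {
  const app = await buildApp();
  const tenantA = await seedTenant({ events_per_minute: 10 });
  const tenantB = await seedTenant({ events_per_minute: 100 });

  const fire = (apiKey: string) =>
    app.inject({
      method: 'POST',
      url: '/api/events',
      headers: { 'x-api-key': apiKey },
      payload: { type: 'demo.event' }, // placeholder payload
    });

  for (let i = 1; i <= 10; i++) {
    assert.notEqual((await fire(tenantA.apiKey)).statusCode, 429);
    assert.notEqual((await fire(tenantB.apiKey)).statusCode, 429);
  }
  assert.equal((await fire(tenantA.apiKey)).statusCode, 429);    // request 11 for A: blocked
  assert.notEqual((await fire(tenantB.apiKey)).statusCode, 429); // request 11 for B: allowed
});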
If I had only tested one tenant, the rate limit would have shipped looking like it worked. Every tenant on the platform would silently be on the same global cap, and the only way to discover it would be a tenant raising a support ticket about being blocked at unexpected request counts. That is the kind of bug that runs in production for months because nothing visibly breaks.
The fix was far simpler than the diagnosis. Changing hook to 'preHandler' and the keyGenerator to request.tenantId made the assertion pass. Total LOC change: two lines. Total time spent debugging from "why does tenant B also block" to "Fastify hook ordering": about an hour, most of it spent printing request.tenant from inside the max callback and confirming it was undefined.
What Else the Patch Quietly Got Right
Phase 7 ships with per-tenant rate limiting that actually scopes per tenant. Tenant default is 200 requests per minute, raised to a configurable value via PATCH /api/admin/tenants/:id/rate-limit, capped defensively at 1000. The admin route uses spread-merge so updating rate_limits.events_per_minute preserves the rest of tenants.config (the channel credentials, the dedup window, the sandbox flag) without overwriting them. The PATCH was simple to write. Forgetting to use spread-merge would have been a different production incident: an admin update silently wiping every tenant's Resend API key.
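The merge itself is worth spelling out. A minimal sketch, with the row read and write abstracted behind stand-in helpers:

// Sketch of the spread-merge: only rate_limits.events_per_minute changes; everything
// else already in tenants.config (credentials, dedup window, sandbox flag) is carried
// over untouched. loadTenantConfig and saveTenantConfig are stand-ins for the real I/O.
type TenantConfig = Record<string, unknown> & {
  rate_limits?: { events_per_minute?: number };
};

declare function loadTenantConfig(tenantId: string): Promise<TenantConfig>;
declare function saveTenantConfig(tenantId: string, config: TenantConfig): Promise<void>;

async function updateEventsRateLimit(tenantId: string, eventsPerMinute: number) {
  const existing = await loadTenantConfig(tenantId);

  const next: TenantConfig = {
    ...existing,                       // keep every other config key as-is
    rate_limits: {
      ...existing.rate_limits,         // keep any sibling limits as-is
      events_per_minute: eventsPerMinute,
    },
  };

  await saveTenantConfig(tenantId, next);
  return next;
}

The naive version, which writes { rate_limits: { events_per_minute } } as the whole config, is the one that wipes the credentials.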
Test coverage went beyond the spec. The resolver got five tests covering null config, missing override, explicit override, cap-at-1000, and the default. The integration suite got five more covering the per-tenant bucket, admin happy path, range validation at both ends, 404 on missing tenant, and config preservation across the update. The admin tests assert that channel credentials and dedup_window survive the PATCH untouched. The spread-merge contract is the thing that has to keep working, not just on the day it was written, but on every future change to the admin route.
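The preservation check itself is blunt. Continuing with the same stand-ins as the test sketch above (and with admin auth left out for brevity), it amounts to:

// Sketch: update the limit via the admin route, then assert that unrelated keys in
// tenants.config survived the spread-merge. Helpers and field names are stand-ins.
test('PATCH preserves the rest of tenants.config', async () => {
  const app = await buildApp();
  const tenant = await seedTenant({ events_per_minute: 200 });

  const res = await app.inject({
    method: 'PATCH',
    url: `/api/admin/tenants/${tenant.id}/rate-limit`,
    payload: { events_per_minute: 500 },
  });
  assert.equal(res.statusCode, 200);

  const config = await loadTenantConfig(tenant.id); // stand-in for re-reading the row
  assert.equal(config.rate_limits?.events_per_minute, 500);
  assert.ok(config.channels);      // channel credentials still present
  assert.ok(config.dedup_window);  // dedup window still present
});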
The hook-ordering trick is now in the project's pattern 004 (per-route rate limit) as a note for the next route that needs a dynamic per-tenant cap. It will not be the last one. The same pattern will apply to per-tenant request limits on the suppressions endpoints, the templates endpoints, and any future endpoint where the cap depends on tenant config.
The fix for the ordering bug is one line. The understanding it required is the lifecycle of every plugin in the request chain and the order in which Fastify will call them. That is the kind of detail that does not show up in a unit test for a resolver function.