
I Spent a Week Securing Webhook Ingestion. The Real Attack Surface Was Delivery.

From the Webhook Ingestion Engine system

8 min read · Kingsley Onoh

I ran the security review two weeks after the first deployment. The ingestion side looked solid: HMAC signature verification using crypto.timingSafeEqual, rate limiting at 1,000 requests per minute, payload size capped at 1MB, idempotency deduplication on every incoming event. I was satisfied with the input boundary. Then I traced what happens after an event is accepted, through the delivery worker and out to destination URLs, and realized I'd spent a week protecting the wrong end of the system.

The ingestion endpoint validates who is sending. But the delivery worker, the component that forwards payloads to downstream URLs registered by tenants, makes outbound HTTP requests from inside the server's network. That side had no protection at all.

Two attack surfaces, nothing in common

A webhook gateway has two distinct threat models, and they require completely different defenses.

On the ingestion side, the threat is an external attacker sending malicious payloads: forged signatures, oversized bodies designed to exhaust memory, replay attacks using captured requests, and deeply nested JSON meant to overflow the parser stack. The defense strategy is conventional. Validate everything at the boundary before it touches the database.

On the delivery side, the threat comes from inside. A tenant registers a destination URL. The URL passed validation when it was created: api.customer.com resolved to 34.120.18.42, a legitimate GCP load balancer. But URLs are just DNS pointers, and DNS records change. Three weeks later, the same hostname resolves to 169.254.169.254, the cloud metadata endpoint on every major provider. The webhook gateway, running inside the VPS, dutifully POSTs the event payload to the internal metadata service.

This is Server-Side Request Forgery through DNS rebinding. The gateway becomes an open proxy for any tenant who controls their DNS records. And the creation-time URL validation caught none of it, because the DNS resolution happened at delivery time, weeks after the destination was registered.

The exposure isn't limited to metadata endpoints. Private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback addresses, the link-local range (169.254.0.0/16), IPv6 unique local addresses (fc00::/7), and protocol handlers like file:// or gopher:// are all reachable from the delivery worker's network context. The attack surface is larger than it first appears.

What made it hard

Three constraints shaped the fix.

First, the system delivers webhooks continuously. The SSRF check runs on every delivery, not just at destination creation. Any latency added to the DNS resolution and IP validation eats into the 10,000ms delivery timeout budget. The check needed to complete in single-digit milliseconds under normal conditions, which ruled out external SSRF-detection services and pre-resolution caching (stale DNS entries defeat the purpose).

Second, the existing delivery worker was the hottest code path in the system: load event, load destination, make HTTP call, record result, calculate backoff, schedule retry or promote to dead letter. Ten concurrent workers execute this pipeline on every job. Inserting validation in the middle of this sequence meant touching every execution path without breaking the retry logic, the exponential backoff calculations in calculateBackoffMs(), or the dead-letter promotion threshold.
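The post references calculateBackoffMs() without showing it. As a hedged sketch only, the usual exponential-backoff-with-full-jitter shape looks like this; the base delay, doubling factor, and cap are assumptions, not values from the real code:

```typescript
const BASE_DELAY_MS = 1_000; // assumed first-retry delay
const MAX_DELAY_MS = 3_600_000; // assumed cap of one hour

// Hypothetical sketch; the real calculateBackoffMs() is not shown in the post.
export function calculateBackoffMs(attempt: number): number {
  // Double the window each attempt: 1s, 2s, 4s, 8s, ...
  const capped = Math.min(BASE_DELAY_MS * 2 ** (attempt - 1), MAX_DELAY_MS);
  // Full jitter spreads retries so simultaneous failures don't stampede the destination
  return Math.floor(Math.random() * capped);
}
```

Full jitter (a random delay anywhere inside the window) trades predictable retry timing for much better behavior when many deliveries fail at once.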

Third, some destinations legitimately resolve to IPs that look suspicious to a naive blocklist. A customer running their webhook handler on a small hosting provider might have an IP range that borders private space. The validation needed to be strict about RFC 1918 ranges while returning clear, actionable error messages when it rejected a delivery, so tenants could debug the issue without opening a support ticket.

Six layers, added iteratively

The initial deployment shipped with HMAC signature verification and rate limiting on the ingestion endpoint. The remaining layers arrived in a separate wave of commits, all prefixed fix( rather than feat(, because each one addressed a gap I discovered after the pipeline was already handling traffic.

Ingestion input protection:

Rate limiting caps ingestion at 1,000 requests per minute per source. Payload size is limited to 1MB. JSON parsing enforces a nesting depth limit of 20 levels (MAX_JSON_DEPTH in src/server.ts) to prevent stack overflow from recursive structures.

HMAC signature verification uses crypto.timingSafeEqual to prevent timing side-channels. The verifySignature() function in src/ingestion/signature.ts supports both hmac-sha256 and hmac-sha1, strips algorithm prefixes (GitHub sends sha256=<hex>, Stripe uses t=<unix>,v1=<hex>), and applies a configurable timestamp tolerance. The extractTimestampFromHeader() function parses Stripe's format specifically: t=<unix_seconds>, at the beginning of the header value, converted to milliseconds. Any signature older than 5 minutes (the SIGNATURE_TOLERANCE_MS default of 300,000ms) is rejected before the HMAC is even computed.
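A minimal sketch of the Stripe-style path described above, assuming the header shape t=<unix>,v1=<hex>; the real verifySignature() also handles hmac-sha1 and GitHub's sha256= prefix:

```typescript
import crypto from "node:crypto";

const SIGNATURE_TOLERANCE_MS = 300_000; // the 5-minute default from the post

// Simplified sketch of the Stripe-style verification path only.
export function verifySignature(
  rawBody: Buffer,
  header: string, // e.g. "t=1700000000,v1=<hex>"
  secret: string,
  nowMs: number = Date.now()
): boolean {
  const parts = Object.fromEntries(
    header.split(",").map((p) => p.split("=") as [string, string])
  );
  const timestampMs = Number(parts.t) * 1000;
  // Reject stale signatures before any HMAC is computed
  if (!Number.isFinite(timestampMs) || Math.abs(nowMs - timestampMs) > SIGNATURE_TOLERANCE_MS) {
    return false;
  }

  // Stripe signs "<timestamp>.<raw body>" with the endpoint secret
  const expected = crypto
    .createHmac("sha256", secret)
    .update(`${parts.t}.${rawBody.toString("utf8")}`)
    .digest();
  const received = Buffer.from(parts.v1 ?? "", "hex");
  // timingSafeEqual throws on length mismatch, so check length first
  return received.length === expected.length && crypto.timingSafeEqual(expected, received);
}
```

Note the ordering: the cheap timestamp check runs first, so replayed captures are rejected without spending a constant-time comparison on them.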

Delivery output protection:

Layer one is validateDestinationUrl() in src/lib/url-validator.ts, called at destination creation. It rejects non-HTTP protocols immediately. It blocks localhost and checks raw IP addresses against private ranges via isPrivateIp(). The IPv4 check walks the octets: 127.x is loopback, 10.x is private, 172.16-31.x is private, 192.168.x is private, 169.254.x is the link-local range that covers cloud metadata endpoints on AWS and GCP. IPv6 gets separate handling: ::1 (loopback) and fc00::/7 (unique local, covering both fc and fd prefixes).
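A sketch of that octet walk, reconstructed from the ranges listed above; the real isPrivateIp() in src/lib/url-validator.ts may differ in detail:

```typescript
import net from "node:net";

// Reconstructed from the ranges described in the post; a sketch, not the real code.
export function isPrivateIp(ip: string): boolean {
  if (net.isIP(ip) === 4) {
    const [a, b] = ip.split(".").map(Number);
    return (
      a === 127 ||                         // loopback
      a === 10 ||                          // RFC 1918
      (a === 172 && b >= 16 && b <= 31) || // RFC 1918
      (a === 192 && b === 168) ||          // RFC 1918
      (a === 169 && b === 254)             // link-local / cloud metadata
    );
  }
  if (net.isIP(ip) === 6) {
    const lower = ip.toLowerCase();
    // ::1 is loopback; fc00::/7 covers both fc00:: and fd00:: unique-local space
    return lower === "::1" || lower.startsWith("fc") || lower.startsWith("fd");
  }
  return false; // not an IP literal
}
```

One known gap in this naive shape: an IPv4-mapped IPv6 address like ::ffff:10.0.0.1 slips past the IPv6 branch, which is exactly the kind of hole that makes the later layers worth having.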

Layer two is resolveAndValidateUrl(), called before every delivery. This is the DNS rebinding defense:

```typescript
// src/lib/url-validator.ts
import { promises as dns } from "node:dns";
import net from "node:net";

export async function resolveAndValidateUrl(url: string): Promise<string> {
  validateDestinationUrl(url); // layer one: static checks first

  const parsed = new URL(url);
  // Strip the brackets Node keeps around IPv6 literals, e.g. "[::1]" -> "::1"
  const hostname = parsed.hostname.replace(/^\[|\]$/g, "");
  if (net.isIP(hostname)) return url; // literal IPs were already checked above

  // Resolve A and AAAA records in parallel; a failed lookup yields []
  const [addresses4, addresses6] = await Promise.all([
    dns.resolve4(hostname).catch(() => []),
    dns.resolve6(hostname).catch(() => []),
  ]);
  const allAddresses = [...addresses4, ...addresses6];

  // Total resolution failure falls through: the delivery fetch will surface
  // the network error on its own
  if (allAddresses.length === 0) return url;

  for (const addr of allAddresses) {
    if (isPrivateIp(addr)) {
      throw new UnsafeUrlError(
        `DNS rebinding detected: '${hostname}' resolves to private IP '${addr}'`
      );
    }
  }
  return url;
}
```

Every A and AAAA record is checked. If any resolved IP falls in a private range, the delivery is rejected with a specific error message naming the hostname and the offending address. The IPv4 and IPv6 lookups run in parallel to minimize latency. If DNS resolution fails entirely, the URL is allowed through: the subsequent fetch will fail with an ordinary network error that the retry logic already handles, so the lookup failure is the only error deliberately suppressed.

I got this wrong the first time. The initial implementation only had layer one: creation-time validation. I assumed that if the URL was safe when the tenant registered it, it would stay safe. It took reading through SSRF post-mortems to realize that DNS-based attacks bypass creation-time checks entirely. The fix shipped as its own commit: fix(lib): add SSRF protection, helmet headers, raw body HMAC, patch drizzle CVE.

The delivery worker also caps redirect chains at 3 hops. The deliverWebhook() function in src/delivery/http-client.ts follows redirects manually instead of relying on fetch's default behavior, and truncates response bodies to 4,096 bytes to prevent a destination from returning a multi-megabyte response that fills the deliveries table.
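A sketch of that manual redirect loop, under the 3-hop and 4,096-byte limits the post names; the constant names are assumptions, and the per-hop SSRF re-check is reduced to a comment:

```typescript
const MAX_REDIRECTS = 3; // hop cap from the post
const MAX_RESPONSE_BYTES = 4_096; // stored-body cap from the post

// Sketch of manual redirect following; not the real deliverWebhook().
export async function deliverWebhook(
  url: string,
  payload: string
): Promise<{ status: number; body: string }> {
  let currentUrl = url;
  for (let hop = 0; hop <= MAX_REDIRECTS; hop++) {
    const res = await fetch(currentUrl, {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: payload,
      redirect: "manual", // never let fetch follow redirects on its own
    });
    const location = res.headers.get("location");
    if (res.status >= 300 && res.status < 400 && location) {
      // The real worker would re-run the SSRF validation on each hop here
      currentUrl = new URL(location, currentUrl).toString();
      continue;
    }
    // Truncate by bytes, not characters, so a hostile destination can't
    // bloat the deliveries table
    const raw = Buffer.from(await res.text(), "utf8");
    return { status: res.status, body: raw.subarray(0, MAX_RESPONSE_BYTES).toString("utf8") };
  }
  throw new Error(`Redirect chain exceeded ${MAX_REDIRECTS} hops`);
}
```

redirect: "manual" is the load-bearing choice: fetch's default follow mode would chase a redirect into private address space before any validation could run.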

Data protection:

Header sanitization strips Authorization and Cookie headers from stored payloads at ingestion time. These headers might be present in the original webhook request (some providers include auth tokens), and forwarding them to a third-party destination would leak credentials. Signing secrets for HMAC verification are encrypted at rest with AES-256-GCM. The encrypt() function in src/lib/crypto.ts generates a random 12-byte IV per encryption and packs IV, auth tag, and ciphertext into a base64 string. The decryption key lives in an environment variable (SIGNING_SECRET_KEY), separate from the database. A breach that exposes the sources table gives the attacker ciphertext, not usable keys.
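The IV + tag + ciphertext packing described above can be sketched as follows, with the key passed explicitly rather than read from SIGNING_SECRET_KEY; the unpacking order in decrypt() is an assumption about the real layout:

```typescript
import crypto from "node:crypto";

const IV_LENGTH = 12; // GCM's recommended nonce size, per the post
const TAG_LENGTH = 16; // standard GCM auth tag size

// Sketch of the packing scheme described above; not the real src/lib/crypto.ts.
export function encrypt(plaintext: string, key: Buffer): string {
  const iv = crypto.randomBytes(IV_LENGTH); // fresh random IV per call
  const cipher = crypto.createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  // Pack IV + auth tag + ciphertext into a single base64 string
  return Buffer.concat([iv, cipher.getAuthTag(), ciphertext]).toString("base64");
}

export function decrypt(packed: string, key: Buffer): string {
  const raw = Buffer.from(packed, "base64");
  const iv = raw.subarray(0, IV_LENGTH);
  const tag = raw.subarray(IV_LENGTH, IV_LENGTH + TAG_LENGTH);
  const ciphertext = raw.subarray(IV_LENGTH + TAG_LENGTH);
  const decipher = crypto.createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag); // GCM fails decryption if the tag doesn't verify
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```

Because GCM authenticates as well as encrypts, a tampered ciphertext or wrong key makes decipher.final() throw rather than return garbage.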

What changed

The security hardening added roughly 400 lines of validation code to a system that was already functionally complete. The url-validator.ts module alone is 195 lines of checks that don't change what the system does. They change what it refuses to do.

Before: the delivery worker would POST to any URL a tenant registered, follow unlimited redirects, and forward all original headers. After: every delivery passes through protocol filtering, hostname blocking, static IP range checks, dynamic DNS resolution with IP validation, redirect limiting at 3 hops, header sanitization, and response body truncation at 4,096 bytes.

The ingestion side now has signature verification with timing-safe comparison and 5-minute stale tolerance, payload caps at 1MB, and JSON depth limiting at 20 levels. Six overlapping layers, three on each side of the persist-before-process boundary.

None of these layers trust the previous one. The delivery-time DNS check doesn't assume the creation-time URL check caught everything. The JSON depth limit doesn't assume the payload size limit prevented pathological inputs. Every layer operates under the assumption that the one before it was bypassed or insufficient. When you build infrastructure that accepts input from the internet and acts on it inside your network, the input validation is the obvious problem. The delivery side, where your system becomes a network actor on behalf of untrusted tenants, is where the real exposure hides.

#security #ssrf #webhooks #defense-in-depth #typescript
