Architectural Brief: Workflow Automation Engine
Every integration project starts the same way. Receive a webhook, call an API, check the result, do something depending on the response. The logic is simple. The plumbing isn't. Retry handling, state tracking, failure recovery, tenant isolation, audit logging. Four steps of business logic surrounded by hundreds of lines of infrastructure code. This system extracts that infrastructure into a standalone engine so the business logic is all that remains.
System Topology
Infrastructure Decisions
- Language and Framework: Python 3.12 with FastAPI. Chose over Go because the broader ecosystem (Notification Hub, Webhook Engine) already uses Python and FastAPI. Sharing patterns, middleware, and deployment conventions across services reduces onboarding friction. Async-first design via `asyncio` handles the I/O-bound workload without threading overhead: HTTP step calls, webhook reception, and database writes are all non-blocking.
- Data Layer: PostgreSQL 16 with JSONB columns for workflow definitions and execution context. Chose over MongoDB because the relationships between tenants, workflows, executions, and steps are well-defined and need transactional consistency. JSONB gives the flexibility of document storage where it's needed (step configs, trigger payloads, per-execution context) without giving up ACID guarantees on the relational structure. Six tables, all with Row-Level Security policies enforcing tenant isolation at the database layer.
- Task Queue: arq with Redis as broker. Chose over Celery because arq requires no separate broker service. Redis was already required for rate limiting and the in-memory workflow cache. arq uses the same Redis instance, runs natively with asyncio, and the entire queue infrastructure adds zero new services to the deployment. Celery would have introduced RabbitMQ or a dedicated Redis instance for minimal benefit at 10 concurrent workers.
- Expression Engine: Jinja2 SandboxedEnvironment for user-provided data-mapping expressions. Chose over a custom expression parser because Jinja2 supports nested object access (`{{ steps.fetch.output.body.name }}`), has been battle-tested against injection attacks, and the sandboxed variant blocks file system access, arbitrary imports, and code execution. Custom expression parsers accumulate security debt silently. `StrictUndefined` mode means missing variables raise errors instead of rendering as empty strings.
- Reverse Proxy: Traefik with automatic TLS via Let's Encrypt. Chose over Nginx because Traefik discovers Docker containers automatically via labels and handles certificate renewal without cron jobs or manual config. The app container doesn't touch certificates.
- Tenant Isolation: PostgreSQL Row-Level Security on all six tables. Chose over application-only `WHERE tenant_id = ...` filtering because one missed filter clause in any query leaks data across tenants. RLS enforces isolation at the database layer regardless of application bugs. The middleware sets `app.current_tenant_id` per request, and PostgreSQL applies the policy automatically.
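The expression-engine behavior described above can be sketched in a few lines. The shape of the `steps` context and the example values are assumptions for illustration; the point is that the sandboxed environment evaluates nested access while `StrictUndefined` turns a typo into an error instead of an empty string:

```python
from jinja2 import StrictUndefined, UndefinedError
from jinja2.sandbox import SandboxedEnvironment

# Sandboxed environment blocks file system access, arbitrary imports,
# and code execution; StrictUndefined makes missing variables raise.
env = SandboxedEnvironment(undefined=StrictUndefined)

# Hypothetical execution context: prior step outputs keyed by step name.
context = {"steps": {"fetch": {"output": {"body": {"name": "Ada"}}}}}

template = env.from_string("Hello, {{ steps.fetch.output.body.name }}")
print(template.render(context))  # -> Hello, Ada

# A typo in the expression raises UndefinedError rather than rendering "".
try:
    env.from_string("{{ steps.fetch.output.body.nmae }}").render(context)
except UndefinedError as exc:
    print("rejected:", exc)
```

Failing loudly on missing variables matters in a data-mapping context: an empty string silently flowing into a downstream HTTP step is much harder to debug than an execution that errors at the mapping step.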
Constraints That Shaped the Design
- Input: Webhook payloads (POST with HMAC-SHA256 signature validation), cron schedules (5-field cron expressions via APScheduler), or manual API calls with optional trigger data. Maximum 1MB payload size.
- Output: Execution records with per-step input/output stored as JSONB, structured execution logs (30-day retention), webhook delivery audit trail (7-day retention), and real-time SSE streams for in-progress executions.
- Scale Handled: 50 steps per workflow, 10 concurrent steps per execution, 50 active cron workflows per deployment. Connection pool supports 15 concurrent database connections per process (pool_size=5, max_overflow=10). At 100+ concurrent executions across multiple app instances, the pool needs PgBouncer in front.
- Hard Constraints: 300-second execution timeout. 30-second HTTP step timeout. 1MB cap on webhook payloads and HTTP response bodies. Maximum 3 levels of sub-workflow nesting. DAG depth capped at 20 steps in the longest dependency chain (hardcoded in graph.py). All state transitions persist to PostgreSQL before the next step executes. Rate limiting at 100 requests per minute per tenant (5/min for registration, 20/min for write operations).
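The HMAC-SHA256 validation mentioned for webhook payloads can be sketched with the standard library; the header name, secret, and payloads here are illustrative assumptions, not the engine's actual values:

```python
import hashlib
import hmac

def sign(secret: bytes, payload: bytes) -> str:
    # The sender computes this over the raw request body and sends it in a
    # header (e.g. a hypothetical X-Signature-256).
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()

def verify(secret: bytes, payload: bytes, signature: str) -> bool:
    # compare_digest is constant-time, preventing timing attacks on the check.
    expected = sign(secret, payload)
    return hmac.compare_digest(expected, signature)

secret = b"per-endpoint-shared-secret"
payload = b'{"event": "user.created"}'
sig = sign(secret, payload)

print(verify(secret, payload, sig))                   # True
print(verify(secret, b'{"event": "tampered"}', sig))  # False: body changed
```

Because the signature covers the payload body, a captured signature only ever validates that exact body, and the secret itself never travels over the wire.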
Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| arq task queue over Celery | Celery with RabbitMQ | arq uses the existing Redis instance. No separate broker to deploy, monitor, or pay for. At 10 concurrent workers, Celery's distributed features add operational complexity without proportional benefit. |
| PostgreSQL RLS over app-layer filtering | WHERE tenant_id = X in every query | One missed filter clause leaks all tenant data. RLS enforces isolation at the database level. Application queries can still filter, but RLS is the safety net that catches bugs before they become breaches. |
| Persist-before-execute state model over fire-and-forget | Queue-only state with eventual DB sync | If the worker crashes between steps, fire-and-forget loses the execution state. Persisting each transition to PostgreSQL before proceeding means a crash leaves every execution in a known, recoverable state. |
| Jinja2 SandboxedEnvironment over custom expression DSL | Building a minimal expression parser | Custom parsers start simple and accumulate security holes as users request more features. Jinja2 with SandboxedEnvironment and StrictUndefined prevents both injection attacks and silent failures on missing variables. |
| HMAC-SHA256 webhook validation over static bearer tokens | Bearer token per webhook endpoint | HMAC signs the payload body. A captured signature is only valid for that specific request. Static tokens, once leaked, work for any request until rotated. The webhook secret doesn't travel over the wire. |
| In-memory LRU cache (256 entries, 300s TTL) over distributed Redis cache | Redis-backed cache for workflow definitions | Single-instance deployment means in-memory cache is faster and simpler. The 300-second TTL accepts up to 5 minutes of staleness on workflow updates. This becomes a gap at multi-instance scale: cache invalidation would need Redis pub/sub, which is the known architectural debt. |
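The last decision's cache (256 entries, 300-second TTL) could look roughly like the following minimal sketch. `functools.lru_cache` has no TTL, so this uses an `OrderedDict`; the class and key names are illustrative, not the engine's actual code:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal LRU cache with a per-entry TTL (illustrative sketch)."""

    def __init__(self, maxsize: int = 256, ttl: float = 300.0):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data: OrderedDict = OrderedDict()  # key -> (expires_at, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            del self._data[key]          # stale entry: evict, report a miss
            return None
        self._data.move_to_end(key)      # mark as most recently used
        return value

    def put(self, key, value):
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = (time.monotonic() + self.ttl, value)
        if len(self._data) > self.maxsize:
            self._data.popitem(last=False)  # evict least recently used

cache = TTLCache(maxsize=256, ttl=300.0)
cache.put("wf:123", {"name": "sync-users"})
print(cache.get("wf:123"))   # cached workflow definition
print(cache.get("wf:999"))   # None: miss, caller falls back to PostgreSQL
```

Each process holds its own copy, which is exactly why this design tops out at single-instance scale: a second app instance would serve up to 300 seconds of stale definitions until invalidation moves to Redis pub/sub.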