# Architectural Brief: Clinical Scheduling Engine
Five providers, eight rooms, and six appointment types, each with different availability windows, buffer requirements, daily caps, and equipment needs. A single "find available slots" request has to evaluate all of these constraints simultaneously, compute every valid combination, score the candidates, and return a sorted list. The performance target: under 500 milliseconds.
The system doesn't store pre-generated slots. It computes them from constraints on every request.
## System Topology
### Infrastructure Decisions
- Language and Framework: Python 3.12 with FastAPI. Chose over Django because this is a single-purpose scheduling API with no admin panel, no server-rendered templates, no form handling. FastAPI's native async support lets the optimizer run multiple database queries concurrently during slot computation.
- Data Layer: PostgreSQL 16 with SQLAlchemy 2.x (async mode) and asyncpg. Chose over SQLite because double-booking prevention depends on database-level UNIQUE constraints under concurrent writes. SQLite's file-level locking serializes all writes; PostgreSQL handles 10 simultaneous INSERT attempts for the same slot and lets the database pick the winner.
- Background Jobs: APScheduler running inside the FastAPI process. Chose over Celery because the system has exactly two scheduled tasks: marking no-shows and computing stats. Both are single-query operations running hourly. Celery would add a message broker (Redis or RabbitMQ), a separate worker process, and monitoring infrastructure for work that takes under a second per run.
- Caching: In-memory TTL cache with 60-second expiry for provider availability windows. Chose over Redis because this is a single-process deployment. Adding Redis for one cache key introduces a new container, a connection pool, and a failure mode. The cache stores availability windows (which change weekly), not computed slots (which change with every booking).
- Deployment: Multi-stage Docker build with a non-root production user. PostgreSQL and the app container sit on an internal network. Traefik handles TLS termination with automatic Let's Encrypt renewal. Chose Traefik over Nginx because Traefik reads Docker labels for routing, so adding a new service means adding container labels rather than editing Nginx config files and reloading.
- Event Emission: Fire-and-forget HTTP to the Notification Hub. Chose over synchronous webhooks or a message queue because a booking should never fail due to an external service outage. The hub client wraps all errors in a warning log and returns. Booking operations complete regardless of notification delivery.
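The fire-and-forget pattern behind the Event Emission decision can be sketched in a few lines. This is a minimal illustration, not the system's actual client: `post_event` is a hypothetical stand-in for the real HTTP POST (the brief names a 5-second timeout but not the client library or endpoint), and it is rigged to fail here to show that booking still completes.

```python
import asyncio
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger("notification_hub")

async def post_event(payload: dict) -> None:
    # Stand-in for the real HTTP POST to the Notification Hub.
    # Simulates a hub outage so the error path is exercised.
    raise ConnectionError("hub unreachable")

async def emit_event(payload: dict) -> None:
    """Fire-and-forget: wrap every failure in a warning log and return."""
    try:
        # 5-second cap so a slow hub cannot block the booking path.
        await asyncio.wait_for(post_event(payload), timeout=5.0)
    except Exception as exc:
        logger.warning("notification delivery failed: %s", exc)

async def book() -> str:
    # The booking operation completes regardless of notification delivery.
    await emit_event({"event": "booking.created", "booking_id": "b-123"})
    return "CONFIRMED"

print(asyncio.run(book()))  # logs a warning, then prints CONFIRMED
```

The key property is that `emit_event` has no failure mode visible to its caller: every exception terminates inside the function as a warning log.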
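The in-memory TTL cache from the Caching decision amounts to very little code, which is part of the argument against Redis. A minimal sketch along those lines (class and key names are illustrative; a library such as `cachetools` provides an equivalent `TTLCache`):

```python
import time
from typing import Any, Callable

class TTLCache:
    """Minimal single-process TTL cache; entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = time.monotonic()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]            # fresh entry: skip the loader entirely
        value = loader()             # expired or missing: reload and restamp
        self._store[key] = (now, value)
        return value

cache = TTLCache(ttl_seconds=60.0)

calls = 0
def load_windows():
    """Stand-in for the availability-window database query."""
    global calls
    calls += 1
    return {"dr-smith": [("09:00", "17:00")]}

cache.get_or_load("availability_windows", load_windows)
cache.get_or_load("availability_windows", load_windows)  # served from cache
print(calls)  # 1 -- the loader ran only once
```

Moving to a multi-process deployment is exactly where this breaks (each worker holds its own copy), which is why the brief flags Redis as the successor at larger scale.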
## Constraints That Shaped the Design
- Input: A target date, an appointment type UUID, and optionally a provider filter and preferred time window. Everything else (availability, rooms, bookings, overbooking rules) is loaded from the database.
- Output: Scored, sorted candidate slots. Each slot includes provider, room, time range, quality score, and an overbooking flag. When no slots exist for the requested date, the system scans forward up to 14 days and returns the next available date.
- Scale Handled: 5 providers, 8 rooms, and a typical day of bookings. Slot computation completes in under 500 milliseconds (verified by automated performance benchmark). At 50 providers with 100 rooms, the per-provider query pattern would need batching, and the in-memory cache would need to move to Redis for multi-process deployments.
- Hard Constraints: Double-booking prevention enforced by two PostgreSQL UNIQUE constraints: one on `(provider_id, date, start_time)` and one on `(room_id, date, start_time)`. Under a concurrent booking stress test, 10 parallel requests for the same slot produce exactly 1 success and 9 clean SLOT_UNAVAILABLE errors.
- Rate Limits: Slot queries at 200 requests per minute, booking creation at 100/min, cancellations at 50/min, stats at 60/min. Enforced by an in-memory sliding-window rate limiter at the middleware layer.
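The UNIQUE-constraint guarantee can be demonstrated with a small sketch. The real system uses PostgreSQL; `sqlite3` is used here only because it is in the standard library and enforces UNIQUE constraints the same way at INSERT time. The schema is a hypothetical reduction of the real bookings table, and `SLOT_UNAVAILABLE` mirrors the error code named in the brief:

```python
import sqlite3

# Reduced illustrative schema: both UNIQUE constraints from the brief.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bookings (
        id          INTEGER PRIMARY KEY,
        provider_id TEXT NOT NULL,
        date        TEXT NOT NULL,
        start_time  TEXT NOT NULL,
        room_id     TEXT NOT NULL,
        UNIQUE (provider_id, date, start_time),
        UNIQUE (room_id, date, start_time)
    )
""")

def try_book(provider_id: str, room_id: str, date: str, start_time: str) -> str:
    """INSERT and let the database decide the winner; no application lock."""
    try:
        conn.execute(
            "INSERT INTO bookings (provider_id, date, start_time, room_id)"
            " VALUES (?, ?, ?, ?)",
            (provider_id, date, start_time, room_id),
        )
        return "CONFIRMED"
    except sqlite3.IntegrityError:
        return "SLOT_UNAVAILABLE"  # constraint violation -> clean error

# 10 attempts for the same slot: exactly one wins.
results = [try_book("dr-smith", "room-1", "2025-06-02", "09:00") for _ in range(10)]
print(results.count("CONFIRMED"), results.count("SLOT_UNAVAILABLE"))  # 1 9
```

Under real concurrency the attempts arrive in parallel rather than in a loop, but the outcome is the same: the constraint, not application code, serializes access to the slot.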
## Decision Log
| Decision | Alternative Rejected | Why |
|---|---|---|
| Real-time slot computation on every request | Pre-generated slot table with sync jobs | A stored slot table needs constant synchronization with every booking, cancellation, availability change, and overbooking rule update. One missed sync produces phantom slots or hides real ones. Computing from constraints means the answer is always current. |
| Database UNIQUE constraints for concurrent booking safety | Application-level mutex or distributed lock via Redis | Race conditions in application locking are subtle and difficult to reproduce in tests. With database constraints, two INSERT statements hitting the same provider+date+time let PostgreSQL decide the winner. No distributed coordination required. |
| 4-weight quality scorer (40% preference, 35% gap, 15% room, 10% overbook) | Sort by earliest available time | Time-only sorting ignores patient preferences, creates scheduling gaps in the provider's day, and treats overbooked slots the same as normal ones. The weighted scorer balances patient convenience against clinic operational efficiency. |
| Fire-and-forget notification events | Synchronous webhook or message queue | Booking operations must never fail because the notification service is down. The hub client swallows all exceptions. A 5-second HTTP timeout prevents blocking. Lost notifications are logged as warnings, not retried. |
| APScheduler in-process | Celery task queue with Redis broker | Two hourly jobs: no-show marker and stats calculator. Both run a single database query. Celery would add three new components (broker, worker, monitoring) for work that completes in under a second. |
| Single Alembic migration for all 6 tables | Incremental migrations per entity | The schema was designed upfront from the PRD specification. All 6 tables were created in one migration. No schema changes have been needed since initial deployment. |
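The 4-weight quality scorer in the decision log reduces to a weighted sum. The sketch below is a hypothetical reconstruction: only the weights (40/35/15/10) come from the brief; the component names and the assumption that each component is normalized to [0, 1] are illustrative.

```python
# Weights from the decision log; everything else is an assumption.
WEIGHTS = {"preference": 0.40, "gap": 0.35, "room": 0.15, "overbook": 0.10}

def score_slot(preference: float, gap: float, room: float, overbook: float) -> float:
    """Weighted sum of components, each assumed normalized to [0, 1].

    preference: fit with the patient's preferred time window
    gap:        how well the slot packs against existing bookings
    room:       room/equipment suitability
    overbook:   1.0 for a normal slot, lower for an overbooked one
    """
    components = {
        "preference": preference,
        "gap": gap,
        "room": room,
        "overbook": overbook,
    }
    return sum(WEIGHTS[name] * value for name, value in components.items())

# A perfect slot scores 1.0; an identical slot outside the preferred
# window loses the full 40% preference weight.
print(round(score_slot(1.0, 1.0, 1.0, 1.0), 3))  # 1.0
print(round(score_slot(0.0, 1.0, 1.0, 1.0), 3))  # 0.6
```

This structure makes the trade-off in the decision log concrete: a time-only sort is the degenerate case where one component gets weight 1.0 and the rest get 0.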