
Sensor Telemetry Engine

7 min read · Kingsley Onoh · View on GitHub

When Manual Monitoring Breaks at 5,000 Readings Per Second

The Situation

Solar installations, temperature sensor arrays, industrial power meters. The hardware is cheap. The monitoring is not. A typical IoT deployment sends one reading per second per sensor per metric. Voltage, temperature, power. One site with 200 sensors generates 600 messages per second. Five sites hit 3,000. And at that volume, the monitoring process that worked at 10 sensors (check a dashboard, set a Grafana alert, wait for an email) fails silently. Alerts fire late. Batch inserts fall behind. The operator discovers the anomaly when a client calls, not when the system detects it.

Most teams build this pipeline with Python scripts, cron jobs, and InfluxDB. Then they spend months debugging message loss under load, rewriting the ingestion path in Go, and discovering that their alerting logic runs on a separate schedule from their data pipeline. The problem is not any single component. The problem is that sensor ingestion, time-series storage, anomaly detection, and alert routing are treated as separate systems when they need to be one.

The Cost of Doing Nothing

At 5 sites with 200 sensors each, a manual monitoring workflow requires at least one full-time operator per shift watching dashboards and responding to threshold breaches. That's roughly EUR 45,000 per year in labor for a single shift. Two shifts (covering business hours plus on-call) doubles it. The labor cost compounds with the response time: a missed voltage spike on a solar panel can mean 4 to 8 hours of undetected power loss before a technician is dispatched. At EUR 0.12/kWh for a commercial installation, 8 hours of downtime across 200 panels losing 1kW each costs roughly EUR 192 per incident. A facility averaging one incident per week loses EUR 9,984 per year in energy alone, on top of the monitoring labor.
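The energy-loss arithmetic above is easy to check directly. A minimal sketch, using only the article's own figures (200 panels, 1 kW lost per panel, 8 hours undetected, EUR 0.12/kWh, one incident per week):

```rust
// Back-of-envelope incident cost; all inputs are the article's numbers,
// not measurements from a live system.
fn incident_cost_eur(panels: f64, kw_per_panel: f64, hours: f64, eur_per_kwh: f64) -> f64 {
    panels * kw_per_panel * hours * eur_per_kwh
}

fn main() {
    let per_incident = incident_cost_eur(200.0, 1.0, 8.0, 0.12);
    let per_year = per_incident * 52.0; // one incident per week
    println!("EUR {per_incident:.0} per incident, EUR {per_year:.0} per year");
}
```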

Both numbers are dwarfed by the cost of response latency. A system that detects anomalies within one second of a threshold breach and routes alerts to the right escalation path collapses the time between detection and dispatch from hours to minutes. That's where the compounding savings live: fewer incidents escalate, fewer panels sit idle, fewer technicians drive out for problems that could have been caught at 6 AM instead of 2 PM.

What I Built

A single Rust binary that handles the entire pipeline: ingestion via NATS, batched writes to TimescaleDB, inline anomaly detection with configurable rules, and alert routing to downstream notification and workflow systems. No cron jobs. No separate alerting service. No glue scripts.

Sensor devices publish JSON readings to a NATS message broker. The binary's consumer loop deserializes each message, validates the data (no NaN, no empty metrics), resolves the tenant from an API key, and pushes the reading into a batch buffer. When the buffer hits 100 readings or 500 milliseconds pass, it flushes to TimescaleDB with a single bulk write. After each successful flush, the batch passes through the anomaly detector, which evaluates threshold rules and statistical deviation checks, enforces cooldown windows to prevent alert storms, and fires alerts to the Notification Hub and Workflow Engine.
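The validation, flush trigger, and cooldown logic can be sketched in a few functions. The 100-reading and 500 ms thresholds come from the text; the struct fields, names, and cooldown keying are illustrative assumptions, not the binary's actual types:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Illustrative reading shape; the real wire format is not shown here.
struct Reading { device_id: String, metric: String, value: f64 }

// Validation per the pipeline: no NaN (or infinite) values, no empty metrics.
fn is_valid(r: &Reading) -> bool {
    r.value.is_finite() && !r.metric.is_empty()
}

// Flush when the buffer holds 100 readings or 500 ms have elapsed
// since the last flush, whichever comes first.
fn should_flush(buffered: usize, since_last_flush: Duration) -> bool {
    buffered >= 100 || since_last_flush >= Duration::from_millis(500)
}

// Cooldown windows: suppress repeat alerts for the same (device, metric)
// pair until the window has passed, preventing alert storms.
struct Cooldowns {
    window: Duration,
    last_fired: HashMap<(String, String), Instant>,
}

impl Cooldowns {
    fn allow(&mut self, device: &str, metric: &str, now: Instant) -> bool {
        let key = (device.to_owned(), metric.to_owned());
        match self.last_fired.get(&key) {
            Some(&t) if now.duration_since(t) < self.window => false,
            _ => {
                self.last_fired.insert(key, now);
                true
            }
        }
    }
}

fn main() {
    let r = Reading { device_id: "dev-1".into(), metric: "voltage".into(), value: f64::NAN };
    assert!(!is_valid(&r)); // NaN readings are dropped before buffering

    assert!(should_flush(100, Duration::from_millis(10))); // size trigger
    assert!(!should_flush(5, Duration::from_millis(10)));  // neither trigger

    let mut cd = Cooldowns { window: Duration::from_secs(60), last_fired: HashMap::new() };
    let now = Instant::now();
    assert!(cd.allow("dev-1", "voltage", now));  // first alert fires
    assert!(!cd.allow("dev-1", "voltage", now)); // repeat suppressed
    assert!(cd.allow("dev-1", "voltage", now + Duration::from_secs(61))); // window passed
}
```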

The hardest part was making detection non-blocking. The first version ran the anomaly evaluator synchronously after each batch insert. When the Workflow Engine was slow to respond, the entire ingestion pipeline stalled. I redesigned it so every ecosystem call spawns a fire-and-forget Tokio task. The consumer loop never waits for a downstream HTTP response. If the Notification Hub is down, the alert is still persisted in the database. The notification is lost, not the alert.
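The pattern itself is small. In this dependency-free sketch, `std::thread::spawn` stands in for `tokio::spawn` and a sleep stands in for the downstream HTTP call, but the shape is the same: spawn, return immediately, never join on the response:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

// Fire-and-forget dispatch: spawn the downstream call and return at once.
// The production binary spawns a Tokio task around an HTTP POST instead.
fn dispatch_alert(delivered: Arc<AtomicUsize>) {
    thread::spawn(move || {
        // Simulate a slow downstream service (e.g. the Workflow Engine).
        thread::sleep(Duration::from_millis(50));
        delivered.fetch_add(1, Ordering::SeqCst);
    });
    // No join, no await: the consumer loop is already back to ingesting.
}

fn main() {
    let delivered = Arc::new(AtomicUsize::new(0));
    let start = Instant::now();
    for _ in 0..10 {
        dispatch_alert(delivered.clone());
    }
    // The ingestion side finished long before any downstream call did.
    assert!(start.elapsed() < Duration::from_millis(50));
    thread::sleep(Duration::from_millis(300));
    // At-most-once: all ten landed here, but nothing would retry a failure.
    assert_eq!(delivered.load(Ordering::SeqCst), 10);
}
```

The trade is explicit: the caller gets latency isolation from slow downstreams, and gives up delivery guarantees for the notification itself.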

System Flow

Data Model

Architecture Layers

The Decision Log

| Decision | Alternative Rejected | Why |
| --- | --- | --- |
| Rust over Go for the ingestion binary | Go with goroutines | At 5,000 messages/second, garbage-collection pauses compound with batch flush timing. Rust's zero-cost async on Tokio eliminates that variable. The binary fits in 512MB of memory. |
| TimescaleDB over InfluxDB | InfluxDB 2.x | Needed JOIN capability for alert rule lookups. TimescaleDB gives continuous aggregates and retention policies on standard PostgreSQL 16 SQL. No proprietary query language. |
| NATS over Kafka | Apache Kafka | Single-node deployment. Kafka's partition model and ZooKeeper dependency add operational weight the system doesn't need. NATS fits in 256MB. |
| Batch UNNEST over per-row INSERT | Individual inserts | 100 readings in one database round-trip instead of 100. The buffer swap releases the lock before the INSERT runs, keeping lock hold time under a microsecond. |
| DashMap + argon2 over JWT | JSON Web Tokens | Machine-to-machine IoT traffic with no browser and no token refresh. The hot path is an O(1) hash map lookup; argon2 only runs on a cache miss. |
| Fire-and-forget for ecosystem calls | Synchronous HTTP with retry queues | Notification and workflow calls spawn a Tokio task and return immediately. At-most-once delivery. The ingestion pipeline must never block waiting for a downstream response. |
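The batch UNNEST decision above is worth a sketch. Table and column names here are illustrative, and the SQL is shown as a constant rather than wired to a real Postgres client; the point is the buffer swap, which releases the lock before any database work happens:

```rust
use std::mem;
use std::sync::Mutex;

struct Reading { device_id: String, metric: String, value: f64 }

// Swap the full buffer out under the lock and release it immediately;
// the INSERT then runs against the swapped-out Vec with no lock held.
fn drain(buffer: &Mutex<Vec<Reading>>) -> Vec<Reading> {
    mem::take(&mut *buffer.lock().unwrap())
}

// One round-trip: each column binds as a Postgres array and is UNNESTed
// server-side back into rows. (Names are illustrative.)
const BULK_INSERT: &str = "\
    INSERT INTO readings (device_id, metric, value) \
    SELECT * FROM UNNEST($1::text[], $2::text[], $3::float8[])";

// Split a drained batch into the three parallel column arrays to bind.
fn columns(batch: &[Reading]) -> (Vec<&str>, Vec<&str>, Vec<f64>) {
    (
        batch.iter().map(|r| r.device_id.as_str()).collect(),
        batch.iter().map(|r| r.metric.as_str()).collect(),
        batch.iter().map(|r| r.value).collect(),
    )
}

fn main() {
    let buffer = Mutex::new(vec![
        Reading { device_id: "dev-1".into(), metric: "voltage".into(), value: 231.4 },
        Reading { device_id: "dev-1".into(), metric: "temp".into(), value: 41.0 },
    ]);
    let batch = drain(&buffer);
    assert!(buffer.lock().unwrap().is_empty()); // buffer ready for new readings
    let (ids, metrics, values) = columns(&batch);
    assert_eq!((ids.len(), metrics.len(), values.len()), (2, 2, 2));
    assert!(BULK_INSERT.contains("UNNEST"));
}
```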

Ecosystem Integration

Alert events flow into a notification hub I built for exactly this pattern. When the anomaly detector fires a critical or warning alert, it constructs an event payload (device ID, metric, value, threshold, severity) and POSTs it to the hub's event ingestion endpoint. The hub handles channel routing: email, Telegram, in-app. The telemetry engine doesn't know or care how the operator gets notified.
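The payload carries exactly the five fields listed above. A minimal sketch, hand-rolling the JSON to stay dependency-free; field names are guesses beyond what the text lists, and the real binary would presumably use a serializer:

```rust
// Alert event as POSTed to the hub's ingestion endpoint; only the five
// fields named in the text, with illustrative names.
struct AlertEvent {
    device_id: String,
    metric: String,
    value: f64,
    threshold: f64,
    severity: &'static str, // "critical" or "warning"
}

impl AlertEvent {
    // Hand-rolled JSON for the sketch; a serializer would do this in practice.
    fn to_json(&self) -> String {
        format!(
            r#"{{"device_id":"{}","metric":"{}","value":{},"threshold":{},"severity":"{}"}}"#,
            self.device_id, self.metric, self.value, self.threshold, self.severity
        )
    }
}

fn main() {
    let ev = AlertEvent {
        device_id: "dev-1".into(),
        metric: "voltage".into(),
        value: 261.3,
        threshold: 250.0,
        severity: "critical",
    };
    let json = ev.to_json();
    assert!(json.contains(r#""severity":"critical""#));
    assert!(json.contains(r#""threshold":250"#));
}
```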

Critical alerts also trigger the workflow engine via an HMAC-signed webhook. The engine can run multi-step escalation DAGs: notify on-call, wait 10 minutes, escalate to the operations manager, create a maintenance ticket. The telemetry engine fires the webhook and moves on. Escalation logic lives in the workflow engine, not in the sensor pipeline.

Both integrations are feature-flagged. The telemetry engine runs standalone with no ecosystem dependencies. Alerts are always persisted in the local database regardless of whether the hub or workflow engine is reachable. The external connections add operator convenience (push notifications, escalation workflows), not correctness.

Results

Before: manual monitoring with dashboard watching and Grafana alerts that fire on a schedule. Threshold breaches discovered when clients call, minutes to hours after the event. After: anomaly detection fires within one second of threshold breach. Alert routing reaches the operator's phone via the Notification Hub within seconds. No manual dashboard watching required.

The system sustains 5,000 readings per second on a single Rust binary running in 512MB. TimescaleDB handles the time-series storage with automatic 1-minute and 1-hour rollup aggregates. Raw readings are retained for 30 days; aggregates are kept indefinitely. The query API automatically selects the right data source based on the requested time range: raw readings for ranges under an hour, minute aggregates for 1 to 24 hours, hour aggregates beyond that.
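The range-based source selection reduces to a pure function. Only the cutoffs come from the text; the table names are made up for the sketch:

```rust
use std::time::Duration;

// Pick the storage tier for a query: raw readings under an hour,
// 1-minute rollups from 1 to 24 hours, hourly rollups beyond that.
fn source_for(range: Duration) -> &'static str {
    const HOUR: u64 = 3600;
    match range.as_secs() {
        s if s < HOUR => "readings",          // raw hypertable
        s if s <= 24 * HOUR => "readings_1m", // continuous aggregate, 1-min buckets
        _ => "readings_1h",                   // continuous aggregate, 1-hour buckets
    }
}

fn main() {
    assert_eq!(source_for(Duration::from_secs(600)), "readings");
    assert_eq!(source_for(Duration::from_secs(6 * 3600)), "readings_1m");
    assert_eq!(source_for(Duration::from_secs(72 * 3600)), "readings_1h");
}
```

Queries against the rollups stay fast regardless of retention, since the aggregates are bounded by bucket count rather than raw reading volume.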

The weak point I'd address first: the fire-and-forget pattern means lost notifications are invisible. If the Notification Hub is down for ten minutes, the operator doesn't know those alerts were never delivered unless they check the Hub's event log. An outbox table (persist the notification payload alongside the alert, then deliver with retries from a background worker) would make notification delivery observable. That's complexity I chose not to add at current scale, but it's the first thing that needs building before this system runs a facility where the on-call engineer needs to trust that critical alerts always reach them.

#rust #timescaledb #nats #iot #anomaly-detection #axum

The full system record for Sensor Telemetry Engine
