Overview
A manufacturing team depended on field telemetry to track equipment behavior and trigger maintenance actions. During network churn, data arrived late, duplicated, or not at all.
Challenge
Payload formats had drifted across firmware generations. Backend jobs treated retried deliveries as new events, so dashboards and alerts showed conflicting state during incidents.
Approach
We traced one critical event path end to end, from device publish to operator alert. Then we introduced versioned payload contracts, idempotency keys, and explicit retry and dead-letter handling. We rolled the changes out site by site, with rollback checks.
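As a rough illustration of how idempotency keys and explicit retry/dead-letter handling can fit together, here is a minimal Python sketch. It is not the team's implementation. The field names (device_id, sequence, type), the RetryingConsumer class, the in-memory seen set, and the three-attempt limit are all assumptions made for the example. A production system would likely keep seen keys in a durable store and use a real dead-letter queue.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


def idempotency_key(device_id: str, sequence: int, event_type: str) -> str:
    """Derive a stable key from fields that stay the same when a device retries."""
    raw = f"{device_id}|{sequence}|{event_type}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()


@dataclass
class RetryingConsumer:
    """Hypothetical consumer: skips duplicates, retries failures, then dead-letters."""
    handler: Callable[[dict], None]
    max_attempts: int = 3  # assumed limit, not taken from the case study
    seen: set = field(default_factory=set)
    dead_letters: list = field(default_factory=list)

    def consume(self, event: dict) -> str:
        key = idempotency_key(event["device_id"], event["sequence"], event["type"])
        if key in self.seen:
            # A retried delivery of an event we already handled: do nothing.
            return "duplicate"

        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.handler(event)
            except Exception as exc:
                last_error = exc
                continue
            # Record the key only after the handler succeeds.
            self.seen.add(key)
            return "processed"

        # Attempts exhausted: park the event for inspection instead of dropping it.
        self.dead_letters.append({"event": event, "error": repr(last_error)})
        return "dead_lettered"
```

In this sketch the key is recorded only after the handler succeeds, so an event that fails every attempt is not marked as seen and can still be processed if it is replayed from the dead-letter list.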
Architecture
The revised pipeline ran through managed ingestion, normalization workers, and deterministic deduplication before events reached alerting and dashboards. Processing stages emitted trace IDs so support could follow one event across the system.
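The normalization and deduplication stages might look something like the Python sketch below. The two payload shapes, the schema_version field, the key names (dev, seq, temp_tenths, reading), and the run_pipeline function are hypothetical, chosen only to show the idea of mapping each firmware generation onto one canonical form before deduplicating. The managed ingestion layer and the alerting sinks are left out.

```python
import uuid


def normalize(payload: dict) -> dict:
    """Map each payload version onto one canonical shape (versions are hypothetical)."""
    version = payload.get("schema_version", 1)
    if version == 1:
        # Assumed older firmware: flat keys, temperature in tenths of a degree.
        return {
            "device_id": payload["dev"],
            "sequence": payload["seq"],
            "temperature_c": payload["temp_tenths"] / 10,
        }
    if version == 2:
        # Assumed newer firmware: nested reading, temperature already in Celsius.
        reading = payload["reading"]
        return {
            "device_id": payload["device_id"],
            "sequence": payload["sequence"],
            "temperature_c": reading["temperature_c"],
        }
    raise ValueError(f"unsupported schema_version: {version}")


def run_pipeline(raw_events: list) -> list:
    """Normalize, then deduplicate so the first occurrence of each event wins."""
    seen = set()
    delivered = []
    for raw in raw_events:
        # Reuse an upstream trace ID if present, otherwise mint one here.
        trace_id = raw.get("trace_id") or uuid.uuid4().hex
        event = normalize(raw)
        key = (event["device_id"], event["sequence"])
        if key in seen:
            continue  # duplicate delivery, possibly in a different payload version
        seen.add(key)
        delivered.append({**event, "trace_id": trace_id})
    return delivered
```

Because deduplication runs on the canonical form, the same reading delivered once in the old format and once in the new one collapses to a single event, and the trace ID attached to the first delivery travels with it to later stages.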
Outcome
During carrier outages, backlog draining became predictable and alert recovery no longer depended on manual replay. Operators trusted live state again during active faults.
Lessons
In connected products, transport behavior and payload contracts are core architecture. If they drift, operational trust collapses fast.