Monitoring 4,000 Devices That Are Never Fully Online: Zabbix 7.0 and the Public Transport Challenge
Most Zabbix deployments have one assumption baked in from the start: the monitored hosts are online. They're servers, network equipment, virtual machines — infrastructure that sits in a rack and stays connected. The monitoring logic,
the alerting thresholds, the availability calculations — all of it assumes a host that goes offline has a problem.
What happens when going offline is normal?
I manage a Zabbix 7.0 LTS deployment monitoring over 4,000 devices installed on public transport vehicles — buses, trams, and metro cars. These devices don't stay connected. They pass through tunnels, move in and out of cellular
coverage, power down at the depot, and reappear on the network hours later. A standard Zabbix setup treats every one of those disappearances as a crisis. Getting this right required rethinking how Zabbix handles availability at scale.
---
The Problem with Passive Monitoring on Moving Targets
Zabbix operates in two fundamental modes for data collection. In passive mode, the Zabbix server polls the agent — it reaches out, asks for data, and expects a response. In active mode, the agent initiates the connection, buffers
collected data locally, and pushes it to the server when connectivity is available.
For static infrastructure, passive mode is fine. For devices on vehicles, passive mode generates an unworkable volume of false alerts. The server polls a device on a bus in a tunnel, gets no response, and logs an unavailability event.
Multiply that by 4,000 devices moving through coverage gaps throughout the day, and the problem queue becomes noise.
The first architectural decision was straightforward: all vehicle devices run Zabbix Agent 2 in active mode. While the vehicle is out of coverage, the agent buffers collected values locally (GPS position updates, onboard system health metrics, door
sensor states) and pushes the accumulated data to the server once the network is available again. No polling failures. No false unavailability events during tunnels or coverage dead zones.
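The buffer-then-push behavior can be pictured with a small toy model. This is only a sketch of the idea, not how Zabbix Agent 2 is implemented; the class name `BufferedSender`, its methods, and the 1,000-value cap are assumptions made for illustration.

```python
from collections import deque


class BufferedSender:
    """Toy model of an active agent: always collect, send only when online."""

    def __init__(self, max_buffered=1000):
        # Once the buffer is full, the oldest values are dropped first.
        self.buffer = deque(maxlen=max_buffered)
        self.delivered = []

    def collect(self, key, value, clock):
        """Store a value locally, regardless of connectivity."""
        self.buffer.append({"key": key, "value": value, "clock": clock})

    def flush(self, online):
        """Send everything buffered if the link is up; return how many were sent."""
        if not online:
            return 0
        sent = len(self.buffer)
        self.delivered.extend(self.buffer)
        self.buffer.clear()
        return sent


sender = BufferedSender()
sender.collect("door.state", 1, clock=100)  # vehicle is in a tunnel
sender.flush(online=False)                  # nothing leaves the vehicle
sender.collect("door.state", 0, clock=130)
sent = sender.flush(online=True)            # coverage is back
```

Values keep their original collection timestamps, so the server sees what happened inside the tunnel rather than a gap followed by a single fresh reading.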
---
Zabbix Proxy: The Depot Advantage
Active agents solve the collection problem, but at 4,000 hosts the Zabbix server faces a surge of inbound connections when vehicles return to the depot at the same time after a night run.
Zabbix 7.0 significantly improved proxy performance and management, and the proxy architecture is central to how this deployment scales. Each major depot runs a local Zabbix Proxy. Vehicles connect to the nearest proxy rather than
directly to the central server. The proxy buffers and forwards data, smoothing out the spike when an entire fleet reconnects after overnight parking.
In Zabbix 7.0, proxy configuration and monitoring were consolidated: you can now see proxy health, buffer status, and data lag directly from the main server interface without separate tooling. At this scale, that visibility matters. A
proxy that's fallen behind on forwarding data is a silent problem; in 7.0, it's a visible one.
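The same check can be scripted against data pulled from the API. The sketch below assumes a list of dictionaries shaped like a `proxy.get` result with `name` and `lastaccess` fields; verify those field names against the API reference for your version. The function name `lagging_proxies`, the 300-second limit, and the depot names are hypothetical. Note that last contact time is only a rough proxy for forwarding lag.

```python
import time


def lagging_proxies(proxies, max_silence_s=300, now=None):
    """Return names of proxies whose last contact is older than max_silence_s."""
    now = time.time() if now is None else now
    return [
        p["name"]
        for p in proxies
        if now - int(p["lastaccess"]) > max_silence_s
    ]


# Hypothetical sample shaped like an API response; not real data.
sample = [
    {"name": "depot-north", "lastaccess": "1700000000"},
    {"name": "depot-south", "lastaccess": "1699990000"},
]
stale = lagging_proxies(sample, max_silence_s=300, now=1700000100)
```

Passing `now` explicitly keeps the function easy to test; in production it would default to the current time.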
---
Tuning Availability: When "Offline" Is Not an Incident
The default Zabbix behavior marks a host unavailable after a handful of missed polls. For a bus going through a 90-second tunnel, that threshold triggers an event, an alert, and a notification — all of which are meaningless.
Zabbix allows tuning of how long a host can stay silent before it is considered unavailable. On the server side these are the UnreachablePeriod, UnreachableDelay, and UnavailableDelay parameters, which govern passive checks; for active agents the comparable lever is the time window in a nodata() trigger. For vehicle devices, these values need to reflect the actual offline
patterns of the fleet. A device that doesn't report for 15 minutes might be at a terminus with patchy signal. A device that hasn't reported for 4 hours when it should be on a route is a genuine incident.
Getting these numbers right requires real data. The first month of this deployment was largely an exercise in observing actual offline durations, mapping them to route geometry and coverage maps, and tuning thresholds accordingly.
Zabbix's historical data made that analysis possible. The result is a setup where genuine failures surface clearly and expected offline periods generate no noise.
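A minimal sketch of that kind of analysis, assuming report timestamps for a device have already been exported from history: it extracts the gaps longer than the expected reporting interval and takes a high percentile as a candidate threshold. The function names, the 60-second interval, and the sample timestamps are illustrative assumptions, not values from this deployment.

```python
import math


def offline_gaps(timestamps, expected_interval_s):
    """Gaps between consecutive reports that exceed the expected interval."""
    ordered = sorted(timestamps)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return [g for g in gaps if g > expected_interval_s]


def percentile(values, pct):
    """Nearest-rank percentile; pct is in the range (0, 100]."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[rank - 1]


# Hypothetical report times in seconds for one device on one route.
reports = [0, 60, 120, 210, 270, 1170, 1230, 1290, 1380]
gaps = offline_gaps(reports, expected_interval_s=60)
threshold = percentile(gaps, 95)
```

A threshold set just above the high percentile of observed gaps treats routine coverage holes as normal while still catching silences that fall well outside the fleet's usual pattern.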
---
Problem Suppression for Scheduled Downtime
Zabbix 7.0 refined problem suppression — the ability to silence alerts for a host or group during a defined window without creating a full maintenance period.
For the overnight depot window, when vehicles are powered down for inspection and the devices are intentionally offline, suppression rules tied to a schedule keep the problem queue clean. The hosts still exist, the availability data
still records the offline period, but operations staff aren't receiving alerts about expected downtime at 3 AM.
The distinction between suppression and maintenance is subtle but important at scale. A maintenance period of the "no data collection" type stops data collection entirely. Suppression keeps collecting and recording, so the data is there for analysis, but holds
the alerts. For a fleet where understanding actual overnight behavior has value, that distinction matters.
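The scheduling logic behind such a rule is simple, but the window crosses midnight, which is easy to get wrong. The sketch below shows one way to express it; it is not Zabbix code, and the function names and the 23:30 to 04:30 window are assumptions for illustration.

```python
from datetime import datetime, time


def in_depot_window(moment, start=time(23, 30), end=time(4, 30)):
    """True if moment falls inside a daily window that may cross midnight."""
    t = moment.time()
    if start <= end:
        return start <= t < end
    # Window wraps past midnight: late evening OR early morning.
    return t >= start or t < end


def should_notify(problem_time, host_in_depot_group):
    """The problem is recorded either way; notify only outside the window."""
    return not (host_in_depot_group and in_depot_window(problem_time))
```

For example, `should_notify(datetime(2024, 5, 1, 3, 0), True)` is `False`, while the same host at 14:00 would still raise a notification.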
---
What 4,000 Moving Hosts Teach You About Zabbix
A deployment at this scale, with this kind of connectivity pattern, surfaces assumptions that smaller or more conventional setups never encounter. Zabbix 7.0 handles it well — not out of the box, but with deliberate configuration.
Active agents over passive polling. Proxies at collection points instead of direct server connections. Availability thresholds tuned to real-world offline patterns rather than defaults. Problem suppression for predictable downtime
instead of alert fatigue.
The monitoring logic that works for a server room doesn't work for a bus fleet. The tools are the same. The thinking has to be different.
---
This article was written with the assistance of an AI writing program.
