I Spent Years Building Zabbix Triggers. Then Zabbix 7.0 Changed the Question

 


  Every unusual behavior I've ever seen in production is encoded somewhere in my Zabbix trigger list.


  The device that starts reconnecting every 30 seconds instead of every 2 minutes: there's a trigger for that. The traffic spike that appears when an onboard application enters an error loop: trigger. The connection pattern that changes subtly two hours before a device goes completely offline: I noticed that one after the third incident, built the trigger, and never got caught by it again.


  I manage over 4,000 devices installed on public transport vehicles across a city network. The trigger list I've accumulated over years of watching this fleet misbehave in every possible way is, in some sense, a catalog of everything that has ever gone wrong. Every entry represents an incident, an observation, and a decision to never be surprised by that particular thing again.


  The problem is the things I haven't seen yet.


  ---

  What Manual Triggers Can't Do


  Threshold-based monitoring is fundamentally reactive knowledge. You build a trigger when you understand a failure mode. The trigger catches recurrences of that failure. It tells you nothing about failure modes you haven't encountered.


  For a fleet of 4,000 devices with intermittent connectivity (vehicles moving through coverage gaps, powering down at depots, operating on variable network conditions throughout the day), the space of possible anomalies is large. Rush hour creates different traffic patterns than overnight. Weekday behavior differs from weekend. A device on a cross-city route behaves differently from one on a short urban loop.


  Encoding all of that variation into static thresholds would require not just triggers, but triggers that account for time of day, day of week, route type, and seasonal variation. That's not a trigger list. That's a rules engine built by hand, maintained by hand, and wrong every time the fleet or the network changes.


  Zabbix 7.0 offers a different approach to this problem, built on baseline trigger functions that first shipped in Zabbix 6.0.


  ---

  Baseline Detection: Learning Normal Before Flagging Abnormal

  

  The core concept in Zabbix 7.0's anomaly detection is a shift in what the system measures. Instead of asking "is this value above threshold X?", it asks "is this value significantly different from what it normally is at this time?"


  Zabbix implements this through the baselinewma() and baselinedev() trend functions, available in trigger expressions. Both compare a data period (for example, the last full hour) with the same period in a number of preceding "seasons" such as days or weeks. baselinewma() returns a weighted moving average of those earlier periods, and baselinedev() returns how many standard deviations the latest period lies from them. The result reflects learned normal behavior rather than a fixed limit.


  A trigger built on these functions looks fundamentally different from a conventional threshold trigger. For inbound traffic on a vehicle device, it might read:


  trendavg(/Vehicle Host/net.if.in[eth0],1h:now/h) >
    baselinewma(/Vehicle Host/net.if.in[eth0],1h:now/h,"w",4)
    and baselinedev(/Vehicle Host/net.if.in[eth0],1h:now/h,"w",4) > 3


  This fires when traffic in the last full hour is above the baseline and more than three standard deviations away from it, where the baseline comes from the same hour on the same weekday in the four preceding weeks. The threshold adapts. Rush hour has a higher baseline than 3 AM. Monday differs from Sunday. The trigger knows this because the data knows this.
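
  The arithmetic behind a deviation-based check can be sketched outside Zabbix. The Python below is an illustration only, not Zabbix's implementation: the function name deviations_from_baseline and all sample values are hypothetical, and it uses a plain mean of the earlier periods rather than a weighted average, for simplicity.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviations_from_baseline(current: float, seasons: Sequence[float]) -> float:
    """Return how many population standard deviations `current` lies
    from the mean of the same time slot in earlier seasons."""
    if len(seasons) < 2:
        raise ValueError("need at least two earlier seasons")
    baseline = mean(seasons)
    spread = pstdev(seasons)
    if spread == 0:
        # Every earlier season was identical, so any change is off the scale.
        return 0.0 if current == baseline else float("inf")
    return abs(current - baseline) / spread


# Hypothetical hourly averages (bytes/s) for the same weekday and hour
# over the four preceding weeks.
rush_hour_history = [900.0, 1100.0, 1000.0, 1000.0]
night_history = [90.0, 110.0, 100.0, 100.0]

# The same absolute reading is unremarkable at rush hour and extreme at night.
rush_score = deviations_from_baseline(1050.0, rush_hour_history)
night_score = deviations_from_baseline(1050.0, night_history)
```

  With these made-up numbers, the rush-hour reading scores below one deviation while the night reading scores far above three, which is the distinction a single static threshold cannot make.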


  For connection and traffic monitoring on a mobile fleet, this matters enormously. A device sending twice its normal data volume at 2 AM is a different signal than the same device sending twice its normal volume at 8 AM peak hours. A static threshold treats them identically. A baseline trigger judges each against what is normal for that hour.


  ---

  The Practical Setup for a Vehicle Fleet

  

  Applying anomaly detection to a fleet at this scale requires some architectural decisions.


  Which metrics to baseline first: Not everything benefits equally from anomaly detection. Metrics with consistent patterns (connection frequency, data volume per session, reconnection intervals) are good candidates. Metrics that are already well-covered by existing triggers are lower priority. For vehicle devices, inbound and outbound traffic per connection session and the interval between connections are the highest-value targets.


  Historical data requirements: Baseline functions need sufficient history to learn meaningful patterns. They read trend data, so trend storage must be enabled for the item and must reach back as far as the baseline window does. As a rule of thumb, allow at least two to four weeks of data before relying on baseline triggers. For a new deployment, this means running data collection before enabling baseline-based alerting: the system learns before it judges.
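
  A rough pre-flight check for this can be sketched in Python, assuming a baseline that compares the latest data period with the same period in a number of earlier "seasons" (for example, four preceding weeks). The helper has_enough_history and its parameters are hypothetical, not part of Zabbix, and the estimate is deliberately conservative. It handles only fixed-length season units (hour, day, week).

```python
from datetime import datetime, timedelta

# Fixed-length season units only; months and years vary in length.
SEASON_LENGTH = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "w": timedelta(weeks=1),
}


def has_enough_history(
    first_sample: datetime,
    now: datetime,
    season_unit: str,
    num_seasons: int,
    data_period: timedelta = timedelta(hours=1),
) -> bool:
    """Return True if stored data reaches back far enough to cover
    `num_seasons` earlier seasons plus the period being evaluated."""
    # Two data periods of margin: the one being evaluated, and the
    # partially elapsed current period that is not yet complete.
    required = SEASON_LENGTH[season_unit] * num_seasons + 2 * data_period
    return (now - first_sample) >= required
```

  Something like has_enough_history(first_seen, datetime.now(), "w", 4) could gate when a baseline trigger is enabled for a newly installed device.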


  Complementing existing triggers, not replacing them: The triggers built from years of observed failures remain valuable. They catch known failure modes immediately, without waiting for baseline deviation. Anomaly detection catches the unknown: the novel failure, the new behavior pattern, the thing that hasn't happened before. The two approaches cover different parts of the risk surface.


  Tuning the sensitivity: Three standard deviations is a reasonable starting point but not a fixed rule. For a high-noise metric with natural variability, a tight threshold generates false positives. For a stable metric where any deviation is significant, two standard deviations might be appropriate. The tuning process mirrors what experienced sysadmins do with conventional thresholds, but the baseline moves with the data instead of requiring manual adjustment as conditions change.
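
  One way to choose a multiplier is to replay past data and count how often each candidate would have fired. The sketch below assumes a list of deviation scores has already been computed for past periods. The scores and the function names count_alerts and sweep are hypothetical.

```python
from typing import Dict, Sequence


def count_alerts(scores: Sequence[float], multiplier: float) -> int:
    """Count past periods whose deviation score exceeds the multiplier."""
    return sum(1 for score in scores if score > multiplier)


def sweep(
    scores: Sequence[float],
    multipliers: Sequence[float] = (2.0, 3.0, 4.0),
) -> Dict[float, int]:
    """Map each candidate multiplier to the number of alerts it would raise."""
    return {m: count_alerts(scores, m) for m in multipliers}


# Hypothetical deviation scores for ten past hourly periods of one metric.
scores = [0.4, 1.1, 2.3, 0.8, 3.5, 2.1, 0.2, 4.6, 1.9, 2.7]
alerts = sweep(scores)
```

  On this sample, a multiplier of 2 would have fired in half of the periods, which suggests the metric is too noisy for a threshold that tight.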


  ---

  What Changes, and What Doesn't

  

  Years of manually built triggers represent accumulated operational knowledge. That knowledge doesn't become less valuable when anomaly detection is available. It becomes the foundation on top of which a second layer of detection operates.


  The triggers I've built catch what I've seen. Baseline detection catches what I haven't. Between them, the coverage is qualitatively different from either alone.


  There's also a more subtle benefit: anomaly detection surfaces the questions. When a baseline trigger fires on a metric that has no conventional threshold trigger, it doesn't always indicate a problem. Sometimes it indicates something worth understanding: a new traffic pattern, a configuration change that propagated unevenly, a change in vehicle routing that affected connectivity behavior. The alert is the beginning of an investigation, not necessarily an incident.


  For a fleet that operates continuously, changes constantly, and generates more behavioral data than any threshold list can fully capture, that's a meaningful capability.


  The trigger list keeps growing. Now it has help.


  ---

  This article was written with the assistance of an AI writing program.


