So you have testing up and running, but what happens when a monitor fails a test?
KNM tests a monitor according to the test interval setting defined on the monitor property page, normally between 30 and 60 seconds. Alarm generation and alarm test interval are two properties that come into play when a monitor fails a test.
Alarm generation tells the monitor how many consecutive times it has to fail a test to go from ok state to alarm state. In between those two states there is an intermediate state called “failed”. The failed state is merely a visual indication that something may be about to happen.
Each of these states is represented by a color: green (ok), orange (failed) and red (alarm), as seen below.
Looking deeper at some of the settings: the alarm test interval is the test interval used while the monitor is in alarm state. Tests in alarm state usually run much less often than normal tests; the default is 10 minutes (600 seconds) between tests. The lower test rate is a precaution. Each test uses a common resource called a “thread”, and the number of threads is finite. A failed test may have to wait for a connection timeout of 20 seconds or more, so KNM tests more slowly in alarm state to make sure it does not run out of threads.
The beauty of this setup is not only that it is easy to configure and understand, but also that it acts like a filter: requiring N consecutive failed tests filters away temporary problems, for example a short connection problem that would have generated a false alarm in other programs.
So if we set alarm generation to 5 and the test interval to 10 seconds, it will take 50 seconds for KNM to move the monitor from ok state to alarm state. After that, with the default alarm test interval, it will test the monitor every 10 minutes.
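The state changes described above can be sketched as a small state machine. This is only an illustration, not KNM's implementation: the `Monitor` class and its field names are hypothetical, and it assumes that a single passing test returns the monitor to ok state and that the first failing test runs one test interval after the last good one.

```python
from dataclasses import dataclass


@dataclass
class Monitor:
    """Hypothetical model of a monitor; names are not taken from KNM."""
    alarm_generation: int = 5        # consecutive failures needed to enter alarm state
    test_interval: int = 10          # seconds between tests in ok or failed state
    alarm_test_interval: int = 600   # seconds between tests in alarm state
    failures: int = 0                # current run of consecutive failed tests
    state: str = "ok"                # "ok", "failed" or "alarm"

    def record_test(self, passed: bool) -> int:
        """Update the state from one test result and return the wait until the next test."""
        if passed:
            # Assumption: one passing test is enough to go back to ok.
            self.failures = 0
            self.state = "ok"
        else:
            self.failures += 1
            if self.failures >= self.alarm_generation:
                self.state = "alarm"
            else:
                self.state = "failed"
        if self.state == "alarm":
            return self.alarm_test_interval
        return self.test_interval


monitor = Monitor()
elapsed = 0
wait = monitor.test_interval
for _ in range(5):
    elapsed += wait                    # wait for the next scheduled test
    wait = monitor.record_test(False)  # the test fails
    print(elapsed, monitor.state, "next test in", wait, "seconds")
```

With these settings the monitor shows the failed state for the first four failures and enters alarm state on the fifth, 50 seconds in, after which the wait between tests becomes 600 seconds.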
When KNM switches from ok state to alarm state, and on every failed test after that, a counter called ‘alarm count’ is increased by one. The alarm count plays a central role when executing alarm action lists.
An action list is a series of actions, each numbered with an alarm count. An action is executed when its number matches the monitor's alarm count. This makes action lists work much like a computer program, with a program counter (the alarm count) indicating which instruction (action) to execute next, and that makes action lists very powerful.
The purpose of this is to be able to escalate as the time in alarm state progresses, either to fix the problem automatically or to call for human help.
This can be overwhelming at first: states, parameters, lists, actions and whatnot. That is why we added a report called “simulate alarm”. It tells you exactly what the action list will do, and when, in the context of a particular monitor. You can find the simulate alarm report on the monitor information page.
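To get a feel for what such a timeline looks like, here is a rough sketch of the arithmetic. It is not the actual simulate alarm report: the function `simulate_alarm` and the example action list are made up for illustration, and the sketch assumes that every test keeps failing and that tests run exactly on schedule.

```python
def simulate_alarm(action_list, alarm_generation, test_interval,
                   alarm_test_interval=600):
    """Return (seconds, alarm_count, action) tuples for a monitor that keeps failing.

    action_list is a list of (alarm_count, action_name) pairs.
    Hypothetical helper, not part of KNM.
    """
    # Alarm count 1 is reached when the monitor enters alarm state.
    first_alarm = alarm_generation * test_interval
    timeline = []
    # sorted() is stable, so actions sharing an alarm count keep their listed order.
    for alarm_count, action in sorted(action_list, key=lambda entry: entry[0]):
        # Every later alarm count is one alarm test interval further on.
        when = first_alarm + (alarm_count - 1) * alarm_test_interval
        timeline.append((when, alarm_count, action))
    return timeline


# Hypothetical escalation: notify, try a fix, then call for help.
example_list = [
    (1, "Send mail"),
    (2, "Windows service control"),
    (2, "Send mail"),
    (4, "Send SMS"),
]

for when, count, action in simulate_alarm(example_list, 5, 10):
    print(f"{when:>5} s  alarm count {count}: {action}")
```

In this example nothing is scheduled for alarm count 3, so that test only increases the counter.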
The action lists
There are two types of action lists, and the only difference between them is the way they are executed.
Alarm action lists are executed while the monitor is in alarm state. For each new alarm count, the action in the action list with the same alarm count number is executed. If several actions have the same alarm count number, they are executed in sequence, in the order they are presented on the action list information page.
Recovery action lists are executed when the monitor transitions from alarm state back to ok state, and here is the main difference between the two types of action lists: while the alarm action list is executed step by step according to the current alarm count, the recovery action list is executed from top to bottom all at once.
While the alarm action list presents opportunities for all sorts of escalation, the recovery list's main purpose is to perform actions that notify that the problem has been remedied.
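The difference in execution can be shown side by side. The two functions below are hypothetical helpers written for this post, not KNM code, and they only return the names of the actions that would run instead of performing them.

```python
def run_alarm_step(alarm_list, alarm_count):
    """Return the actions numbered with this alarm count, in listed order."""
    return [action for count, action in alarm_list if count == alarm_count]


def run_recovery(recovery_list):
    """Return every recovery action, top to bottom, in a single pass."""
    return list(recovery_list)


# Hypothetical lists using action names from the list of available actions.
alarm_list = [
    (1, "Send mail"),
    (2, "Windows service control"),
    (3, "Send SMS"),
]
recovery_list = ["Send mail", "Send SMS"]

# Alarm list: one step per alarm count.
for alarm_count in (1, 2, 3):
    print("alarm count", alarm_count, "->", run_alarm_step(alarm_list, alarm_count))

# Recovery list: everything at once when the monitor returns to ok state.
print("recovered ->", run_recovery(recovery_list))
```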
There are a number of actions at your disposal when you create new action lists. They can be used in both alarm and recovery action lists:
- Clear event log
- Execute command via SSH2
- Execute Lua script
- Execute Windows command
- HTTP Get/Post
- List reset
- Net Send
- Paging via PageGate
- Send mail
- Send SMS
- SNMP Set
- Windows service control
The full action reference can be found here.
Making actions generic
Each action type requires different parameters, such as the name of a Lua script (Lua script action) or which service to restart (Service action). By default the action uses the address of the object, but in some cases it is useful to insert formatting variables into fields to make the action list generic and usable for all monitors.
Example of formatting variables
- %object_destination – IP address or host name of object
- %object_name – name of object
- %monitor_name – name of monitor
The action reference section in the documentation describes in detail which formatting variables can be used with which action.
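As an illustration of how formatting variables make one action list reusable, the sketch below expands the three variables listed above in a message template. The `expand` function, the sample values and the choice to leave unknown variables untouched are assumptions made for this example; KNM performs the substitution itself and may behave differently.

```python
import re


def expand(template, values):
    """Replace %name tokens with entries from values; unknown tokens are left as-is."""
    def substitute(match):
        # match.group(1) is the variable name without the leading "%".
        return values.get(match.group(1), match.group(0))
    return re.sub(r"%([a-z_]+)", substitute, template)


# Hypothetical values for one monitor on one object.
values = {
    "object_destination": "192.0.2.10",
    "object_name": "web01",
    "monitor_name": "Ping",
}

message = "%monitor_name failed on %object_name (%object_destination)"
print(expand(message, values))
```

The same template can then be attached to any monitor, since the object and monitor details are filled in when the action runs.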
Automatic remediation that works in every possible scenario is difficult to achieve, even with a powerful system like Kaseya Network Monitor. It is therefore recommended that all action lists contain an e-mail or SMS notification action, just in case. In combination with operator schedules, the notifications can be limited to those who are working, but more on that in another post.