[Nagiosplug-devel] RFC: new style command arguments for thresholds
Ton Voon
ton.voon at altinity.com
Fri Jan 12 15:52:46 CET 2007
Hi!
I'm canvassing opinions on this change to the developer guidelines
re: command arguments for thresholds. I first brought this up at the
Nagios Conference in Germany
(http://www.netways.de/de/nagios_konferenz/archiv_2006/programm/nagios_plugins/),
but want to make sure there is a consensus on this mailing list.
BACKGROUND
There are three main problems:
1) when you have a check that wants to check multiple "things", the
syntax is confusing. For example, free disk space in check_disk is
-w/-c (in units or percent), but inode checking is -W/-K. In
check_http, -w/-c is for time taken, -m is for page size. This is
inconsistent and not very readable.
2) the output and performance data are inconsistent with what is being
checked. For instance, if I check my disks for inodes, I don't
necessarily want perf data returned about disk free. This clogs up my
graphs and muddies my output.
3) I've started using common routines for threshold parsing and found
that the way that parsing occurs between plugins is inconsistent. For
instance, check_procs -c 1:1 means "critical if not 1 process".
However, check_disk -c 5% means "critical if between 0 and 5%".
Worse, the guidelines define ranges so that the default is to
alert outside a range, which looks wrong.
I tried this test on the audience at the Nagios Conference. Given a
command 'check_stuff -w 30:50 -c 10:30' where the result of "stuff"
is 15, what alert level is raised?
Go on, have a guess!
The answer is Warning. I had two guesses of "Critical" from the crowd
and I think this is because you immediately assume an alert
**within** the range, not outside. I think this needs fixing.
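To make the quiz concrete, here is a minimal sketch of the current
"alert outside the range" behaviour. It assumes a simplified min:max
syntax only; the function names are hypothetical and are not taken
from any plugin.

```python
def outside(value, spec):
    """Old-style check: True when value lies outside min:max (simplified)."""
    low, high = (float(x) for x in spec.split(":"))
    return value < low or value > high


def old_style_status(value, warn, crit):
    """Critical takes precedence over warning, as in the quiz above."""
    if outside(value, crit):
        return "CRITICAL"
    if outside(value, warn):
        return "WARNING"
    return "OK"


# check_stuff -w 30:50 -c 10:30 with a result of 15
print(old_style_status(15, warn="30:50", crit="10:30"))  # prints WARNING
```

15 sits inside 10:30, so no critical is raised, but it is outside
30:50, which raises the warning.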
PROPOSAL
So my proposal is to have a different, but complementary, method of
specifying thresholds:
--metric=crit/warn
The crit and warn ranges are defined as min:max (max is optional,
defaults to +infinity). Alert if the checked value is inside this
range. If you want to alert on the outside of this range, prefix the
range with a caret sign (^).
Crit or warn can be blank, meaning no alert to be specified for that
alert level.
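As a rough illustration of the rules above, the following sketch
evaluates a crit/warn string against a value. Several things here are
my assumptions rather than part of the proposal: the function names,
inclusive range bounds, a bare number meaning "min with no max", and
plain numbers with no unit suffixes.

```python
import math


def parse_range(spec):
    """Return (low, high, invert) for a new-style range, or None if blank."""
    if spec == "":
        return None  # blank means no alert for this level
    invert = spec.startswith("^")  # caret: alert outside the range
    if invert:
        spec = spec[1:]
    low_text, _, high_text = spec.partition(":")
    low = float(low_text)
    high = float(high_text) if high_text else math.inf  # max defaults to +infinity
    return low, high, invert


def alerts(value, spec):
    """True when the value triggers an alert for this range."""
    parsed = parse_range(spec)
    if parsed is None:
        return False
    low, high, invert = parsed
    inside = low <= value <= high  # assumption: bounds are inclusive
    return not inside if invert else inside


def status(value, threshold):
    """Evaluate a 'crit/warn' threshold string against a value."""
    crit, _, warn = threshold.partition("/")
    if alerts(value, crit):
        return "CRITICAL"
    if alerts(value, warn):
        return "WARNING"
    return "OK"


print(status(15, "10:30/30:50"))  # prints CRITICAL
print(status(433, "/0:500"))      # prints WARNING
```

With the new semantics, the quiz value of 15 against a critical range
of 10:30 gives the "Critical" answer that the audience guessed.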
If the metric is specified, then output + perfdata will reflect it. Eg,
check_http --page_size=60K/40K --document_age=5s/3s will give output
of the document age and the page size, but not the certificate age or
the time taken. If you want output and perfdata without checking the
result, specify the metric without any values, eg
check_http --certificate_age.
I think the metric name should be composed of alphanumerics and
underscore only, so it can map to RRD names. If there is a
many-to-many mapping (eg, check_disk, looking at per mountpoint),
prefix the metric name with a key, separated by a colon, eg
check_disk --disk_free=2GB --inode_used=/0:500 -p / -p /var would
have perf output of:
/:disk_free=1.3GB;;2 /:inode_used=433;0:500; /var:disk_free=0.7GB;;2 /var:inode_used=700;0:500;
Whatever processes the perf data can decide how to use the prefix
(save to a separate RRD?).
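For anyone writing a perf data processor, a minimal sketch of
splitting the prefixed labels might look like this. The function name
and the dictionary layout are hypothetical. It relies on metric names
containing only alphanumerics and underscores, so the last colon
before the "=" separates key from metric.

```python
def parse_perfdata(line):
    """Split prefixed perf data into key, metric, value, warn and crit."""
    items = []
    for item in line.split():
        label, _, data = item.partition("=")
        # The last colon in the label separates the key from the metric name.
        key, _, metric = label.rpartition(":")
        # Pad so that missing warn/crit fields come back as empty strings.
        value, warn, crit = (data.split(";") + ["", ""])[:3]
        items.append({"key": key, "metric": metric,
                      "value": value, "warn": warn, "crit": crit})
    return items


sample = ("/:disk_free=1.3GB;;2 /:inode_used=433;0:500; "
          "/var:disk_free=0.7GB;;2 /var:inode_used=700;0:500;")
for entry in parse_perfdata(sample):
    print(entry["key"], entry["metric"], entry["value"])
```

A processor could then group entries by key, for example to write one
RRD per mountpoint.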
COMPLICATIONS
As this is a new command syntax I can see this being acceptable, as
long as the old syntax still works correctly. However, the
performance data part will be a problem for current parsers since I'd
like to redefine the meaning of warn and crit.
One option is that the new perf data is output in XML format. This
might help with structural changes in future. This also ties in with
a request from Gerd Muller of Netways at NagConf where he wanted some
metadata re: the plugin to be available (name=check_disk version=1.80).
Any opinions?
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon