[Nagiosplug-devel] RFC: new style command arguments for thresholds

Ton Voon ton.voon at altinity.com
Fri Jan 12 15:52:46 CET 2007


Hi!

I'm canvassing opinions for this change to the developer guidelines  
re: command arguments to thresholds. I first brought this up at the  
Nagios Conference in Germany
(http://www.netways.de/de/nagios_konferenz/archiv_2006/programm/nagios_plugins/),
but want to
make sure there is a consensus in this mailing list.


BACKGROUND

There are three main problems:

1) when you have a check that wants to check multiple "things", the
syntax is confusing. For example, free disk space in check_disk is
-w/-c (in units or percent), but inode checking is -W/-K. In
check_http, -w/-c is for time taken, -m is for page size. This is
inconsistent and not very readable

2) the output and performance data are inconsistent with what is being
checked. For instance, if I check my disks for inodes, I don't  
necessarily want perf data returned about disk free. This clogs up my  
graphs and muddies my output

3) I've started using common routines for threshold parsing and found  
that the way that parsing occurs between plugins is inconsistent. For  
instance, check_procs -c 1:1 means "critical if not 1 process".  
However, check_disk -c 5% means "critical if between 0 and 5%".  
Worse, the guidelines define ranges so that the default is to
alert outside a range, which looks wrong.

I tried this test on the audience at the Nagios Conference. Given a
command 'check_stuff -w 30:50 -c 10:30' where the result of "stuff"
is 15, what alert level is raised?

Go on, have a guess!

The answer is Warning. I had two guesses of "Critical" from the crowd,
and I think this is because you immediately assume an alert
**within** the range, not outside. I think this needs fixing.
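To make the quiz answer concrete, here is a minimal Python sketch of the
current "alert outside the range" behaviour. It only handles the plain
min:max form with both ends given, and the function names (outside,
legacy_status) are made up for illustration, not taken from any plugin.

```python
def outside(value, spec):
    """Legacy-style test: True when value falls outside 'min:max'."""
    low, high = (float(part) for part in spec.split(":"))
    return value < low or value > high


def legacy_status(value, warn, crit):
    """Critical is tested first, then warning."""
    if outside(value, crit):
        return "CRITICAL"
    if outside(value, warn):
        return "WARNING"
    return "OK"


# check_stuff -w 30:50 -c 10:30 with a result of 15
print(legacy_status(15, warn="30:50", crit="10:30"))  # prints WARNING
```

15 sits inside 10:30, so critical is not raised, but it is outside 30:50,
so warning is.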



PROPOSAL

So my proposal is to have a different, but complementary, method of  
specifying thresholds:

--metric=crit/warn

The crit and warn ranges are defined as min:max (max is optional,  
defaults to +infinity). Alert if the checked value is inside this  
range. If you want to alert on the outside of this range, prefix the
range with a caret sign (^).

Crit or warn can be blank, meaning no threshold is specified for that
alert level.
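A rough Python sketch of how a crit/warn value might be parsed and
evaluated under the proposed rules. Several points are assumptions, not
part of the proposal: bounds are treated as inclusive, a blank min is
treated as -infinity, unit suffixes such as K or GB are not handled, and
all function names are hypothetical.

```python
import math


def parse_range(text):
    """Return (low, high, invert) for one range, or None if it is blank."""
    if text == "":
        return None
    invert = text.startswith("^")  # caret: alert outside the range
    if invert:
        text = text[1:]
    low_text, _, high_text = text.partition(":")
    low = float(low_text) if low_text else -math.inf   # assumption
    high = float(high_text) if high_text else math.inf  # max is optional
    return low, high, invert


def alerts(value, rng):
    """True if value triggers the range; a blank range never alerts."""
    if rng is None:
        return False
    low, high, invert = rng
    inside = low <= value <= high  # inclusive bounds are an assumption
    return inside != invert


def evaluate(value, threshold):
    """threshold is 'crit/warn'; either side may be blank."""
    crit_text, _, warn_text = threshold.partition("/")
    if alerts(value, parse_range(crit_text)):
        return "CRITICAL"
    if alerts(value, parse_range(warn_text)):
        return "WARNING"
    return "OK"


print(evaluate(15, "10:30/30:50"))  # inside the crit range
print(evaluate(5, "^10:30/"))       # outside 10:30, inverted by the caret
```

With this reading, a result of 15 against 10:30/30:50 is Critical, which
is what the audience guessed.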

If a metric is specified, then the output and perfdata will reflect
it. Eg, check_http --page_size=60K/40K --document_age=5s/3s will give
output of the document age and the page size, but not the certificate
age or the time taken. If you want output and perfdata without
checking the result, specify the metric without any values, eg
check_http --certificate_age.
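As an illustration only, a plugin could separate "checked" metrics from
"report only" metrics along these lines. The function name
requested_metrics is hypothetical, and real plugins would use their own
option parsing.

```python
def requested_metrics(args):
    """Map each --metric argument to its threshold string.

    A metric given without '=' maps to None: report it, but do not check it.
    """
    metrics = {}
    for arg in args:
        if not arg.startswith("--"):
            continue  # not a metric-style argument
        name, sep, threshold = arg[2:].partition("=")
        metrics[name] = threshold if sep else None
    return metrics


print(requested_metrics(
    ["--page_size=60K/40K", "--document_age=5s/3s", "--certificate_age"]
))
```

Only the metrics present in the returned mapping would appear in the
output and perfdata.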

I think the metric name should be composed of alphanumerics and  
underscore only, so it can map to RRD names. If there is a
many-to-many mapping (eg, check_disk, looking at per mountpoint), use
a key prefixed at the beginning with a separating colon, eg
check_disk --disk_free=2GB --inode_used=/0:500 -p / -p /var
would have perf output
of:

/:disk_free=1.3GB;;2 /:inode_used=433;0:500; /var:disk_free=0.7GB;;2
/var:inode_used=700;0:500;

Whatever processes the perf data can decide how to use the prefix  
(save to a separate RRD?).
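For example, a perf data processor might group the items by key prefix
roughly like this. This is a sketch, assuming items are separated by
whitespace and follow a label=value;warn;crit layout; the function name
group_perfdata is made up, and values are kept as strings rather than
converted.

```python
from collections import defaultdict


def group_perfdata(perf_output):
    """Group 'key:metric=value;warn;crit' items by their key prefix."""
    grouped = defaultdict(dict)
    for item in perf_output.split():
        label, _, data = item.partition("=")
        # Split the label at its last colon; ranges such as 0:500 live
        # in the data part, so they are not affected.
        key, _, metric = label.rpartition(":")
        fields = data.split(";")
        fields += [""] * (3 - len(fields))  # pad missing warn/crit
        value, warn, crit = fields[:3]
        grouped[key][metric] = {"value": value, "warn": warn, "crit": crit}
    return dict(grouped)


sample = ("/:disk_free=1.3GB;;2 /:inode_used=433;0:500; "
          "/var:disk_free=0.7GB;;2 /var:inode_used=700;0:500;")
for key, metrics in group_perfdata(sample).items():
    print(key, sorted(metrics))
```

Each key ("/" and "/var" here) could then be written to its own RRD.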



COMPLICATIONS

As this is a new command syntax I can see this being acceptable, as  
long as the old syntax still works correctly. However, the  
performance data part will be a problem to current parsers since I'd  
like to redefine the meaning of warn and crit.

One option is that the new perf data is output in XML format. This
might help with structural changes in future. This also ties in with  
a request from Gerd Muller of Netways at NagConf where he wanted some  
metadata re: the plugin to be available (name=check_disk version=1.80).


Any opinions?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon

