[Nagiosplug-devel] check_disk enhancements
Ton Voon
ton.voon at altinity.com
Mon Jul 17 12:25:50 CEST 2006
On 16 Jul 2006, at 04:07, John P. Rouillard wrote:
> In message <C35C4EDC-2D0E-47B1-85AF-A5311CEAEF92 at altinity.com>,
> Ton Voon writes:
>> The biggest problem that I've discovered is that the range
>> specification for -w and -c are inverted from the norm. This was
>> noticed when using the library range checking routines. check_disk -w
>> 10% means alert if freespace is below 10%, but we normally mean to
>> alert if it is outside the range. So, for instance, check_procs -w
>> 1:1 means alert if greater than 1 process.
>
> Specifying the freespace always seemed weird to me. If we defined the
> used space, it would work better with the -w and -c settings.
>
> -w 80 (-w 0:80) - warn if more than 80% of the disk used.
> -c 90 (-c 0:90) - critical if more than 90% of the disk used.
>
> However this would be an incompatible change to the command line that
> doesn't look different from the pre-existing calling format, so it's
> out unless we implement a flag to request this as Gavin said above.
Hands up for guilt. I made a change in 1.4 (?) to make it more
consistent, but obviously my brain was not working correctly. It's only
now, while trying to use a general library routine for parsing
thresholds, that I've realised it is wrong.
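
To make the "alert outside the range" convention concrete, here is a minimal Python sketch of the range rules as described in this thread. It is not the plugin library code: the names Range, parse_range and alerts are made up for illustration, and treating an empty start (":10") as 0 is an assumption.

```python
from dataclasses import dataclass
import math


@dataclass
class Range:
    start: float   # lower bound, may be -inf
    end: float     # upper bound, may be +inf
    inside: bool   # True when the spec began with "@" (alert INSIDE the range)


def parse_range(spec: str) -> Range:
    """Parse a range such as "10", "10:", "1:1", "~:5" or "@0:5"."""
    inside = spec.startswith("@")
    body = spec[1:] if inside else spec
    if ":" in body:
        lo, hi = body.split(":", 1)
        # "~" stands for negative infinity; an empty start is assumed to be 0.
        start = -math.inf if lo == "~" else (float(lo) if lo else 0.0)
        end = float(hi) if hi else math.inf
    else:
        # A single number N is shorthand for 0:N.
        start, end = 0.0, float(body)
    if start > end:
        raise ValueError(f"start is greater than end in {spec!r}")
    return Range(start, end, inside)


def alerts(rng: Range, value: float) -> bool:
    """Return True when value should raise an alert for this range."""
    within = rng.start <= value <= rng.end
    return within if rng.inside else not within
```

Under these rules "10" alerts on a value of 11 but not on 5, so "alert if freespace is below 10%" has to be written "@10", which is the hack mentioned for check_disk.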
>
>> I've got a hack for check_disk (forcing a @ at the beginning of the
>> range, which means to alert inside), but I was wondering if we should
>> introduce a new way of defining thresholds. I'm thinking something
>> like:
>>
>> --freespace="0:5;0:2" (warn if outside 0 to 5, crit if outside 0
>> to 2)
>> --usedspace_percent=";90:100" (no warn, crit if outside 90 to 100)
>> --usedinode="100:;200:" (warn if outside 100 to infinity, crit if
>> outside 200 to infinity)
>>
>> This also matches with perfdata output.
>
> Just a nit first, would the new way be in addition to the old way (-w,
> -c), or replace the old way entirely and report an error if somebody
> tries to use it? I think in addition to is the best for backwards
> compatibility.
I'm going to try and retain the old syntax. However, there's only so
much that can be supported backwards. There are some plugins which
still try to retain backwards compatibility from the Netsaint days.
My feeling is this (and please shout out if you think I am too
harsh/soft): If there are unit tests for the old syntax, and it is not too
much development work to retain and it doesn't break anything we want
to do moving forward, then we'll support the old syntax. Otherwise,
we'll make a note in CHANGES that it will break in a future version,
and then break it.
>
> The -w and -c flags work well if the plugin is only testing for one
> parameter. However a lot of plugins test for multiple parameters. I
> have a couple of home grown plugins that test 10 different parameters
> because the overhead of getting the data is so large that calling the
> program 10 times to just extract a single data item is nuts.
>
> In other cases there can be multiple tests to perform against the data
> from the command and they must all be done at once because the data
> needs to be synchronized for the tests to be meaningful. Using
> tkwatcher <http://www.cs.umb.edu/~rouilj/tkwatcher/> I had some
> instances where there were 30 tests on the output of a single command
> stream. I agree that the current -w -c -W -C threshold setting
> mechanisms don't cut it. So I think something like what you propose
> is needed. I would extend it just a bit however to allow each
> threshold to specify:
>
> warn_list;crit_list
>
> where warn/crit_list is:
>
> warn_list/crit_list range|single[,range|single]
>
> where single is a degenerate form of range implying 0:single just as
> with the current plugins. This way we can support upper and lower
> warning limits. E.G: warn if in the range 10-20 or 80-90, crit if in
> the range 0-10 or 90-100:
>
> --freespace 0:10,20:80,90:100;10:90
>
*cough, cough* [tea flies onto keyboard]. Haven't even finished this
general threshold and John's given me a new requirement!
I think multiple warn/crit ranges are doable. It will look messy, but
I guess this is a fairly advanced option. The biggest problem for me,
is how to specify it on the command line.
A ";" as the range separator is going to cause trouble because Nagios
does not easily allow ";" to get passed to the command line. I'd
prefer not to use quotations because Nagios will need to invoke a
shell to parse, whereas it currently just calls the executable.
I'm thinking maybe the severity separator should be "/".
> This would also work for those cases where we need to exclude the
> middle of a range e.g. when checking discrete values from
> snmp. E.G. 1,2,4 are warning but 3,5 are critical:
>
> --thresh 3:3,5:;1:2,4:4
When I first saw this, I thought the first threshold was the crit
one. Then I thought: crit first makes sense as the crit severity
would be checked first. However, this would conflict with current
performance data. Thoughts?
John's example would not work as written because we'd do an "OR" across
the ranges. So 4 is outside 1:2, so crit would be given. A better way
would be (with crit level first):
--thresh @3,@5/0:
The syntax looks awful, but I don't know how else it can be done
without some Nagios object type stanza definition. I'm thinking that
-v -v -v will print out a human-readable version of the ranges and when
they would be triggered.
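
The "OR" behaviour for a comma-separated list can be sketched in a few lines of Python. This is only an illustration of the semantics discussed here, not existing plugin code; range_alerts and list_alerts are hypothetical names. The sketch spells out @3:3 and @5:5 explicitly, since under the 0:single shorthand a bare @3 would read as 0 to 3.

```python
import math


def range_alerts(spec: str, value: float) -> bool:
    """True when value triggers one range ("N", "A:B", "A:", "~:B", optional "@")."""
    inside = spec.startswith("@")
    body = spec[1:] if inside else spec
    lo, sep, hi = body.partition(":")
    if not sep:
        # A bare N is shorthand for 0:N.
        lo, hi = "0", lo
    start = -math.inf if lo == "~" else float(lo or 0)
    end = float(hi) if hi else math.inf
    within = start <= value <= end
    # "@" means alert inside the range; otherwise alert outside it.
    return within if inside else not within


def list_alerts(list_spec: str, value: float) -> bool:
    """OR semantics: alert as soon as ANY range in the comma list alerts."""
    return any(range_alerts(r, value) for r in list_spec.split(","))


# With OR semantics the list "1:2,4:4" alerts on every value, because no
# value can sit inside both 1:2 and 4:4 at once.
print([v for v in range(1, 6) if list_alerts("1:2,4:4", v)])

# Inverted ("@") ranges only alert on the listed values.
print([v for v in range(1, 6) if list_alerts("@3:3,@5:5", v)])
```

This is why the list has to be written with "@" ranges when the intent is to pick out discrete values.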
> Quips, comments, evasions, questions, answers or suggestions
> welcome. Although I have to say coding my standard parser for shell
> script to deal with the current threshold processing was a bear. This
> enhanced form may be worse.
I find examples to be most illuminating. Looking at the basic use of
check_disk, this simplifies nicely to:
--usedspace_percent=90/80
for critical if usedspace is above 90%, warning if above 80%. Or even:
--usedspace_percent=90
for critical above 90 only.
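
As a rough Python sketch of how such a "crit/warn" value could be evaluated (the "/" separator and crit-first order are the proposal above, nothing is implemented yet; range_alerts and status are hypothetical names):

```python
import math


def range_alerts(spec: str, value: float) -> bool:
    """True when value triggers one range ("N", "A:B", "A:", "~:B", optional "@")."""
    inside = spec.startswith("@")
    body = spec[1:] if inside else spec
    lo, sep, hi = body.partition(":")
    if not sep:
        # A bare N is shorthand for 0:N.
        lo, hi = "0", lo
    start = -math.inf if lo == "~" else float(lo or 0)
    end = float(hi) if hi else math.inf
    within = start <= value <= end
    return within if inside else not within


def status(spec: str, value: float) -> str:
    """Evaluate a "crit/warn" spec; either side may be empty."""
    crit, _, warn = spec.partition("/")
    # Critical is checked first so that it wins when both would alert.
    if crit and range_alerts(crit, value):
        return "CRITICAL"
    if warn and range_alerts(warn, value):
        return "WARNING"
    return "OK"


for used in (50, 85, 95):
    print(used, status("90/80", used))
```

With "90/80" this gives OK at 50, WARNING at 85 and CRITICAL at 95; with just "90" the value 85 stays OK.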
Other examples:
--freespace_units=@600/@400 - critical if between 0 and 600 units,
warning if between 0 and 400 (warning will never appear - should this
be flagged as an error?)
--usedspace_percent=/30: - warn if used space is less than 30%. No crit
--freespace_percent=
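
On the "warning will never appear" case: one possible way to detect it is to check whether every value that would warn would also be critical. The Python sketch below is an assumption about how that could be done, covers only the cases where both ranges are "@" ranges or both are plain ranges, and uses made-up names (parse_range, warn_is_shadowed).

```python
import math


def parse_range(spec: str):
    """Return (start, end, inside) for "N", "A:B", "A:", "~:B", optional "@"."""
    inside = spec.startswith("@")
    body = spec[1:] if inside else spec
    lo, sep, hi = body.partition(":")
    if not sep:
        # A bare N is shorthand for 0:N.
        lo, hi = "0", lo
    start = -math.inf if lo == "~" else float(lo or 0)
    end = float(hi) if hi else math.inf
    return start, end, inside


def warn_is_shadowed(spec: str) -> bool:
    """True when the warn range of a "crit/warn" spec can never be reported."""
    crit_spec, _, warn_spec = spec.partition("/")
    if not crit_spec or not warn_spec:
        return False
    cs, ce, c_in = parse_range(crit_spec)
    ws, we, w_in = parse_range(warn_spec)
    if c_in and w_in:
        # Both alert inside: warn interval lies within the crit interval.
        return cs <= ws and we <= ce
    if not c_in and not w_in:
        # Both alert outside: crit's OK interval lies within warn's OK interval.
        return ws <= cs and ce <= we
    raise NotImplementedError("mixed @ and plain ranges are not handled in this sketch")


print(warn_is_shadowed("@600/@400"))  # True
print(warn_is_shadowed("90/80"))      # False
```

Whether a shadowed warning should be an error or just a note in verbose output is still the open question.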
I'm also thinking that the plugin output and performance data depends
on the thresholds specified. If you run:
./check_disk --usedspace_percent=90/80 --freeinodes=@0:1000 -p /
Then you will get output like:
DISK CRITICAL - used space: / 95% (freeinodes=451658);| /=95%;0:80;0:90;0;100 i-/=451658;;@0:1000
(The i-/ means inodes for /. I'm guessing I don't need to prefix
anything to /).
However, if you run:
./check_disk --freespace_units=@10/@5 --usedspace_percent=90/80 --units=GB
You'll get output like:
DISK WARNING - free space: / 7GB (usedspace=82%);| /=7GB;@0:5;@0:10;0;62
(Only show one set of perf data, depending on what gets specified
first on command line)
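
For reference, the perfdata strings in these examples follow the usual label=value[UOM];warn;crit;min;max layout. A small Python sketch of a formatter (format_perfdata is a hypothetical helper, not part of the plugins; quoting labels that contain spaces is an assumption taken from the plugin guidelines):

```python
def format_perfdata(label, value, uom="", warn="", crit="", minimum="", maximum=""):
    """Build one perfdata item: label=value[UOM];warn;crit;min;max."""
    if " " in label:
        # Labels containing spaces are wrapped in single quotes.
        label = f"'{label}'"
    fields = [f"{label}={value}{uom}", warn, crit, minimum, maximum]
    fields = [str(f) for f in fields]
    # Trailing empty fields are dropped, inner empty fields are kept.
    while len(fields) > 1 and fields[-1] == "":
        fields.pop()
    return ";".join(fields)


print(format_perfdata("/", 95, "%", "0:80", "0:90", 0, 100))
print(format_perfdata("i-/", 451658, crit="@0:1000"))
```

These two calls reproduce the /=95%;0:80;0:90;0;100 and i-/=451658;;@0:1000 items from the first example output.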
I think a major blockage to understanding of the thresholds is that,
intuitively, you want to alert if a value falls INTO a range, rather
than OUTSIDE of the range. So I keep reading that
./check_disk --freespace_percent=10
as "alert if freespace is less than 10 percent", when in fact, it
means "alert if freespace is outside 0 to 10 percent". And I HELP
DEFINE THESE DAMN THINGS! I think this comes from the check_procs
range where
./check_procs -C cron -c 1:1
alerts if there is not 1 process. Is this worth breaking?
Good discussion. I'll try and keep summarising as more opinions come in.
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon