[Nagiosplug-devel] RFC: New threshold syntax
Vonnahme, Nathan
nathan.vonnahme at bannerhealth.com
Tue Apr 1 00:55:57 CEST 2008
> From: nagiosplug-devel-bounces at lists.sourceforge.net [mailto:nagiosplug-devel-bounces at lists.sourceforge.net] On Behalf Of Matthias Eble
> > Ton could imagine some helper functions (cmdline, web pages, google
> > calculator) to verify complex thresholds
>
> That could also be part of the library so every plugin could have a
> dryrun option to print which values would cause what. Based on the
> defined thresholds, (for example x:y) one could test/print what rc the
> values x,y,x+1,x-1,y+1,y-1 would cause.
>
I *really* like that idea, Matthias! It might be tricky for plugins
like check_procs or check_http that are checking multiple parts of
another program's output.
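
A rough sketch of what such a dry run might compute for a single range, in Python. The function names are made up for illustration, the bounds are assumed to be inclusive, and "ALERT" is only a placeholder, since which state an out-of-range value maps to depends on how the threshold is declared:

```python
def boundary_probes(lo, hi):
    """Return the values just below, on, and just above each end of lo..hi."""
    return sorted({lo - 1, lo, lo + 1, hi - 1, hi, hi + 1})


def dry_run(lo, hi):
    """Pair each probe value with the state an inclusive range lo..hi gives it."""
    return [(v, "OK" if lo <= v <= hi else "ALERT")
            for v in boundary_probes(lo, hi)]


if __name__ == "__main__":
    # Hypothetical range 10..20, standing in for the x:y of the example.
    for value, state in dry_run(10, 20):
        print(f"{value:>4} -> {state}")
```

A plugin that checks several metrics would need one such table per metric, which is where check_procs and check_http get awkward.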
> > - Is it necessary to allow multiple ranges per thresh
> >   warn=10:20,50:60?
>
> The Performance data definition doesn't permit this up to now but I
> could imagine some people would like to see this.
I'd say "not necessary" -- there are workarounds (like using inversion,
or two checks) for the few cases where you'd want this, and supporting
it would be complex.
> > - Should thresholds be defined ok/warn rather than warn/crit?
>
> I like the approach but this means not only the syntax is changed.
> People need to start thinking when converting.
We don't need to abandon or break the warn/critical options, although at
some distant point it might be good to move away from the -W and -C
syntax.
Specifying "normal" and flagging exceptions instead of trying to
"enumerate badness" is a good practice in many areas (testing, security,
quality control).
I think it's also how sysadmins think about their systems, right? "On
this machine, the disk is normally 50-80% full." If you only think in
terms of warn/critical, you might only think about the upper boundary,
and have alerts when usage goes over 80%. But if you specify "OK", you
may get an unexpectedly valid alert one day when your disk is suddenly,
mysteriously, 1% full :)
(actually that's also why check_disk's "free space" (instead of used
space) approach has often confused me, though I can see several good
reasons for it)
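
To make the disk example concrete, here is a small Python sketch comparing the two ways of thinking. The 50-80% "OK" range is from the example above; the 80..90 warn range, the 80/90 upper limits, and the rule that anything outside both ranges is critical are assumptions for illustration, not something this thread has settled:

```python
def state_from_upper_limits(value, warn_above, crit_above):
    """'Enumerate badness': only values above a limit raise an alert."""
    if value > crit_above:
        return "CRITICAL"
    if value > warn_above:
        return "WARNING"
    return "OK"


def state_from_ok_warn(value, ok, warn):
    """'Specify normal': OK inside ok, WARNING inside warn, else CRITICAL."""
    if ok[0] <= value <= ok[1]:
        return "OK"
    if warn[0] <= value <= warn[1]:
        return "WARNING"
    return "CRITICAL"


# Percent of disk used; 1 is the "suddenly, mysteriously empty" case.
for used in (1, 65, 85, 97):
    print(used,
          state_from_upper_limits(used, 80, 90),
          state_from_ok_warn(used, (50, 80), (80, 90)))
```

With upper limits only, 1% used stays OK; with an explicit OK range it falls outside everything and alerts.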
> > - Should there be an explicit range limit (10:inf over 10:)
>
> 10:inf or 10::inf looks cleaner to me.
I am always in favor of explicit (10:inf or 10..inf), because it
optimizes reading, which you do more often than writing, and because
newcomers read examples before they write.
> > - Is it favorable to have multiple range styles like
> > 1<x<10 *and* 1:10 *and* ... in parallel?
>
> Not if you ask me.
Agreed! And Thomas is right-- if you hate the supported syntax you can
always write a script or utility to run the plugins or generate the
options for you. The extremely lazy typists out there can also probably
use various macro-like utilities to overcome any gratuitously explicit
characters :)
> Since it looks like the default alerting mechanism will be "inside",
> default range behaviour for plain numbers (X gets 0:X) should be
> reversed, too. So X will result in X:inf instead of 0:X
> Or should we drop those plain thresholds completely?
I'd like to see plain thresholds go away eventually, because in some
existing cases X means 0:X or -inf:X and in others it means X:inf.
Also, I think it's important to get users thinking in terms of ranges
rather than single numbers.
> What about mixing uom-prefix in one range? Might this be needed in the
> future?
-1
> At the moment, my favourite threshold/range definition is following:
> --throughput ok=1..5/M,warn=1..300/M/B
Let's check the readability of some examples:
check_http https://foo.com \
--time ok=0..5/s,warn=5..10/s \
--size ok=3..5/kB \
--ssl-expiry ok=28..inf/d,warn=14..28/d
old: check_procs -w 8096 -c 16182 -C httpd --metric VSZ
new: check_procs -C httpd --vsize ok=0..8096,warn=8096..16182
old: check_procs -w 6:13 -c 4:18 -u mqm -a AKBLD
new: check_procs -u mqm -a AKBLD --count ok=6..13,warn=4..18
(I'm not sure whether that overlapping warn range would work)
old: check_procs -w 1:1 -c 1:1 -C tnslsnr
new: check_procs -C tnslsnr --count ok=1..1
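
For what it's worth, here is a minimal Python sketch of how option values like the ones above might be parsed and evaluated. Everything in it is an assumption: the names parse_thresholds and evaluate are hypothetical, bounds are treated as inclusive, only a single unit suffix is handled (so the "/M/B" form is not), and ranges are tried in the order ok, then warn, with anything else critical. That ordering is one way the overlapping warn range could be made to work:

```python
import re

# One item looks like ok=6..13 or warn=14..28/d; "inf" and "-inf" are allowed.
_NUM = r"-?inf|-?\d+(?:\.\d+)?"
_ITEM = re.compile(rf"^(ok|warn)=({_NUM})\.\.({_NUM})(?:/(\w+))?$")


def parse_thresholds(spec):
    """Turn 'ok=A..B/u,warn=C..D/u' into {state: (low, high, unit)}."""
    result = {}
    for item in spec.split(","):
        match = _ITEM.match(item)
        if match is None:
            raise ValueError(f"cannot parse threshold {item!r}")
        state, low, high, unit = match.groups()
        result[state] = (float(low), float(high), unit)
    return result


def evaluate(value, thresholds):
    """Return the first of 'ok', 'warn' whose range holds value, else 'crit'."""
    for state in ("ok", "warn"):
        if state in thresholds:
            low, high, _unit = thresholds[state]
            if low <= value <= high:
                return state
    return "crit"


if __name__ == "__main__":
    counts = parse_thresholds("ok=6..13,warn=4..18")
    for n in (3, 5, 10, 16, 19):
        print(n, evaluate(n, counts))
```

Because ok is checked first, a count of 10 is ok even though it also lies inside warn=4..18.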