[Nagiosplug-devel] RFC: New threshold syntax
Matthias Eble
matthias.eble at mailing.kaufland-informationssysteme.com
Fri Mar 28 16:19:18 CET 2008
Hi all,
after reading and thinking about this thread for hours, I want to
summarize the discussion so far. I hope I captured the most important
statements and represented the participants' intentions correctly.
So here we go:
Max suggested a syntax like
-w '1min>15:5min>5' -c '15min>15:5min>10'
Which has downsides:
- thresholds need to be quoted properly (no problem for the people on
this list, but annoying anyway)
- it's much harder to read than using longopts: --load1=/15:
--load5=10:/5: --load15=15:
One general aim is that the threshold specification should be as
flexible as possible, also to avoid having to run the same plugin
multiple times to get one job done (like Ton's example of three
check_procs services for testing one process).
Thomas posted links to
http://physics.nist.gov/cuu/Units/prefixes.html
http://physics.nist.gov/cuu/Units/binary.html
containing lists of SI and binary prefixes.
He argues that there should be a list of legal UOMs/prefixes and that
allowing base8 units should be discussed. Maybe some gnulib code
can be used for conversion (Ton). Andreas later noted that 0.2GiB !=
200MiB, while 0.2GB == 200MB, which should be kept in mind.
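Andreas's point can be checked with a few lines of arithmetic. This is only a sketch and not code from any plugin: the prefix tables are a small illustrative subset and the to_base helper is a made-up name. Fractions are used so that the comparison is exact rather than subject to floating-point rounding.

```python
from fractions import Fraction

# Decimal (SI) and binary (IEC) prefix multipliers; illustrative subset only.
SI = {"k": 10**3, "M": 10**6, "G": 10**9}
IEC = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def to_base(value, prefix):
    """Convert a prefixed value (given as text) to the unprefixed base unit."""
    multiplier = {**SI, **IEC}[prefix]
    return Fraction(value) * multiplier


# Decimal prefixes: 0.2 GB and 200 MB are the same number of bytes.
same_decimal = to_base("0.2", "G") == to_base("200", "M")

# Binary prefixes: 0.2 GiB is 204.8 MiB, so it differs from 200 MiB.
same_binary = to_base("0.2", "Gi") == to_base("200", "Mi")
gib_in_mib = to_base("0.2", "Gi") / IEC["Mi"]
```

Whatever conversion code ends up being used, a threshold given with a fractional binary prefix will not line up with the "obvious" decimal-looking value one prefix lower.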
Ton and Thomas agree that Perfdata should be in a fixed UOM and
not the one specified in the thresholds (at least for now).
- changing the threshold UOM will destroy old graphs
- Defining a base unit should be up to the respective plugins and be
as small as possible (sec,bytes,...)
- Thus uom is optional even when no thresholds are defined (like
--load1 to just graph load1)
Using scientific notation is omitted for now.
Ton and Thomas agree on dropping +-inf since the colon implies them.
But Nathan thinks that ranges should always state both sides
explicitly, meaning 10:inf rather than 10:
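To make the two positions concrete, here is a minimal sketch of a range parser that accepts both spellings. It is an assumption for illustration, not the existing plugin range format: parse_range is a hypothetical name, an omitted side is taken as unbounded, and an explicit inf or -inf is accepted as well.

```python
import math


def parse_range(spec):
    """Parse 'start:end' where an omitted side means unbounded.

    Assumption for this sketch: '10:' is 10..+inf, ':10' is -inf..10,
    and an explicit 'inf' / '-inf' is accepted too.
    """
    start_text, sep, end_text = spec.partition(":")
    if not sep:
        raise ValueError(f"range needs a ':' separator: {spec!r}")
    start = float(start_text) if start_text else -math.inf
    end = float(end_text) if end_text else math.inf
    if start > end:
        raise ValueError(f"start exceeds end in {spec!r}")
    return start, end


implicit = parse_range("10:")     # colon implies +inf
explicit = parse_range("10:inf")  # Nathan's preferred spelling
```

With a parser like this both forms yield the same range, so the question is mostly one of readability and of whether the short form should be rejected.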
Andreas could imagine that command lines could become very complex,
confusing users, but he has no better idea either.
Ton could imagine some helper functions (cmdline, web pages, google
calculator) to verify complex thresholds, and Andreas would like to see
a way to shorten --freespace warn=inf:300KB
to --freespace w=inf:300KB.
Andreas also thinks that taking the simplicity away from the
plugins/specs will remove one important advantage of Nagios and that
Ton should be shot :D
Additionally, Ton (who should now fear the next Nagios conference :) and
Andreas state that compatibility should and will be retained, at least
for versions prior to 2.0.
Thomas suggested using getsubopt-style arguments like --metric
min=2,uom_prefix=Ki,uom=b,.. which makes it easier
to keep backward compatibility when introducing new values.
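For readers who don't know getsubopt(3): it splits a comma-separated list of key=value tokens. The sketch below imitates that in Python just to show why new keys are cheap to add. parse_subopts and KNOWN_KEYS are hypothetical names, and treating a bare token as a boolean flag is an assumption of this sketch.

```python
def parse_subopts(arg):
    """Split 'key=value,key=value,flag' into a dict, getsubopt-style.

    Bare tokens without '=' map to True (assumption for this sketch).
    """
    opts = {}
    for token in arg.split(","):
        if not token:
            continue  # tolerate empty tokens such as a trailing comma
        key, sep, value = token.partition("=")
        opts[key] = value if sep else True
    return opts


# Keys the plugin already knows; unknown keys can be rejected or ignored.
KNOWN_KEYS = {"min", "max", "uom_prefix", "uom"}

opts = parse_subopts("min=2,uom_prefix=Ki,uom=b")
unknown = set(opts) - KNOWN_KEYS
```

Adding a new value later only means adding a key to the known set; old command lines keep working because they simply don't mention it.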
Ton summarized that it all comes down to two things:
- range definition
- threshold definition
When it comes to ranges, there are two options: keeping existing ranges
using ':' or some math style 1<=x<=3 containing "quote me" characters
(like Max proposed).
However, multiple styles *might* be possible and could be supported
in parallel.
Thus the options for defining a threshold are (ignoring uom for the
moment):
1) --threshold-time=crit_range/warn_range
2) --threshold name=time,warn=range,crit=range
3) --threshold=time -w range -c range
Thomas thinks about something like
--threshold name=cpu,type=warn,min=0,max=80,inside
which would require another separator if multiple ranges per metric
are (possibly) to be supported.
Andreas also noted that the warn/crit sanity check needs to differ
depending on the plugin: sometimes w < c, sometimes w > c.
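A sketch of what such a plugin-dependent check could look like, assuming simple scalar thresholds. The function name and the higher_is_worse flag are made up for illustration; a real check on ranges would be more involved.

```python
def thresholds_consistent(warn, crit, higher_is_worse=True):
    """Return True if warn is reached no later than crit.

    higher_is_worse=True  -> expect warn <= crit (e.g. load, used space)
    higher_is_worse=False -> expect warn >= crit (e.g. free space)
    """
    return warn <= crit if higher_is_worse else warn >= crit


load_ok = thresholds_consistent(5, 10)                                # w < c
freespace_ok = thresholds_consistent(300, 100, higher_is_worse=False)  # w > c
```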
Ton implemented a showcase of a possible approach in check_procs:
./check_procs -C cron --number=^1:1 --rss-threshold=0:
--vsz-threshold=0: --cpu-threshold=-1:
Currently, however, the non-OK output doesn't state which threshold
was actually exceeded.
Nathan pointed out that it is more intuitive to specify only ok and
warning ranges.
Everything outside them is critical, which Ton thinks is "brilliant".
Something like:
--size_ok=300:500b --size_warn=500b:inf
or
--size=ok(300:500b),warn(500b:inf)
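Nathan's idea boils down to a very small decision rule, sketched here under a few assumptions: ranges are plain (low, high) pairs with inclusive bounds, the ok range is tested first, and classify is a hypothetical helper name. The return values are the usual plugin exit codes.

```python
import math

OK, WARNING, CRITICAL = 0, 1, 2  # usual Nagios plugin exit codes


def classify(value, ok_range, warn_range):
    """Return OK inside ok_range, WARNING inside warn_range, else CRITICAL.

    Ranges are (low, high) tuples with inclusive bounds; ok_range is
    tested first, so a value on a shared boundary counts as OK.
    """
    ok_low, ok_high = ok_range
    warn_low, warn_high = warn_range
    if ok_low <= value <= ok_high:
        return OK
    if warn_low <= value <= warn_high:
        return WARNING
    return CRITICAL


# Mirrors --size_ok=300:500b --size_warn=500b:inf
size_ok = (300, 500)
size_warn = (500, math.inf)
```

With these two ranges a size of 400 is OK, 800 is a warning, and 100 falls outside both and is therefore critical without anyone having to spell out a critical range.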
Nathan added that ':' could be replaced by '..' and that '/' could be
used as a range separator:
--time=ok/0..3/seconds
--freespace=ok/300..inf/KB,warn/100..300/KB
--load=ok/0..2,0..1.5,0..1.2/
--End of summary
So to me there are multiple open questions.
Key questions:
- Must the threshold specification argument be valid without quoting?
- Is it necessary to allow multiple ranges per threshold, e.g.
warn=10:20,50:60?
- Should thresholds be defined ok/warn rather than warn/crit?
- Should plugins only print perfdata for explicitly selected metrics
or should there be a base set?
- Should there be an explicit range limit (10:inf over 10:)?
- Is it favorable to have multiple range styles like
1<x<10 *and* 1:10 *and* ... in parallel?
Further questions:
- Should perfdata inherit the threshold's uom/prefix?
- Replace the range separator ':' with '..'?
- Which component is responsible for sanity checking of thresholds?
- Should base8 UOM-prefixes be allowed?
I'll post my thoughts later on.
Hope this is useful.
Matthias