[Nagiosplug-devel] Working on testcases
Ton Voon
ton.voon at altinity.com
Mon Nov 7 03:59:10 CET 2005
On 7 Nov 2005, at 09:52, Ton Voon wrote:
> Hi!
>
> This is an interesting and important thread and I seem to have got
> some strong opinions, so we should continue with this until we get
> a result.
>
> Just going to summarise where we are:
>
> PROBLEM
>
> While working on testcases, have noticed that "name resolution
> failure" now returns UNKNOWN instead of CRITICAL. What exactly
> should UNKNOWN mean?
>
> VIEWS
>
> John Rouillard suggested command line option for user to choose
> return code, but Ton Voon thinks this would overcomplicate. John
> retracted suggestion.
>
> Garrett Honeycutt suggested configure time option for return code,
> but Andreas Ericsson thinks this is bad because compiled binaries
> should behave identically across platforms. I think the
> "configurable return code" suggestion can be dropped.
>
> John suggests separating "host not found" and "cannot resolve"
> exceptions, so the former is a CRITICAL and the latter is an
> UNKNOWN, which is an interesting idea but I'm not sure what the
> philosophy of this is.
>
> Andreas suggests a new status code in Nagios: "Transport/network
> error", and then UNKNOWN will mean "user error". With no network
> error state supported, Andreas suggests using UNKNOWN.
>
> John's analysis is that there are two functions of a plugin:
>
> 1) communication with device/service
> 2) analysis of device/service and assigning appropriate status
> [and perf data]
>
> MY TAKE
>
> Trying to tie these views together, I think "transport/network"
> errors go into (1). John's suggestion about "host not found" and
> "cannot resolve" goes into (1) as well, but then this suggests there
> is no difference in state.
>
> My feeling is that (2) depends on (1), so if (1) is not possible -
> for ANY reason - then I think that should be a CRITICAL (with
> appropriate message text). I think Nagios helps with the "transport/
> network" error with things like "flapping" and "soft states" (I
> think Nagios works well because it doesn't try and come up with
> lots of different plugin states and just keeps it simple).
>
> I think Garrett summed it up best for me: "I would rather get false
> positives than miss something because the status was UNKNOWN as
> opposed to CRITICAL"
>
> NEXT STEPS
>
> I think we need to bat this around a bit more to get consensus. If
> it gets to the stage where we need a vote, I'm happy to cast one
> out to the community.

Sorry to reply to my own post, but after discussing this with a
colleague, I've changed my mind. I'm interested to hear what other
people think about this suggestion:

- Failure to communicate with the device/service is considered to
  be "UNKNOWN"

The idea is that a Nagios administrator would send UNKNOWN alerts to
themselves, while service owners (eg, DBAs, web masters, network
administrators) would only receive WARNING/CRITICAL alerts. This is
driven by the desire to alert the "right people".

Consider this scenario:

0) The Nagios administrator receives all UNKNOWN alerts
1) The Nagios administrator sets up a "check oracle for number of
   users" servicecheck. This alerts the DBAs on WARNING/CRITICAL if
   the number of users is above a certain threshold
2) This plugin fails with some transport error and thus returns
   UNKNOWN. A notification is sent to the Nagios admin
3) On investigation, the Nagios admin discovers that the host has run
   out of memory. Curses System Administrator. Sets up a new service
   specifically to check that the host has enough memory. Makes
   check_oracle_users depend on check_memory
4) Next time this happens, check_oracle_users returns UNKNOWN,
   check_memory kicks in and returns CRITICAL, so System Administrators
   are notified and the Nagios admin can continue web surfing

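To make the scenario concrete, here is a minimal sketch in Python of
who gets notified at stages (2) and (4). It is only a toy model, not
how Nagios implements notifications: the contact names, the ROUTES
table and the notify() function are all made up for illustration, and
it assumes that a failing parent service simply suppresses
notifications for the services that depend on it.

```python
# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical routing table: which contact hears about which state.
ROUTES = {
    "check_oracle_users": {WARNING: "dba", CRITICAL: "dba",
                           UNKNOWN: "nagios-admin"},
    "check_memory": {WARNING: "sysadmin", CRITICAL: "sysadmin",
                     UNKNOWN: "nagios-admin"},
}


def notify(states, depends_on):
    """Return {service: contact} for the notifications to send.

    states maps service name -> return code.
    depends_on maps a service -> the service it depends on.
    """
    sent = {}
    for service, status in states.items():
        if status == OK:
            continue
        parent = depends_on.get(service)
        # Assumption: a non-OK parent suppresses the dependent's alert.
        if parent is not None and states.get(parent, OK) != OK:
            continue
        sent[service] = ROUTES[service][status]
    return sent


# Stage (2): no memory check exists yet, so the Nagios admin gets the pain.
stage2 = notify({"check_oracle_users": UNKNOWN}, depends_on={})

# Stage (4): the dependency is in place, so the pain moves to the sysadmin.
stage4 = notify(
    {"check_oracle_users": UNKNOWN, "check_memory": CRITICAL},
    depends_on={"check_oracle_users": "check_memory"},
)
```

In a real setup this would presumably be expressed with per-contact
notification options and a service dependency rather than code.
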
I am very interested in "placing the pain" on the right owners. At
stage (2), the pain lands on the Nagios admin because the model of
their infrastructure is incomplete. However, the Nagios admin is the
right owner because they can then make the changes in stage (3). So
next time this specific case happens, the pain lands on the System
Administrator instead.

If transport errors were set to CRITICAL, then the pain at stage (2)
would go to the DBA, who would say "bloody monitoring system".

If you like to think graphically, think of the Nagios server at one
end with the servicecheck (in a red & yellow spot) at the other end.
Everything in between is UNKNOWN. This is the responsibility of the
Nagios Administrator. To reduce their responsibility, they need to
cover as many points in between as possible, which is basically their
job. If everything in between was CRITICAL, then this implies the
responsibility belongs to the service owner, but their primary job is
to keep the service running, not to cover these other scenarios.

So I'm being swayed over to UNKNOWNs for all transport errors. This
would mean that something like check_procs would return UNKNOWN if
the ps command was not available or returned incorrect data, which
fits with this philosophy. Similarly, hostname lookup failures would
come under UNKNOWN.

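As a rough illustration of what this would mean inside a plugin, here
is a minimal sketch in Python (not taken from any existing plugin).
The classify() function and the probe callable are hypothetical names;
the sketch assumes the probe either returns a number or raises OSError
(which covers socket.gaierror for name resolution failures, refused
connections and timeouts) when it cannot get a measurement at all.

```python
import socket

# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(probe, warn, crit):
    """Run probe() and map the outcome to (status, message).

    Function (1), communication, happens inside probe(); function (2),
    analysis, only runs if a value was actually obtained.
    """
    try:
        value = probe()
    except OSError as exc:
        # Transport error of any kind: we know nothing about the service.
        return UNKNOWN, f"UNKNOWN - cannot obtain measurement: {exc}"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"


def lookup_fails():
    # Stand-in for a probe whose hostname lookup fails.
    raise socket.gaierror(-2, "Name or service not known")


status, message = classify(lookup_fails, warn=40, crit=60)
```

Here status ends up as UNKNOWN, so under the routing described above
only the Nagios administrator would hear about it.
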
Extending this philosophy to check_nrpe means that connection
problems to the nrpe agent would be considered UNKNOWN (which makes
sense because the pain goes to the Nagios Administrator). Since
check_nrpe does not actually check anything itself, UNKNOWN is the
only state it should ever raise on its own.

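For the check_nrpe case, the behaviour I have in mind could be
sketched like this in Python. This is not check_nrpe's code: relay()
and remote_check are hypothetical names, and treating an out-of-range
status from the agent as UNKNOWN is my own assumption, by analogy with
the "returned incorrect data" case for check_procs.

```python
# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def relay(remote_check):
    """Pass the remote plugin's result through unchanged.

    remote_check is a callable that returns (status, output) from the
    agent, or raises OSError if the agent cannot be reached.
    """
    try:
        status, output = remote_check()
    except OSError as exc:
        # Connection problem: the only state the relay raises itself.
        return UNKNOWN, f"UNKNOWN - cannot reach agent: {exc}"
    if status not in (OK, WARNING, CRITICAL, UNKNOWN):
        # Assumption: garbage from the agent is also a transport problem.
        return UNKNOWN, f"UNKNOWN - agent returned invalid status {status}"
    return status, output


def agent_down():
    # Stand-in for a refused connection to the agent.
    raise ConnectionRefusedError("connection refused")


result = relay(agent_down)
```

Any WARNING or CRITICAL therefore always originates from the remote
plugin, never from the relay.
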
Make sense? Any other comments?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon