[Nagiosplug-devel] Working on testcases

Ton Voon ton.voon at altinity.com
Mon Nov 7 03:59:10 CET 2005


On 7 Nov 2005, at 09:52, Ton Voon wrote:

> Hi!
>
> This is an interesting and important thread and I seem to have got  
> some strong opinions, so we should continue with this until we get  
> a result.
>
> Just going to summarise where we are:
>
> PROBLEM
>
> While working on testcases, I have noticed that "name resolution
> failure" now returns UNKNOWN instead of CRITICAL. What exactly
> should UNKNOWN mean?
>
> VIEWS
>
> John Rouillard suggested command line option for user to choose  
> return code, but Ton Voon thinks this would overcomplicate. John  
> retracted suggestion.
>
> Garrett Honeycutt suggested configure time option for return code,  
> but Andreas Ericsson thinks this is bad because compiled binaries  
> should behave identically across platforms. I think the  
> "configurable return code" suggestion can be dropped.
>
> John suggests separating "host not found" and "cannot resolve"  
> exceptions, so the former is a CRITICAL and the latter is an  
> UNKNOWN, which is an interesting idea but I'm not sure what the  
> philosophy of this is.
>
> Andreas suggests a new status code in Nagios: "Transport/network  
> error", and then UNKNOWN will mean "user error". With no network  
> error state supported, Andreas suggests using UNKNOWN.
>
> John's analysis is that there are two functions of a plugin:
>
>   1) communication with device/service
>   2) analysis of device/service and assigning appropriate status  
> [and perf data]
>
> MY TAKE
>
> Trying to tie these views together, I think "transport/network"
> errors go into (1). John's suggestions about "host not found" and
> "cannot resolve" go into (1) as well, but then this suggests there
> is no difference in state.
>
> My feeling is that (2) depends on (1), so if (1) is not possible -
> for ANY reason - then I think that should be a CRITICAL (with
> appropriate message text). I think Nagios helps with the
> "transport/network" error with things like "flapping" and "soft
> states" (I think Nagios works well because it doesn't try to come up
> with lots of different plugin states and just keeps it simple).
>
> I think Garrett summed it up best for me: "I would rather get false  
> positives than miss something because the status was UNKNOWN as  
> opposed to CRITICAL"
>
> NEXT STEPS
>
> I think we need to bat this around a bit more to get consensus. If  
> it gets to the stage where we need a vote, I'm happy to cast one  
> out to the community.

Sorry to reply to my own post, but after discussing this with a  
colleague, I've changed my mind. I'm interested to hear what other  
people think about this suggestion:

   - A failure to communicate with the device/service is considered to
be "UNKNOWN"

The idea is that a Nagios administrator would send UNKNOWN alerts to
themselves, while service owners (e.g. DBAs, webmasters, network
administrators) would only receive WARNING/CRITICAL alerts. This is
driven by the desire to alert the "right people".
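
As a rough illustration of the proposal, here is a minimal Python sketch of a plugin split into the two functions John described. The return codes 0 to 3 are the standard Nagios plugin codes; the names `TransportError`, `run_check`, `fetch`, `analyse_users` and the thresholds are made up for this example and are not taken from any existing plugin.

```python
# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


class TransportError(Exception):
    """Hypothetical: raised when the device/service cannot be reached."""


def run_check(fetch, analyse):
    """Return (status, message) for one plugin run.

    fetch   -- hypothetical callable for function (1): talk to the service
    analyse -- hypothetical callable for function (2): turn the fetched
               value into (status, message)
    """
    try:
        value = fetch()
    except TransportError as err:
        # Proposal: we never reached the service, so we cannot judge it.
        return UNKNOWN, "UNKNOWN - cannot communicate: %s" % err
    return analyse(value)


def analyse_users(count, warn=50, crit=100):
    """Example analysis step with illustrative thresholds."""
    if count >= crit:
        return CRITICAL, "CRITICAL - %d users" % count
    if count >= warn:
        return WARNING, "WARNING - %d users" % count
    return OK, "OK - %d users" % count
```

A real plugin would print the message and exit with the status; only the analysis step can produce WARNING or CRITICAL here, so those alerts always describe the service itself.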

Consider this scenario:

   0) Nagios administrator receives all UNKNOWN alerts
   1) A Nagios administrator sets up a "check oracle for number of
users" servicecheck. This alerts the DBAs on WARN/CRIT if the count is
above a certain number
   2) This plugin fails with some transport error and thus returns  
UNKNOWN. Notification sent to Nagios admin
   3) On investigation, Nagios admin discovers that host has run out
of memory. Curses System Administrator. Sets up a new service,
check_memory, specifically to check that host has enough memory. Adds
a dependency so that check_oracle_users depends on check_memory
   4) Next time this happens, check_oracle_users returns UNKNOWN,
check_memory kicks in and returns CRITICAL, so System Administrators
are notified and the Nagios admin can continue web surfing
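
The routing in this scenario can be sketched as a toy model in Python. This is not how Nagios itself decides on notifications; the contact names, the `recipients` function and the simple "skip a service if its parent is failing" rule are assumptions made purely to illustrate stages (2) and (4).

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical contacts, named after the scenario.
CONTACTS = {
    "check_oracle_users": "dbas",
    "check_memory": "sysadmins",
}
NAGIOS_ADMIN = "nagios-admin"


def recipients(states, depends_on):
    """Toy model of who is notified; not Nagios's actual logic.

    states     -- {service: status} from the latest round of checks
    depends_on -- {service: parent service it depends on}
    """
    notified = set()
    for service, status in states.items():
        if status == OK:
            continue
        parent = depends_on.get(service)
        if parent is not None and states.get(parent, OK) != OK:
            # The parent is already failing and alerts its own owners.
            continue
        if status == UNKNOWN:
            notified.add(NAGIOS_ADMIN)
        else:
            notified.add(CONTACTS[service])
    return notified
```

With no dependency configured, an UNKNOWN from check_oracle_users lands on the Nagios admin; once check_memory exists and is CRITICAL, only the System Administrators are notified.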

I am very interested in "placing the pain" with the right owners. At
stage (2), the pain arrives at the Nagios admin because the model of
their infrastructure is incomplete. However, the Nagios admin is the
right owner because they can then make the changes in step (3). So
the next time this specific case happens, the pain arrives at the
System Administrator instead.

If transport errors were set to CRITICAL, then the pain at stage (2)  
would go to the DBA, who would say "bloody monitoring system".

If you like to think graphically, think of the Nagios server at one  
end with the servicecheck (in a red & yellow spot) at the other end.  
Everything in between is UNKNOWN. This is the responsibility of the  
Nagios Administrator. To reduce their responsibility, they need to  
cover as many points in between as possible, which is basically their  
job. If everything in between was CRITICAL, then this implies the  
responsibility belongs to the service owner, but their primary job is  
to keep the service running, not to cover these other scenarios.

So I'm being swayed over to UNKNOWNs for all transport errors. This  
would mean that something like check_procs would return UNKNOWN if  
the ps command was not available or returned incorrect data, which  
fits with this philosophy. Similarly, hostname lookup failures would
come under UNKNOWN.
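
To make the check_procs case concrete, here is a minimal Python sketch. It is not the real check_procs (which is written in C); the function names, the `ps -e` command line and the assumption that the first output line is a header containing "PID" are all illustrative.

```python
import subprocess

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def parse_process_count(output):
    """Count processes in ps-style output (header line plus one per process).

    Assumes the header contains "PID"; raises ValueError otherwise.
    """
    lines = output.splitlines()
    if len(lines) < 2 or "PID" not in lines[0]:
        raise ValueError("output does not look like a process listing")
    return len(lines) - 1


def check_procs(crit, command=("ps", "-e")):
    """Return (status, message); any failure to get usable data is UNKNOWN."""
    try:
        result = subprocess.run(list(command), capture_output=True, text=True)
        if result.returncode != 0:
            raise ValueError("%s exited with %d" % (command[0], result.returncode))
        count = parse_process_count(result.stdout)
    except (OSError, ValueError) as err:
        # ps is missing or returned incorrect data: we know nothing
        # about the processes, so this is not a CRITICAL.
        return UNKNOWN, "UNKNOWN - %s" % err
    if count > crit:
        return CRITICAL, "CRITICAL - %d processes" % count
    return OK, "OK - %d processes" % count
```

A hostname lookup could be handled the same way, by catching the resolver error before any analysis and returning UNKNOWN.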

Extending this philosophy to check_nrpe means that problems connecting
to the nrpe agent would be considered UNKNOWN (which makes sense
because the pain goes to the Nagios Administrator). Since check_nrpe
does not actually check anything itself, UNKNOWN is the only state it
should ever raise on its own.
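
For check_nrpe, the idea might look like the following Python sketch. It is not the real check_nrpe code; `query_agent` is a hypothetical callable standing in for the network round trip, assumed to return the remote plugin's (status, message) or raise OSError (for example ConnectionRefusedError) when the agent cannot be reached.

```python
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_nrpe(query_agent):
    """Pass the remote plugin's result through; raise only UNKNOWN ourselves."""
    try:
        status, message = query_agent()
    except OSError as err:
        # Connection problem between the Nagios server and the agent.
        return UNKNOWN, "UNKNOWN - cannot reach NRPE agent: %s" % err
    if status not in (OK, WARNING, CRITICAL, UNKNOWN):
        # The agent answered, but not with a valid plugin status.
        return UNKNOWN, "UNKNOWN - agent returned invalid status %r" % (status,)
    return status, message
```

WARNING and CRITICAL can then only come from the remote plugin, so they still reach the service owner, while everything that goes wrong in between reaches the Nagios Administrator.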

Make sense? Any other comments?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon





