[Nagiosplug-devel] Working on testcases

Andreas Ericsson ae at op5.se
Mon Nov 7 04:21:55 CET 2005
Ton Voon wrote:
> 
> On 7 Nov 2005, at 09:52, Ton Voon wrote:
> 
>> Hi!
>>
>> This is an interesting and important thread and I seem to have got
>> some strong opinions, so we should continue with this until we get a
>> result.
>>
>> Just going to summarise where we are:
>>
>> PROBLEM
>>
>> While working on testcases, I have noticed that "name resolution
>> failure" now returns UNKNOWN instead of CRITICAL. What exactly should
>> UNKNOWN mean?
>>
>> VIEWS
>>
>> John Rouillard suggested a command line option for the user to choose
>> the return code, but Ton Voon thinks this would overcomplicate things.
>> John retracted the suggestion.
>>
>> Garrett Honeycutt suggested a configure-time option for the return
>> code, but Andreas Ericsson thinks this is bad because compiled binaries
>> should behave identically across platforms. I think the "configurable
>> return code" suggestion can be dropped.
>>
>> John suggests separating "host not found" and "cannot resolve"
>> exceptions, so the former is a CRITICAL and the latter is an UNKNOWN,
>> which is an interesting idea, but I'm not sure what the philosophy
>> behind this is.
>>
>> Andreas suggests a new status code in Nagios: "Transport/network
>> error", and then UNKNOWN will mean "user error". With no network
>> error state supported, Andreas suggests using UNKNOWN.
>>
>> John's analysis is that there are two functions of a plugin:
>>
>>   1) communication with device/service
>>   2) analysis of device/service and assigning appropriate status
>>      [and perf data]
>>
>> MY TAKE
>>
>> Trying to tie these views together, I think "transport/network"
>> errors go into (1). John's "host not found" and "cannot resolve"
>> cases go into (1) as well, but then this suggests there is no
>> difference in state.
>>
>> My feeling is that (2) depends on (1), so if (1) is not possible -
>> for ANY reason - then I think that should be a CRITICAL (with
>> appropriate message text). I think Nagios helps with the
>> "transport/network" error with things like "flapping" and "soft
>> states" (I think Nagios works well because it doesn't try to come up
>> with lots of different plugin states and just keeps it simple).
>>
>> I think Garrett summed it up best for me: "I would rather get false
>> positives than miss something because the status was UNKNOWN as
>> opposed to CRITICAL"
>>
>> NEXT STEPS
>>
>> I think we need to bat this around a bit more to get consensus. If it
>> gets to the stage where we need a vote, I'm happy to cast one out to
>> the community.
> 
> 
> Sorry to reply to my own post, but after discussing this with a
> colleague, I've changed my mind. I'm interested to hear what other
> people think about this suggestion:
> 
>   - Failure to communicate with the device/service is considered to be
>     "UNKNOWN"
> 
> The idea is that a Nagios administrator would send UNKNOWN alerts to
> themselves, while service owners (e.g. DBAs, web masters, network
> administrators) would only receive WARNING/CRITICAL alerts. This is
> driven by the desire to alert the "right people".
> 
> Consider this scenario:
> 
>   0) Nagios administrator receives all UNKNOWN alerts
>   1) A Nagios administrator sets up a "check oracle for number of
>      users" servicecheck. This alerts the DBAs on WARN/CRIT if above a
>      certain number
>   2) This plugin fails with some transport error and thus returns
>      UNKNOWN. Notification sent to Nagios admin
>   3) On investigation, Nagios admin discovers that the host has run out
>      of memory. Curses System Administrator. Sets up a new service
>      specifically to check that the host has enough memory. Makes
>      check_memory a dependency of check_oracle_users
>   4) Next time this happens, check_oracle_users returns UNKNOWN,
>      check_memory kicks in and returns CRITICAL, so System
>      Administrators are notified and the Nagios admin can continue web
>      surfing
> 
> I am very interested in "placing the pain" on the right owners. At
> stage (2), the pain lands on the Nagios admin because the model of
> their infrastructure is incomplete. However, the Nagios admin is the
> right owner because they can then make the changes in step (3). So
> next time this specific case happens, the pain lands on the System
> Administrator instead.
> 
> If transport errors were set to CRITICAL, then the pain at stage (2)  
> would go to the DBA, who would say "bloody monitoring system".
> 
> If you like to think graphically, think of the Nagios server at one
> end with the servicecheck (in a red & yellow spot) at the other end.
> Everything in between is UNKNOWN. This is the responsibility of the
> Nagios Administrator. To reduce their responsibility, they need to
> cover as many points in between as possible, which is basically their
> job. If everything in between was CRITICAL, then this implies the
> responsibility belongs to the service owner, but their primary job is
> to keep the service running, not to cover these other scenarios.
> 
> So I'm being swayed over to UNKNOWNs for all transport errors. This
> would mean that something like check_procs would return UNKNOWN if the
> ps command was not available or returned incorrect data, which fits
> with this philosophy. Similarly, hostname lookup failures would
> come under UNKNOWN.
> 
> Extending this philosophy to check_nrpe means that connection problems
> to the nrpe agent would be considered UNKNOWN (which makes sense
> because the pain goes to the Nagios Administrator) and - since it does
> not actually check anything itself - check_nrpe should only ever raise
> an UNKNOWN exception.
> 
> Make sense? Any other comments?
> 

I'd just like to point out that this is in no way incompatible with the
"transport error" service status, since Nagios by default sets all
out-of-bounds return codes to UNKNOWN.
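
To illustrate, here is a minimal sketch in Python (not taken from any
actual plugin or from the Nagios source) of the two ideas discussed in
this thread: a check that reports UNKNOWN whenever it cannot communicate
with the service, and a mapping that treats any return code outside 0-3
as UNKNOWN. The names check_users, fetch_user_count and
normalize_return_code are hypothetical, and the thresholds in the usage
lines are made up for the example.

```python
import socket

# Standard plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def normalize_return_code(code):
    """Treat any return code outside the 0-3 range as UNKNOWN."""
    return code if code in (OK, WARNING, CRITICAL, UNKNOWN) else UNKNOWN


def check_users(hostname, fetch_user_count, warn, crit,
                resolve=socket.gethostbyname):
    """Hypothetical check: return (code, message) for a user-count test.

    fetch_user_count is an assumed callable that talks to the service
    and returns an integer. Any failure to reach the service, name
    resolution included, is reported as UNKNOWN rather than CRITICAL.
    """
    try:
        address = resolve(hostname)          # may raise socket.gaierror
        users = fetch_user_count(address)    # may raise ConnectionError etc.
    except OSError as exc:                   # both are OSError subclasses
        return UNKNOWN, "UNKNOWN - cannot reach %s: %s" % (hostname, exc)

    # Only once communication worked do the thresholds decide the state.
    if users >= crit:
        return CRITICAL, "CRITICAL - %d users" % users
    if users >= warn:
        return WARNING, "WARNING - %d users" % users
    return OK, "OK - %d users" % users


if __name__ == "__main__":
    # Made-up thresholds and a stub that pretends 12 users are connected.
    code, message = check_users("localhost", lambda addr: 12, 10, 20)
    print(message)
    raise SystemExit(normalize_return_code(code))
```

With this split, a transport failure ends up with whoever receives
UNKNOWN notifications (the Nagios admin in Ton's scenario), while a
threshold breach goes to the service owner.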

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231