[Nagiosplug-devel] Working on testcases
Andreas Ericsson
ae at op5.se
Mon Nov 7 04:21:55 CET 2005
Ton Voon wrote:
>
> On 7 Nov 2005, at 09:52, Ton Voon wrote:
>
>> Hi!
>>
>> This is an interesting and important thread and I seem to have got
>> some strong opinions, so we should continue with this until we get a
>> result.
>>
>> Just going to summarise where we are:
>>
>> PROBLEM
>>
>> While working on testcases, I have noticed that "name resolution
>> failure" now returns UNKNOWN instead of CRITICAL. What exactly should
>> UNKNOWN mean?
>>
>> VIEWS
>>
>> John Rouillard suggested command line option for user to choose
>> return code, but Ton Voon thinks this would overcomplicate. John
>> retracted suggestion.
>>
>> Garrett Honeycutt suggested configure time option for return code,
>> but Andreas Ericsson thinks this is bad because compiled binaries
>> should behave identically across platforms. I think the "configurable
>> return code" suggestion can be dropped.
>>
>> John suggests separating "host not found" and "cannot resolve"
>> exceptions, so the former is a CRITICAL and the latter is an UNKNOWN,
>> which is an interesting idea but I'm not sure what the philosophy of
>> this is.
>>
>> Andreas suggests a new status code in Nagios: "Transport/network
>> error", and then UNKNOWN will mean "user error". With no network
>> error state supported, Andreas suggests using UNKNOWN.
>>
>> John's analysis is that there are two functions of a plugin:
>>
>> 1) communication with device/service
>> 2) analysis of device/service and assigning appropriate status [and
>> perf data]
>>
>> MY TAKE
>>
>> Trying to tie these views together, I think "transport/network"
>> errors go into (1). John's "host not found" and "cannot resolve"
>> cases go into (1) as well, but then this suggests there is
>> no difference in state.
>>
>> My feeling is that (2) depends on (1), so if (1) is not possible -
>> for ANY reason - then I think that should be a CRITICAL (with
>> appropriate message text). I think Nagios helps with the "transport/
>> network" error with things like "flapping" and "soft states" (I think
>> Nagios works well because it doesn't try and come up with lots of
>> different plugin states and just keeps it simple).
>>
>> I think Garrett summed it up best for me: "I would rather get false
>> positives than miss something because the status was UNKNOWN as
>> opposed to CRITICAL"
>>
>> NEXT STEPS
>>
>> I think we need to bat this around a bit more to get consensus. If it
>> gets to the stage where we need a vote, I'm happy to cast one out to
>> the community.
>
>
> Sorry to reply to my own post, but after discussing this with a
> colleague, I've changed my mind. I'm interested to hear what other
> people think about this suggestion:
>
> - Failures to communicate with the device/service are considered to be
> "UNKNOWN"
>
> The idea is that a Nagios administrator would send UNKNOWN alerts to
> themselves, while service owners (eg, DBAs, web masters, network
> administrators) would only receive WARNING/CRITICAL alerts. This is
> driven by the desire to alert the "right people".
>
> Consider this scenario:
>
> 0) Nagios administrator receives all UNKNOWN alerts
> 1) A Nagios administrator sets up a "check oracle for number of
> users" servicecheck. This alerts the DBAs on WARN/CRIT if the count is
> above a certain number
> 2) This plugin fails with some transport error and thus returns
> UNKNOWN. Notification sent to Nagios admin
> 3) On investigation, Nagios admin discovers that host has run out of
> memory. Curses System Administrator. Sets up a new service specifically
> to check that host has enough memory. Makes check_oracle_users
> depend on check_memory
> 4) Next time this happens, check_oracle_users returns unknown,
> check_memory kicks in and returns CRITICAL, so System Administrators
> are notified and the Nagios admin can continue web surfing
>
> I am very interested in "placing the pain" with the right owners. At
> stage (2), the pain lands on the Nagios admin because the model of
> their infrastructure is incomplete. However, the Nagios admin is the
> right owner because then they can make the changes in step (3). So
> next time this specific case happens, the pain lands on the System
> Administrator instead.
>
> If transport errors were set to CRITICAL, then the pain at stage (2)
> would go to the DBA, who would say "bloody monitoring system".
>
> If you like to think graphically, think of the Nagios server at one end
> with the servicecheck (in a red & yellow spot) at the other end.
> Everything in between is UNKNOWN. This is the responsibility of the
> Nagios Administrator. To reduce their responsibility, they need to
> cover as many points in between as possible, which is basically their
> job. If everything in between was CRITICAL, then this implies the
> responsibility belongs to the service owner, but their primary job is
> to keep the service running, not to cover these other scenarios.
>
> So I'm being swayed over to UNKNOWNs for all transport errors. This
> would mean that something like check_procs would return UNKNOWN if the
> ps command was not available or returned incorrect data, which fits
> with this philosophy. Similarly, hostname lookup failures would
> come under UNKNOWN.
>
> Extending this philosophy to check_nrpe means that connection problems
> to the nrpe agent would be considered UNKNOWN (which makes sense
> because the pain goes to the Nagios Administrator) and, since check_nrpe
> does not actually check anything itself, it should only ever raise an
> UNKNOWN exception.
>
> Make sense? Any other comments?
>
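To make the proposal above concrete, here is a minimal sketch of a plugin
that treats any failure to reach the service as UNKNOWN and reserves
WARNING/CRITICAL for the measured value. It is an illustration only, not
code from any existing plugin: check_user_count, fetch_count and the
thresholds are hypothetical names, and only the exit codes 0-3 are the
standard plugin return codes.

```python
import socket

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_user_count(fetch_count, warn, crit):
    """Return (exit_code, message) for a hypothetical user-count check.

    fetch_count is a hypothetical callable that contacts the service and
    returns the current number of users. It is assumed to raise OSError
    (which includes socket.gaierror) on any transport failure.
    """
    try:
        count = fetch_count()
    except OSError as exc:
        # Communication failed, so nothing was measured: report UNKNOWN.
        return UNKNOWN, "UNKNOWN - could not reach service: %s" % exc
    # Communication worked, so the result belongs to the service owner.
    if count >= crit:
        return CRITICAL, "CRITICAL - %d users" % count
    if count >= warn:
        return WARNING, "WARNING - %d users" % count
    return OK, "OK - %d users" % count


def unresolvable():
    # Simulates a name resolution failure without touching the network.
    raise socket.gaierror("name resolution failure")
```

With this split, check_user_count(unresolvable, 50, 100) yields UNKNOWN
for the Nagios admin, while a reachable service with too many users
yields CRITICAL for the DBAs.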
I'd just like to point out that this is in no way incompatible with the
"transport error" service status, since Nagios by default sets all
out-of-bounds return codes to UNKNOWN.
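
A rough sketch of that fallback, assuming the behaviour is as described
here (it is not taken from the Nagios source). The value 4 for a
"transport error" code and the name state_from_exit_code are hypothetical,
chosen only for illustration:

```python
# The four standard plugin return codes and their states.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}

# Hypothetical return code for a future "transport error" state.
TRANSPORT_ERROR = 4


def state_from_exit_code(code):
    """Map a plugin return code to a state name.

    Assumption: any code outside 0-3 is treated as UNKNOWN, as stated
    in this message.
    """
    return STATES.get(code, "UNKNOWN")
```

So a plugin returning the hypothetical code 4 would show up as UNKNOWN on
a Nagios that knows nothing about the new state.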
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231