[Nagiosplug-devel] Adding more advanced correlation to nagios with sec (any interest?)
John P. Rouillard
rouilj at cs.umb.edu
Sat Jun 28 13:17:01 CEST 2003
Hi all:
I just want to say that I am very pleased with nagios, and it makes
my life much easier.
However, there are some things I want to do that are not easily
done within nagios. For example:
If a system jumpstart is in progress, ignore warnings about high
interface usage (on one interface), or dropped packets (on the
hub).
If an index operation of the HTTP server is in progress, ignore
warnings about the http interface being slow.
I also want to show a host/service as down if a given system went down
(as determined by a syslog message), but I want the report made
ONLY if the system isn't back up within 5 minutes.
Note that none of the rebooting, indexing, or jumpstarting operations
occur at fixed times, so I can't schedule these in advance.
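The "report only if it isn't back up in 5 minutes" rule amounts to a
delayed report that a recovery message can cancel. The following is a
minimal sketch of that idea in Python, not sec's own rule syntax; the
class and method names are hypothetical, and it assumes something else
parses the syslog messages and calls these methods.

```python
GRACE = 5 * 60  # seconds a host may stay down before it is reported


class RebootWatch:
    """Remembers hosts seen going down and reports the overdue ones."""

    def __init__(self):
        self.down_at = {}  # host -> time the "going down" message was seen

    def saw_down(self, host, now):
        # Keep the earliest timestamp if the message repeats.
        self.down_at.setdefault(host, now)

    def saw_up(self, host):
        # A recovery inside the grace period cancels the pending report.
        self.down_at.pop(host, None)

    def overdue(self, now):
        """Return hosts that have been down for at least GRACE seconds."""
        return sorted(h for h, t in self.down_at.items() if now - t >= GRACE)
```

A caller would poll overdue() periodically and turn each returned host
into a host or service problem report.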
Other things can sort of be done in nagios, but they are a bit tough
to configure. For example, I have a single snmp_trap service defined
for my hosts. The service is considered volatile, and is
state_stalked. I want to do the following:
If any of a particular range of interfaces on a switch goes down (and
sends a trap), ignore it unless it has gone down/up 3 times in
five minutes. Don't clear it until it has stayed up for at least
15 minutes.
Other interfaces on the same switch should be reported immediately.
I could do part of this by adding every one of my 20 interfaces on the
switch as services, but that doesn't really handle the timing aspects.
It also makes the services a lot more difficult to read and configure.
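The timing rule for the damped interfaces (3 down/up cycles in five
minutes to raise, 15 minutes of stable "up" to clear) can be sketched
as a small state machine. This is an illustration in Python under
stated assumptions, not sec configuration: the names are hypothetical,
and "gone down/up 3 times" is read here as three "down" traps inside
the window.

```python
from collections import deque

WINDOW = 5 * 60     # seconds in which "down" traps are counted
THRESHOLD = 3       # "down" traps needed before a problem is raised
HOLD_UP = 15 * 60   # seconds the interface must stay up before clearing


class FlapFilter:
    """Tracks one interface on the switch."""

    def __init__(self):
        self.downs = deque()   # timestamps of recent "down" traps
        self.reported = False  # True once a problem has been raised
        self.up_since = None   # time of the last "up" trap while still up

    def trap(self, state, now):
        """Feed a trap ('down' or 'up'); return 'CRITICAL' or None."""
        if state == "down":
            self.up_since = None
            self.downs.append(now)
            # Forget "down" traps that fell out of the five-minute window.
            while self.downs and now - self.downs[0] > WINDOW:
                self.downs.popleft()
            if not self.reported and len(self.downs) >= THRESHOLD:
                self.reported = True
                return "CRITICAL"
        elif state == "up":
            self.up_since = now
        return None

    def tick(self, now):
        """Call periodically; return 'OK' once the link has stayed up."""
        if (self.reported and self.up_since is not None
                and now - self.up_since >= HOLD_UP):
            self.reported = False
            self.downs.clear()
            return "OK"
        return None
```

Interfaces outside the damped range would bypass the filter and be
reported on the first trap.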
Another thing I want to do is:
Synthesize an event that notes when two of my three links to
a remote site are having problems. That is, two of my three
routers may be in a warn state, and I want to place the
"Access to 16 net" service in a critical state.
This can be done by event handlers, but you end up writing a portion
of sec to do it, so you might just as well use sec in the first place.
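The two-of-three correlation itself is a small calculation. A minimal
Python sketch follows; the function name and router names are
hypothetical, and returning OK when fewer than two routers are in
trouble is an assumption, since only the critical case is specified
above.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard plugin return codes


def site_access_state(router_states, limit=2):
    """Return CRITICAL when at least `limit` routers are not OK."""
    troubled = sum(1 for state in router_states.values() if state != OK)
    return CRITICAL if troubled >= limit else OK


# Two of three routers warning: the synthesized service goes critical.
links = {"router-a": WARNING, "router-b": WARNING, "router-c": OK}
state = site_access_state(links)
```

The hard part that sec handles is not this calculation but keeping the
per-router states current as events arrive.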
I have a method of integrating sec <http://www.estpak.ee/~risto/sec/>
into nagios to handle these issues and more.
Using sec to process traps (or other passive checks) is
straightforward. The trap collector running from snmptrapd just dumps
the trap report (formatted as a nagios passive service check) into
sec's input fifo; sec then processes it and, if needed, reports it
into the nagios.cmd pipe.
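For reference, a passive service check is a single line in nagios'
external command format (PROCESS_SERVICE_CHECK_RESULT). A minimal
Python sketch of building and submitting such a line is below; the
helper names, the host and service names in the usage note, and the
pipe path passed in are illustrative assumptions.

```python
import time


def passive_service_check(host, service, code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    stamp = int(time.time() if now is None else now)
    return "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        stamp, host, service, code, output)


def submit(line, pipe_path):
    """Write the command to a fifo (sec's input or nagios.cmd)."""
    # pipe_path depends on the local installation; it is not fixed here.
    with open(pipe_path, "w") as pipe:
        pipe.write(line)
```

For example, passive_service_check("switch1", "snmp_trap", 2,
"linkDown on port 7") yields a line reporting a critical result for
the snmp_trap service on switch1. Note that the fields are separated
by semicolons.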
However, for polled items it is more difficult. I don't want the
plugin to determine that there is a problem, nagios to react to that,
and then sec (being fed its info by an event handler) to react in turn
by clearing the service because sec determines that there is not yet a
problem. This leads to a flapping service, as nagios and sec disagree
on what is a true problem. It also leads to spurious notifications,
because I can't put in a high max_check_attempts and still have nagios
respond to sec when it has a real problem (unless I define yet another
service, yech).
What I did was write a plugin in perl (sec_filter) that runs the
nagios command (sort of like check_ssh). It always passes the output
of the plugin to sec's input pipe. However, depending on the flags
given to the sec_filter script, it will exit:
with an "ignore OK" code, and no output
with an "ignore ERROR" code, and no output
with the exit code and output of the plugin
I have chosen an exit status of 5 for "ignore OK" and 6 for "ignore
ERROR". (It looks like code 4 is used internally for pending states,
and I didn't want to use that number, hence my choice of 5 and 6.)
The reason for these new codes is to make nagios not change any status
for the polled service based on the poll. The new status will be sent
to it by a passive check command generated from sec.
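The wrapper behaviour described above can be sketched as follows.
This is an illustration in Python, not the actual perl sec_filter: the
function names, the single "suppress" switch standing in for the
script's flags, the mapping of a non-zero plugin status to "ignore
ERROR", and the line format written to sec's input are all
assumptions.

```python
import subprocess

IGNORE_OK = 5      # plugin said OK; nagios should leave the status alone
IGNORE_ERROR = 6   # plugin saw a problem; nagios should leave the status alone


def filter_result(plugin_code, plugin_output, suppress):
    """Decide what the wrapper hands back to nagios."""
    if not suppress:
        # Pass the plugin's own exit code and output straight through.
        return plugin_code, plugin_output
    # Hide the result from nagios; sec reports the real state later.
    return (IGNORE_OK if plugin_code == 0 else IGNORE_ERROR), ""


def run_filter(plugin_argv, sec_input, suppress):
    """Run the plugin, always copy its result to sec, then filter it."""
    proc = subprocess.run(plugin_argv, stdout=subprocess.PIPE,
                          universal_newlines=True)
    with open(sec_input, "a") as fifo:
        # Hypothetical record format: "<exit code> <plugin output>".
        fifo.write("%d %s\n" % (proc.returncode, proc.stdout.strip()))
    return filter_result(proc.returncode, proc.stdout, suppress)
```

A real wrapper would print the returned output and exit with the
returned code so that nagios sees them as the plugin result.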
That is, I want nagios to be an (almost) dumb poller and to let sec
filter all the data. Using sec provides much better control over flap
detection and multiple service correlation. I say "almost" dumb
because I still want nagios to poll at the retry_check_interval if
the plugin finds a problem. If sec_filter exits with status 6, then
nagios polls at the faster retry interval. This allows sec to better
determine the trouble the system is in, or more easily determine when
the system recovers.
I have set it up so that sec itself is a passive nagios service, and
automatically sends notifications to nagios, as well as nagios being
able to poll the sec service if its data gets stale.
So is anybody interested in my mods (about 30 lines) to nagios to
support this, and my plugin?
Note, there is an issue with sec in that ;'s can't be embedded in its
action commands. This is a problem since nagios' passive commands are
; delimited. A new version of sec (2.1.8) that addresses this issue
should be out once testing is complete.
-- rouilj
John Rouillard
===========================================================================
My employers don't acknowledge my existence much less my opinions.