[Nagiosplug-devel] RFC: Nagios 3 and Embedded Perl Plugins
Thomas Guyot-Sionnest
dermoth at aei.ca
Sat Jan 20 08:25:01 CET 2007
Thanks for the exhaustive answer. It looks very nice :)
On 10/01/07 05:15 AM, Andreas Ericsson wrote:
> Thomas Guyot-Sionnest wrote:
>>> -----Original Message-----
>>> From: nagiosplug-devel-bounces at lists.sourceforge.net
>>> [mailto:nagiosplug-devel-bounces at lists.sourceforge.net] On
>>> Behalf Of Andreas Ericsson
>>> Sent: January 9, 2007 8:39
>>> To: Nagios Plugin Development Mailing List
>>> Subject: Re: [Nagiosplug-devel] RFC: Nagios 3 and Embedded
>>> Perl Plugins
>>>
>>> Thomas Guyot-Sionnest wrote:
>>>> Actually I think now it's getting interesting. If done properly, this
>>>> could be a nice way of doing distributed active checking.
>>>>
>>>> Using the same system Stéphane described, Nagios could have open
>>>> connections to remote execution hosts that run the checks and read
>>>> back the results. Different service properties would determine whether
>>>> the service can be run directly on the host (if Nagios has an open
>>>> connection to it) or whether it has to be remote. Check execution load
>>>> could be run on dedicated servers, or even be spread out across
>>>> monitored hosts.
>>>>
>>> Yes, but a distributed static mesh redundancy thing is pretty different
>>> from an NRPE daemon with an option to keep connections alive. A nice
>>> example of where "think big" doesn't work, but "think bigger" does.
>>>
>>> I'm working on a module that does just that, but it requires a
>>> full-blown Nagios installation on each of the poller nodes, and the
>>> decision of which host is monitored by which system is determined by
>>> hostgroups instead of through some automagic solution that could
>>> possibly (and would probably) get things wrong from time to time.
>> Sounds great. Just to make things clear, my idea wasn't an NRPE replacement but rather an addition. For example, you could have something like this (let's call my thing NRCE, Nag[...] Command Executor):
>>
>> +---------+
>> | Nagios |
>> +---------+
>> |
>> |
>> +---------+
>> | NRCE |
>> +---------+
>> |
>> |
>> +---------+
>> | NRPE |
>> +---------+
>>
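
(To make the NRCE idea a bit more concrete, here is a toy sketch of what
I picture. NRCE doesn't exist, so every name and port below is made up;
it only shows the shape of it: one persistent connection per client, each
line received being a check command that gets executed locally or, in a
real implementation, relayed further down to NRPE or another NRCE host.)

# Hypothetical NRCE sketch -- illustration only, not working software.
import socket
import subprocess

LISTEN_ADDR = ("0.0.0.0", 5665)   # made-up port number

def handle(conn):
    stream = conn.makefile("rwb")
    for line in stream:
        cmd = line.decode().strip()
        if not cmd:
            continue
        # A real NRCE would decide here whether to run the check locally
        # or forward it to the next hop (NRPE, or another NRCE instance).
        res = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        stream.write(("%d\t%s\n" % (res.returncode, res.stdout.strip())).encode())
        stream.flush()

server = socket.create_server(LISTEN_ADDR)
while True:
    conn, _addr = server.accept()
    handle(conn)
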
>
> Again, I believe "think big" isn't enough and "think bigger" is
> required. I'm probably biased though ;-)
>
>> The NRCE host could be a monitored host as well, and remote network checks could come either from Nagios itself or from another NRCE host.
>>
>> It wouldn't require a full-blown Nagios, but your solution has the advantage of cascading checks. With that, and by reducing the logic involved in the main Nagios process, your solution is much more scalable.
>>
>> Do you plan on having the "child" Nagios processes receive their config automatically from the main process? That would simplify the setup a lot.
>>
>
> Yes. Here's how it's supposed to work in a scenario with only one master
> and one poller:
>
> * The poller node starts (or is restarted).
> * It loads the module, which connects to the master and requests config,
>   sending the timestamp of its own config file (there is only one) along
>   with the request.
> * The master checks whether an update is necessary.
>   - If yes:
>     * The master runs an external helper script which extracts the parts
>       of the config that the poller needs to know and feeds them to the
>       poller.
>     * The poller restarts itself and reads the new configuration file.
> * The master disables active checks for the services and hosts monitored
>   by the poller.
> * The poller periodically sends a pulse to the master, telling it that
>   it's alive. A check-result is considered to be an "I'm alive" message.
> * If the poller goes down, the master re-enables checks for the hosts and
>   services previously checked by the poller.
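
(If I read that startup sequence correctly, the poller side boils down to
something like the sketch below. This is only my guess at it -- the port,
wire format and file path are all assumptions, not your module's actual
protocol.)

# Poller-side sketch of the config handshake described above.
# Everything here (port, message format, path) is assumed, not real.
import os
import socket

MASTER = ("master.example.net", 5667)     # hypothetical module port
CONFIG = "/etc/nagios/poller.cfg"         # the poller's single config file

def sync_config():
    mtime = int(os.path.getmtime(CONFIG)) if os.path.exists(CONFIG) else 0
    with socket.create_connection(MASTER) as s:
        # Send our config timestamp; the master answers with a fresh
        # config only if it has something newer, otherwise nothing.
        s.sendall(b"CONFIG_REQUEST %d\n" % mtime)
        new_cfg = s.makefile("rb").read()
    if not new_cfg:
        return False
    with open(CONFIG, "wb") as f:
        f.write(new_cfg)
    return True    # caller restarts Nagios so the new config is loaded
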
>
>
> In the case of more than one master, the check-results are sent to each
> master, and the one with the most recent configuration is considered to
> be the grand-master for the duration of its uptime.
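
(So, if I understand, the election rule is simply "newest config wins" --
in toy form, with made-up timestamps:)

# Among the known masters, the one with the newest config is grand-master.
masters = {"master-a": 1168300000, "master-b": 1168390000}
grand_master = max(masters, key=masters.get)   # -> "master-b"
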
>
> In the case of more than one poller where the pollers are redundant, the
> pollers send check-results to each other as well as to the master, so as
> to save the network load of having to do the check twice (redundant
> poller nodes are expected to be physically close to each other, so
> network traffic between them is of no concern).
>
> In the case of pollers which in turn have pollers underneath them, the
> masters (including the grand-master) are considered tier1, the pollers
> tier2 and the pollers' pollers tier3. In this case, tier2 nodes disable
> checks for hosts/services that are handled by tier3 nodes while those
> tier3 nodes are alive. The tier2 nodes also act as masters for the
> tier3 nodes, while still forwarding all their results (including those
> from the tier3 nodes) to the tier1 nodes.
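
(Again, just to check my understanding of the tier-2 behaviour -- the
names and structures below are invented, nothing from your module:
suppress local checks for anything a live tier-3 node handles, take them
back when that node stops pulsing, and forward every result upward.)

# Sketch of the tier-2 logic as I understand it (illustration only).
tier3_assignments = {"branch-sw1": "tier3-taipei"}   # host -> responsible tier-3 node
live_tier3 = {"tier3-taipei"}                        # nodes that pulsed recently
locally_active = set()                               # hosts this tier-2 node checks itself

def handle_result(result, tier1_links):
    host = result["host"]
    owner = tier3_assignments.get(host)
    if owner in live_tier3:
        locally_active.discard(host)      # a live tier-3 node covers it
    else:
        locally_active.add(host)          # tier-3 down or unassigned: take over
    for send_to_master in tier1_links:
        send_to_master(result)            # everything flows up to tier 1

# Example: forward a tier-3 result; here the "master link" just prints it.
handle_result({"host": "branch-sw1", "rc": 0, "output": "PING OK"}, [print])
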
>
> This is all very neat, because you can use all of the tier1, tier2 and
> tier3 nodes as full-blown Nagios installations by the simple expedient
> of also installing the GUI package on them.
>
> Think of, for example, a large international corporation where there is a
> network operations centre that has the tier 1 nodes that the network
> infrastructure administrators work with. They see the whole picture.
> In each country where there's an office you have a redundant set of tier
> 2 nodes that are responsible for handling all the checks in that
> country. If you check the GUI for one of those, you only see the checks
> for that country, which is most likely appropriate as the admins working
> at that site don't really need the whole picture.
> The tier 3 nodes are placed at the various branch offices and are
> responsible for checking everything there. The branch office IT
> staff can log in to their server (which is close by and has good network
> performance) and see what's important to their network.
>
>
> I realize that the "see what's important to their network" can just as
> easily be done by massaging the config at the NOC a bit, but it's very
> convenient to have several servers with their own GUI, as it comes for
> free and will provide good network speed for the people in Taiwan even
> if the NOC is in Finland.
>
> Note that any or all of the tiers in the above scenario can be redundant
> with as many servers as you like. Also note that this allows an
> arbitrarily large network to be monitored without using some monster of
> a computer, although the master nodes probably require fairly heavy
> equipment to be able to perform reasonably, GUI-wise, with the number of
> monitored nodes one could expect for a network of this size.
>
> The reason it scales more or less indefinitely is that the bottleneck
> (formerly the command-pipe) is much larger (generally 132KiB vs 4KiB)
> and spread out over as many nodes as you like. The *real* bottleneck is,
> in this scenario, the amount of data you can feed to the master-servers,
> which shouldn't be less than 10Mbit/second in a network of this size.
> Most likely it's 100Mbit from tier2 to tier1, which is what matters.
>
> The amount of data required to transmit for each check is, on average,
> 400 bytes (a bunch of timestamps, return-code, plugin output), so
> assuming all bandwidth is used for transmitting check-results, that
> would allow for 3125 checks per second. With an average check-interval
> of 3 minutes, that's 562500 checks for the 10Mbit case. For 100Mbit we
> can use a fifth of the bandwidth and still check well in excess of 1
> million services with very frequent intervals.
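
(For what it's worth, the arithmetic works out; redoing it quickly with
the figures you gave:)

# Back-of-the-envelope redo of the numbers above.
bytes_per_check = 400
interval = 180                                      # 3-minute average check interval

cps_10mbit = 10_000_000 // 8 // bytes_per_check     # 3125 checks/second
print(cps_10mbit * interval)                        # 562500 services at 10 Mbit/s

cps_100mbit_fifth = 100_000_000 // 5 // 8 // bytes_per_check   # 6250 checks/second
print(cps_100mbit_fifth * interval)                 # 1125000 services on a fifth of 100 Mbit
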
>
> Sorry for the long mail.
>