[Nagiosplug-devel] Re: SNMP + Nag Was: Kickoff for 1.5

Stanley Hopcroft Stanley.Hopcroft at IPAustralia.Gov.AU
Thu Mar 10 03:30:01 CET 2005


Dear Folks,

Not much here but,

On Wed, Mar 09, 2005 at 09:21:23PM -0500, Subhendu Ghosh wrote:
> On Wed, 9 Mar 2005, Harper Mann wrote:
> 
> >Hi Everyone,
> >
> >There are several items in an SNMP plugin discussion we're interested in
> >and are working on.  What I can remember off the top of my head is:
> >
> >1) How to manage and alarm on counter data like interface traffic, etc.  We
> >use check_rrd, which was mentioned earlier in this thread, and perhaps
> >that's sufficient since we customarily store and graph, but standardizing
> >this would be good.  We're not sure RRDTool will scale to sufficient size
> >installations.
> >

If the devices support RMON (and most do), then the alarm group 
transforms the problem into one of trap harvesting: define alarm 
thresholds on trunks in the switch/router and have it send traps when 
a threshold is exceeded. The only con is static, non-adaptive 
thresholds.

See below if you want to allow for diurnal/seasonal variation.
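
To make the static-threshold behaviour concrete, here is a minimal 
sketch in Python of rising/falling threshold logic in the style of the 
RMON alarm group. It is a simplification, not device code: the class 
name, the threshold values and the assumption that both alarms are 
armed at startup are mine.

```python
from dataclasses import dataclass


@dataclass
class RmonStyleAlarm:
    """Simplified rising/falling threshold alarm with hysteresis.

    Assumption: both directions are armed at startup. After a rising
    event, no further rising event fires until a falling event has
    occurred, and vice versa.
    """
    rising: float
    falling: float
    armed_rising: bool = True
    armed_falling: bool = True

    def sample(self, value):
        """Return 'rising', 'falling' or None for one sampled value."""
        if value >= self.rising and self.armed_rising:
            self.armed_rising = False
            self.armed_falling = True
            return "rising"
        if value <= self.falling and self.armed_falling:
            self.armed_falling = False
            self.armed_rising = True
            return "falling"
        return None


def events_for(samples, rising, falling):
    """Feed a list of samples through one alarm and collect the events."""
    alarm = RmonStyleAlarm(rising=rising, falling=falling)
    return [alarm.sample(v) for v in samples]
```

The hysteresis is the point: a value hovering just above the rising 
threshold produces one trap, not one per sample.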

> >2) We've had a request to collect 3-4 SNMP values (in, out, errors) from
> >more than 10,000 interfaces every 15 minutes so we're looking into how to
> >scale to such a large installation.  Aside from how to get plugins to keep
> >up with collecting, what's the best way to store so much performance data?
> >
> >3) Fix the performance data so it conforms to the project standards and
> >manages OIDs and Symbolic names well for multiple requests.
> >
> 
> Separate out the functionality  - Nagios is primarily a fault management 
> tool. For 10k interface performance choose a performance 
> management(monitor) tool.
> 

Absolutely. I think the nomenclature is 

1 a poller/collector - interrogates the thingys and saves the data

2 an analyser/presenter - summarises the saved data and reports by 
various means.

These are best implemented as separate processes so they can perform 
without tradeoffs. 

Non-blocking I/O with Net::SNMP outperforms forking a Net::SNMP get.
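
The same idea can be sketched in Python with asyncio: one process keeps 
many requests in flight instead of forking one process per request. 
This is only an illustration. fake_snmp_get is a hypothetical stand-in 
that sleeps instead of talking SNMP, and the max_in_flight limit is an 
assumption of mine.

```python
import asyncio


async def fake_snmp_get(host, oid):
    """Hypothetical stand-in for a non-blocking SNMP GET.

    It only sleeps briefly; a real poller would await an SNMP
    library's asynchronous request here.
    """
    await asyncio.sleep(0.01)
    return (host, oid, 0)


async def poll_all(hosts, oid, max_in_flight=100):
    """Poll every host concurrently from a single process."""
    limit = asyncio.Semaphore(max_in_flight)  # cap outstanding requests

    async def poll_one(host):
        async with limit:
            return await fake_snmp_get(host, oid)

    # gather() returns the results in the same order as the hosts.
    return await asyncio.gather(*(poll_one(h) for h in hosts))
```

Usage would be along the lines of 
asyncio.run(poll_all(hosts, "ifInOctets.1")), with the results handed 
to whatever stores the data.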

Storing data in RRDs has the advantages that

1 Lots of third party applications know and love RRDs (orca, cricket)

2 The Holt-Winters time series prediction algorithm can let the analyser 
distinguish a daily surge from an anomaly/problem

NB Toby the RRD man has got funding from a client to bring the dev 
branch RRD - with the HW stuff - into supported production form.

3 the RRDs are self maintaining. Except in exceptional cases there is no 
need to unload and resize databases when the db fills up (it never does)

4 the storage of an RRD never exceeds what is allocated when the RRD is 
created.
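
To illustrate point 2 above, here is a rough sketch in Python of 
additive Holt-Winters smoothing used to flag samples that fall outside 
a seasonal forecast band. It is not RRDtool's implementation. The 
function name, the smoothing constants, the band width and the min_dev 
floor are all assumptions chosen for illustration.

```python
def holt_winters_flags(series, period, alpha=0.5, beta=0.1, gamma=0.3,
                       band=3.0, min_dev=1.0):
    """Return one bool per sample: True where the sample lies outside
    the seasonal forecast band.

    Assumptions: additive seasonality, the first full season is used
    only for initialisation, and min_dev is a floor so that a perfectly
    regular series does not get a zero-width band.
    """
    if len(series) < period:
        raise ValueError("need at least one full season of data")
    first = series[:period]
    level = sum(first) / period
    trend = 0.0
    season = [x - level for x in first]
    dev = [0.0] * period          # smoothed absolute error per slot
    flags = [False] * period      # initialisation season is never flagged
    for t in range(period, len(series)):
        i = t % period
        x = series[t]
        forecast = level + trend + season[i]
        err = abs(x - forecast)
        flags.append(err > band * max(dev[i], min_dev))
        # Standard additive Holt-Winters updates.
        prev_level = level
        level = alpha * (x - season[i]) + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
        season[i] = gamma * (x - level) + (1 - gamma) * season[i]
        dev[i] = gamma * err + (1 - gamma) * dev[i]
    return flags
```

A surge that recurs at the same point in every period is absorbed into 
the seasonal term and is not flagged, whereas a one-off spike is. Note 
that in this sketch the spike also disturbs the model, so the samples 
immediately after it may be flagged while it recovers.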

> I've been partial to Cricket for snmp data collection - the snmp engine is
> pretty well designed so that each device is only contacted once and all 
> the different oids are requested together. (cricket.sf.net)
> I've seen it scale quite well so long as you can stagger the hosts 
> groups (ie. not everything runs at the same 15 min interval) and you can 
> use snmp v2 and get-bulk
> 
> For alarms - either check_rrd or snmptraps from Cricket (and possibly 
> Cacti in the near future).

Sounds good to me if you can't get RMON (or don't/can't configure your 
devices - although if that were the case, you probably couldn't poll 
them).

> 
> By forcing Nagios to do traffic measurements from snmp - the scalability 
> is not present based on the plugin architecture.  You need something else 
> to do the active monitor and check the results.


Hear, hear. Let Nag present part of the conclusions - it's neat to have 
the plugin output return a hyperlink to an RRDtool or other CGI that 
allows the Nag viewer to display the RRDtool graphs.
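
A minimal sketch in Python of such plugin output, assuming the Nagios 
CGIs render HTML in plugin output rather than escaping it. The CGI path 
/cgi-bin/rrdgraph.cgi and its host/service query parameters are 
hypothetical; only the 0/1/2/3 return codes are the usual plugin 
convention.

```python
from html import escape
from urllib.parse import urlencode

# Conventional plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL",
          UNKNOWN: "UNKNOWN"}


def plugin_line(host, service, state, summary,
                graph_cgi="/cgi-bin/rrdgraph.cgi"):
    """Build one line of plugin output ending in a link to a graph CGI.

    graph_cgi and its query parameters are hypothetical placeholders.
    """
    query = urlencode({"host": host, "service": service})
    url = escape(graph_cgi + "?" + query)  # escape '&' for the attribute
    return f'{service} {LABELS[state]} - {summary} <a href="{url}">graph</a>'
```

A plugin built on this would print the line and exit with the state 
code, for example print(plugin_line(...)) followed by sys.exit(WARNING).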

>  For small installs that 
> don't want multiple tools, it would work, but large installs like yours 
> should definitely use separate tools.

Amen brother.

> 
> I used to monitor about the same number of interfaces with mrtg around 
> '98-'00.  disk i/o was the biggest issue. (ram disk to the rescue).
> 
> RRDtool scales as well as the underlying hardware (disk i/o) and file 
> layout.

The bottleneck is more likely to be in the poller than RRDtool in my 
view (that's why there are fpings and so on).

> 
> -- 
> -sg
> 

Does this adequately sum up what's been presented that's relevant to Nag 
SNMP plugins ?


1 the plugins should probably confine themselves to checking state 
rather than collecting/storing performance data (leaving this to a 
standalone poller that may or may not interact with Nag directly)

2 traffic thresholds are best dealt with by

2.1 standalone poller + analyser submitting passive service check 
results to Nag (possibly via traps to a trap collector), or

2.2 device specific means (RMON)

3 The problems of dealing with large numbers of communities remain, 
although it seems to me that the -C option should go a long way to help 
(maybe in conjunction with a heap of included files defining different 
arguments for commands).

4 Plugins that save/store state probably don't scale and should thereby 
be excluded from developer focus

5 It may be worth recognising that SNMP pollers/managers are a good 
supplement to Nag; the poller is getting close to peak development and 
therefore effort is only needed in exploiting synergy rather than 
seeking to do it again with plugins.
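
For point 2.1, here is a minimal sketch in Python of how a standalone 
analyser might hand a result to Nag as a passive service check through 
the external command file. The command file path is an assumption and 
depends on the installation; the helper names are mine.

```python
import time

# Assumed location of the external command file; check your own install.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def passive_result(host, service, return_code, output, now=None):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    ts = int(time.time() if now is None else now)
    one_line = " ".join(output.splitlines())  # the command must be one line
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{one_line}\n")


def submit(line, command_file=COMMAND_FILE):
    """Append the command to the Nagios command file (a named pipe)."""
    with open(command_file, "a") as pipe:
        pipe.write(line)
```

The analyser decides the state (0 to 3) from the stored data and calls 
submit(passive_result(...)). Nag then only has to present the result, 
not collect the traffic counters itself.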

Yours sincerely.

-- 
Stanley Hopcroft

IP Australia
Ph: (02) 6283 3189  Fax: (02) 6281 1353
PO Box 200 Woden  ACT 2606
http://www.ipaustralia.gov.au