[Nagiosplug-devel] RFC: Performance data guidelines
kjell.sundtjonn at elkem.no
Sat Jul 12 09:43:20 CEST 2003
I really like the idea of including the critical and warning level together
with max and min values in the performance data, but let me propose an
alternative layout based on colon (:) separated fields :
- output of format 'label=value[UOM]:[critical]:[warning]:[max]:[min]'
comma separated
- labels 1-19 characters long in class [a-zA-Z0-9_] (spaces allowed, but
not recommended)
- values, critical, warning, max, min in class [-0-9.]. No spaces.
- critical and warning are the thresholds for this measurement
- max and min are the maximum/minimum values for the measurement
I think this is easier to parse than the proposal from Ton based on
'magical' words.
Example
Disk space
DISK OK [22118452 kB (84%) free on /dev/hda3] [81574 kB (85%) free on
/dev/hda2] [252600 kB (100%) free on /dev/shm]|_dev_hda3=84%:10:25:100:0,
_dev_hda2=85%:10:25:100:0,_dev_shm=100%:10:25:100:0
For disk space and other plugins where the UOM is defined when the plugin
is called, use that active UOM for the value in the performance data. Notice
how the / is replaced with _ to ensure a valid RRD datasource name. It is
necessary to show the performance data for each disk in a disk set, not
only for the total as Ton proposes.
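
A rough sketch of the label mapping described above, in Python. The function
name rrd_label is hypothetical, and cutting the result at 19 characters is my
own assumption taken from the proposed label length limit; two long paths could
collide after the cut, so a real plugin would need to handle that case.

```python
import re


def rrd_label(name: str) -> str:
    """Map a name such as a device path onto the class [a-zA-Z0-9_].

    Every character outside the class (including '/') becomes '_'.
    Truncating to 19 characters is an assumption, not part of the proposal.
    """
    cleaned = re.sub(r"[^a-zA-Z0-9_]", "_", name)
    return cleaned[:19]


if __name__ == "__main__":
    print(rrd_label("/dev/hda3"))
    print(rrd_label("/dev/shm"))
```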
PING
PING OK - Packet loss = 0%, RTA = 1.00ms|packet_loss=0%:20:10:100:0,RTA=1ms:20:30::0
The empty max value for RTA is understood as undefined.
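
To show how the colon separated layout could be parsed, here is a minimal
Python sketch. The names PerfDatum and parse_perfdata are hypothetical, and it
assumes that everything after the first '|' is performance data and that the
UOM is whatever follows the number in the first field. Empty fields come back
as None, meaning undefined.

```python
import re
from typing import List, NamedTuple, Optional


class PerfDatum(NamedTuple):
    label: str
    value: float
    uom: str
    critical: Optional[float]
    warning: Optional[float]
    maximum: Optional[float]
    minimum: Optional[float]


# A number in class [-0-9.] followed by an optional unit of measurement.
_VALUE_RE = re.compile(r"^(-?[0-9]+(?:\.[0-9]+)?)(.*)$")


def _optional(field: str) -> Optional[float]:
    """An empty field means the value is undefined."""
    return float(field) if field else None


def parse_perfdata(plugin_output: str) -> List[PerfDatum]:
    """Parse 'label=value[UOM]:[critical]:[warning]:[max]:[min]' items."""
    _, sep, perf = plugin_output.partition("|")
    if not sep:
        return []  # no performance data in this output
    data = []
    for item in perf.split(","):
        item = item.strip()
        if not item:
            continue
        label, _, rest = item.partition("=")
        fields = rest.split(":")
        fields += [""] * (5 - len(fields))  # pad missing trailing fields
        match = _VALUE_RE.match(fields[0])
        if match is None:
            raise ValueError("bad value in %r" % item)
        data.append(PerfDatum(
            label.strip(), float(match.group(1)), match.group(2),
            _optional(fields[1]), _optional(fields[2]),
            _optional(fields[3]), _optional(fields[4]),
        ))
    return data


if __name__ == "__main__":
    line = ("PING OK - Packet loss = 0%, RTA = 1.00ms"
            "|packet_loss=0%:20:10:100:0,RTA=1ms:20:30::0")
    for datum in parse_perfdata(line):
        print(datum)
```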
Kjell Sundtjønn
"Voon, Ton" <Ton.Voon at egg.com>
Sent by: nagiosplug-devel-admin at lists.sourceforge.net
11.07.2003 16:10

To: NagiosPlug Devel <nagiosplug-devel at lists.sourceforge.net>
cc:
Subject: RE: [Nagiosplug-devel] RFC: Performance data guidelines

I'm starting to side with Kjell's and Karl's idea of labels being separate
from the units. I think that was the flaw in my original proposal - if we
can standardise on the units, then RRD generation should be fairly easy and
then you can keep labels descriptive and whatever you think is suitable for
a particular plugin.
So my amended proposal is:
- output of format 'label=value[UOM]' comma separated
- labels 1-19 characters long in class [a-zA-Z0-9_] (should spaces be
allowed?)
- special labels of warn, warnp, crit and critp (or just warn and crit with
different units?). These pass the threshold levels specified on the command
line. My idea on this is that you can then use RRD to draw yellow/red lines
to show where the warning levels are.
- values in class [-0-9.]. No spaces. Karl has a worry about returned values
from SNMP OIDs, but I think values should always be a number, so it can be
parsed to remove extraneous characters
- units one of:
no unit specified - assume a number (int or float) of things (users,
processes, load averages)
s - seconds (also, us, ms)
% - percentage
b - bytes (also kb, Mb, Tb)
c - a continuous counter (such as bytes transmitted on an interface) (Does
this interfere with a standard unit?)
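
As an illustration of how a consumer might read this layout, here is a small
Python sketch. The names parse_simple_perfdata, UNIT_SCALE and THRESHOLD_LABELS
are hypothetical. Treating kb, Mb and Tb as decimal multiples of bytes is an
assumption on my part, since the list above does not say whether they are
powers of 1000 or 1024.

```python
import re
from typing import Dict, Tuple

# unit -> (base unit, multiplier).
# ASSUMPTION: kb/Mb/Tb are decimal multiples; the proposal does not specify.
UNIT_SCALE = {
    "": ("", 1.0),
    "s": ("s", 1.0), "ms": ("s", 1e-3), "us": ("s", 1e-6),
    "%": ("%", 1.0),
    "b": ("b", 1.0), "kb": ("b", 1e3), "Mb": ("b", 1e6), "Tb": ("b", 1e12),
    "c": ("c", 1.0),
}

# The special labels that carry command-line thresholds.
THRESHOLD_LABELS = {"warn", "warnp", "crit", "critp"}

_ITEM_RE = re.compile(r"^([a-zA-Z0-9_]{1,19})=(-?[0-9]+(?:\.[0-9]+)?)(\D*)$")

Parsed = Dict[str, Tuple[float, str]]


def parse_simple_perfdata(perf: str) -> Tuple[Parsed, Parsed]:
    """Split 'label=value[UOM]' items into (metrics, thresholds).

    Each value is scaled to its base unit (seconds, bytes, ...).
    """
    metrics: Parsed = {}
    thresholds: Parsed = {}
    for item in perf.split(","):
        match = _ITEM_RE.match(item.strip())
        if match is None:
            raise ValueError("malformed item: %r" % item)
        label, number, unit = match.groups()
        if unit not in UNIT_SCALE:
            raise ValueError("unknown unit: %r" % unit)
        base, factor = UNIT_SCALE[unit]
        target = thresholds if label in THRESHOLD_LABELS else metrics
        target[label] = (float(number) * factor, base)
    return metrics, thresholds


if __name__ == "__main__":
    print(parse_simple_perfdata("users=2,warn=10,crit=20"))
```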
So some examples:
check_ping:
PING OK - Packet loss = 0%, RTA = 1.00 ms|packet_loss=0%,rta=1ms,warnp=10%,critp=20%
check_disk:
DISK OK [1150211 kB (57%) free on
/dev/dsk/c0t0d0s0]|free_percent=57%,free=1150Mb,warn=100Mb,warnp=10%
I still think that you do not need the total, used and used_percent because
these can be calculated from free and free_percent. I would also use free
rather than used because the lowest limit is 0 and the output shows free. I
think if you specify a set of disks, then data is returned for the total of
the disks.
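
To make the arithmetic concrete, here is a short Python sketch of recovering
the other figures from free and free_percent. The function name
derive_disk_usage is hypothetical. Note that the result is only approximate
when free_percent has been rounded, and that the total cannot be recovered at
all when free_percent is 0.

```python
from typing import Tuple


def derive_disk_usage(free: float, free_percent: float) -> Tuple[float, float, float]:
    """Return (total, used, used_percent) derived from free and free_percent.

    total = free / (free_percent / 100), used = total - free.
    The result is in whatever unit 'free' was given in.
    """
    if not 0 < free_percent <= 100:
        raise ValueError("free_percent must be in (0, 100]")
    total = free * 100.0 / free_percent
    used = total - free
    used_percent = 100.0 - free_percent
    return total, used, used_percent


if __name__ == "__main__":
    # Figures from the check_disk example: free=1150Mb, free_percent=57%
    total, used, used_percent = derive_disk_usage(1150, 57)
    print("total=%.0fMb used=%.0fMb used_percent=%.0f%%"
          % (total, used, used_percent))
```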
check_swap:
CRITICAL - Swap used: 18% (778368 out of
4194272)|free_percent=82%,free=778Mb,warnp=5%
check_load:
OK - load average: 0.03, 0.04, 0.05|load1=0.03,warn=1,crit=2
I think we should only return performance data for 1 set of timings,
otherwise it gets very complicated (on a side issue, is it possible to have
a plugin return % values instead of load levels?)
check_procs:
OK - 5 processes running with command name
/usr/local/apache/bin/httpd|processes=5,warn=10
Hmmm, this goes against my check_disk example of using 0 as a lower bound, as
check_procs can only be reported "upwards"
check_users:
USERS OK - 2 users currently logged in|users=2,warn=10,crit=20
Are we getting closer?
Ton