Thursday, November 26, 2009

Quick-n-Dirty Nagios Plugins

In the spirit of the holidays -- and of spending more time at home, away from the server room -- I'd like to present a sample script from the many that I use to monitor my computers' collective health (from home, the beach, or my mythical cabin in the woods). Nagios itself comes with a rich collection of plugins, and plenty of people have written additional check scripts for most common services. At some point, though, a system administrator is likely to be responsible for services that, for whatever reason, aren't covered by existing scripts, so having a template for whipping up your own script can be invaluable.

Before I go any further, a couple of disclaimers: nothing here should be taken as a style guide for writing shippable, enterprise-class Nagios plugins; note the title: "Quick-n-Dirty". The example script I've chosen, which checks the status of a Windows DHCP Server address pool, is production-worthy but fairly case-specific. There is no --usage flag, the threshold values are hard coded into the script, and the arguments must be given positionally. Each of these could easily be fixed, but even the small cost of doing so isn't justifiable in my environment.

check_dhcp.txt

Basic requirements:
  • Python

  • SSH keyed access to the Windows server -- I use Bitvise WinSSHD, which costs money in a commercial environment, but there are other options, such as FreeSSHd.

  • Nagios, obviously, is needed to get much value out of this script

Usage:
./check_dhcp <hostname> <subnet>

First, a little bit of Python boilerplate code:
#!/usr/bin/env python
import os, sys, re

Next, a few variables to keep track of what we report to Nagios:
STATUS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]
status = "UNKNOWN"
service = "DHCP"

The STATUS list maps each allowable status value to its exit code by position (OK is 0, WARNING is 1, CRITICAL is 2, UNKNOWN is 3), and the initial status is set to UNKNOWN, which is what the script should report if anything goes wrong with the script itself during execution. Finally, I set the service name here, though it's not strictly needed.
warn_level = .8
crit_level = .9

used = free = total = 0
used_percent = 1

The first two variables above are the hard coded threshold values I mentioned earlier, and the rest are the variables I'll use to keep track of the status of the DHCP pool. Note below, where we set the command that we'll actually run to get the status data, that the location of ssh is hard coded, and might need to be adjusted for your environment:
check_cmd = r'/usr/bin/ssh %(server)s "netsh dhcp server show mibinfo"' % {"server": sys.argv[1]}
subnet = sys.argv[2]
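Since the hostname is interpolated straight into a shell command string, one alternative worth considering is to build the command as an argument list and skip the local shell entirely. The sketch below is written for Python 3.7+ (the scripts in this post are Python 2), and `SSH_PATH`, `build_check_cmd`, and `run_check_cmd` are names invented for illustration, not part of the original script:

```python
import subprocess

# Assumption: ssh lives here, as in the original script; adjust as needed.
SSH_PATH = "/usr/bin/ssh"

def build_check_cmd(server):
    """Return the ssh invocation as an argument list (hypothetical helper)."""
    # Each list element is handed to ssh as one argument, so the local
    # shell never gets a chance to interpret characters in `server`.
    return [SSH_PATH, server, "netsh dhcp server show mibinfo"]

def run_check_cmd(server, timeout=30):
    """Run the command and return its stdout as text."""
    result = subprocess.run(build_check_cmd(server), capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout
```

The timeout is an addition of this sketch; without one, a hung ssh connection would leave the check hanging until Nagios kills it.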

The next section parses the output of the check command line by line, looking first for the specified subnet, then for the usage data for that subnet:
ss = None
for ll in os.popen3(check_cmd)[1].readlines():
    rr = re.match(r"\s*Subnet = (?P<ss>[\d\.]+)\.", ll)
    if (rr):
        ss = rr.group("ss")
        if (ss != subnet):
            ss = None
    rr = re.match(r"\s*(?P<kk>[^=]+) = (?P<vv>[\d]+)", ll)
    if (ss and rr):
        if (rr.group("kk") == "No. of Addresses in use"):
            used = int(rr.group("vv"))
        if (rr.group("kk") == "No. of free Addresses"):
            free = int(rr.group("vv"))
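To try the parsing logic without a DHCP server on hand, the same loop can be wrapped in a function and fed canned text. This is a rough Python 3 sketch: `parse_pool` is a made-up name, and the sample input is invented to fit the two regular expressions above rather than captured from a real `netsh` run, so the real output may contain other lines:

```python
import re

SUBNET_RE = re.compile(r"\s*Subnet = (?P<ss>[\d\.]+)\.")
VALUE_RE = re.compile(r"\s*(?P<kk>[^=]+) = (?P<vv>[\d]+)")

def parse_pool(lines, subnet):
    """Return (used, free) for `subnet`, or (0, 0) if it never appears."""
    used = free = 0
    current = None  # the subnet whose block we are inside, if it is ours
    for line in lines:
        m = SUBNET_RE.match(line)
        if m:
            current = m.group("ss") if m.group("ss") == subnet else None
        m = VALUE_RE.match(line)
        if current and m:
            if m.group("kk") == "No. of Addresses in use":
                used = int(m.group("vv"))
            elif m.group("kk") == "No. of free Addresses":
                free = int(m.group("vv"))
    return used, free

# Illustrative input only: shaped to fit the regexes, not real netsh output.
SAMPLE = """\
Subnet = 10.0.0.0.
    No. of Addresses in use = 12.
    No. of free Addresses = 88.
Subnet = 10.0.1.0.
    No. of Addresses in use = 40.
    No. of free Addresses = 10.
"""

print(parse_pool(SAMPLE.splitlines(), "10.0.1.0"))
```

Keeping the parser free of I/O like this makes it easy to check against a saved copy of the command's output before pointing the plugin at a live server.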

Now we calculate the value we're looking for -- the percentage of address space that's currently being leased:
total = used + free
if (total > 0):
    used_percent = float(used) / total
detail = "%d of %d addresses in use; %d%% utilization" % (used, total, used_percent * 100)

Finally, we compare the current utilization with the thresholds we've set and set the output status accordingly, then build a status message and look up the return code in the STATUS list:
if (used_percent >= crit_level):
    status = "CRITICAL"
elif (used_percent >= warn_level):
    status = "WARNING"
elif (used_percent == 0):
    status = "UNKNOWN"
else:
    status = "OK"

print "%s %s: (%s) %s" % (service, status, subnet, detail)
sys.exit(STATUS.index(status))
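The calculation and threshold comparison can also be gathered into one function, which makes the rules easy to exercise with a few numbers. This is a minimal Python 3 sketch that mirrors the logic above; the function name `classify` and its keyword arguments are assumptions of the sketch:

```python
STATUS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]

def classify(used, free, warn_level=0.8, crit_level=0.9):
    """Return (status, exit_code, detail) using the same rules as the script."""
    total = used + free
    # As in the script, utilization defaults to 1 (100%) when nothing was
    # parsed, so an empty result comes out CRITICAL rather than OK.
    used_percent = float(used) / total if total > 0 else 1
    detail = "%d of %d addresses in use; %d%% utilization" % (
        used, total, used_percent * 100)
    if used_percent >= crit_level:
        status = "CRITICAL"
    elif used_percent >= warn_level:
        status = "WARNING"
    elif used_percent == 0:
        status = "UNKNOWN"
    else:
        status = "OK"
    # The position in STATUS doubles as the Nagios exit code.
    return status, STATUS.index(status), detail

print(classify(229, 276))
```

Note the two edge cases this inherits from the script: a pool with no parsed data reports CRITICAL, and a pool with addresses but zero leases reports UNKNOWN.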

And that's all there is to it. This script returns status messages like the following:
DHCP OK: (150.135.220.0) 229 of 505 addresses in use; 45% utilization
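For anyone who does want to fix the shortcuts mentioned in the disclaimers (hard-coded thresholds, positional-only arguments, no usage message), Python's `argparse` module covers all three with little code. This is only a sketch in Python 3; the flag names `-w`/`--warning` and `-c`/`--critical` and the function name `parse_args` are choices made here, not something the original script defines:

```python
import argparse

def parse_args(argv=None):
    """Parse plugin arguments; the flag names are illustrative choices."""
    parser = argparse.ArgumentParser(
        description="Check DHCP address pool utilization.")
    parser.add_argument("hostname", help="DHCP server to query over ssh")
    parser.add_argument("subnet", help="subnet to report on")
    parser.add_argument("-w", "--warning", type=float, default=0.8,
                        help="warning threshold as a fraction (default 0.8)")
    parser.add_argument("-c", "--critical", type=float, default=0.9,
                        help="critical threshold as a fraction (default 0.9)")
    return parser.parse_args(argv)

# Hypothetical invocation: "dhcp1" and the subnet are placeholder values.
args = parse_args(["dhcp1", "10.0.1.0", "-w", "0.7"])
print(args.hostname, args.subnet, args.warning, args.critical)
```

One thing to keep in mind: on a bad command line `argparse` exits with status 2, which Nagios reads as CRITICAL, so a misconfigured check would show up as a critical alert rather than UNKNOWN.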

As you've probably noticed, about half of this script is fairly standard structural stuff which translates pretty well to most other Nagios check scripts; the DHCP-specific code all happens in the middle of the script. Below is a script for checking that a Solaris LDAP server is working that uses pretty much the same template, but is stripped down even further:
#!/usr/bin/env python
import os, sys, re

STATUS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]
status = "OK"
service = "LDAP"

check_cmd = r'/usr/bin/ldaplist passwd nagios'
detail = os.popen3(check_cmd)[1].readline().rstrip()

if (detail.find("uid=nagios") >= 0):
    status = "OK"
else:
    status = "CRITICAL"

print "%s %s: %s" % (service, status, detail)
sys.exit(STATUS.index(status))
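Since the structural half is the same every time, it could be pulled into a small helper that also reports UNKNOWN when the check itself blows up. What follows is a hedged Python 3 sketch, not how the scripts above are actually organized; `run_check` and `fake_ldap_check` are hypothetical names, and the fake check stands in for the `ldaplist` call so the example runs anywhere:

```python
STATUS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]

def run_check(service, check):
    """Run `check()` and return (exit_code, message).

    `check` is any callable returning a (status, detail) pair.
    """
    try:
        status, detail = check()
        if status not in STATUS:
            raise ValueError("unexpected status %r" % (status,))
    except Exception as exc:
        # Anything that goes wrong inside the check itself is UNKNOWN.
        status, detail = "UNKNOWN", "check failed: %s" % exc
    return STATUS.index(status), "%s %s: %s" % (service, status, detail)

def fake_ldap_check():
    # Stand-in for the ldaplist call, for illustration only.
    return "OK", "uid=nagios found"

code, message = run_check("LDAP", fake_ldap_check)
print(message)
```

A real plugin would print the message and then finish with `sys.exit(code)`, just as the scripts above do.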

Lastly, here's a script (which I'm pretty sure must have been written by one of my coworkers) that uses Python's subprocess module to pipe commands together, which allows for some more complex processing of status information:
#!/usr/bin/env python
import subprocess, sys

STATUS = ["OK", "WARNING", "CRITICAL", "UNKNOWN"]
status = "OK"
service = "Mailman"

p1 = subprocess.Popen(["/usr/bin/ps", "-eaf"], stdout=subprocess.PIPE)
p2 = subprocess.Popen(["grep", "[m]ailmanctl"], stdin=p1.stdout, stdout=subprocess.PIPE)
p1.stdout.close()  # let ps receive SIGPIPE if grep exits first
p2.communicate()   # drain grep's output so a full pipe can't block it
return_code = p2.returncode

if (return_code == 0):
    status = "OK"
    detail = "Master daemon found."
else:
    status = "CRITICAL"
    detail = "Master daemon not found."

print "%s %s: %s" % (service, status, detail)
sys.exit(STATUS.index(status))
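When the filter is as simple as this one, another option is to run only `ps` and do the matching in Python, which also removes the need for the `[m]` trick, since there is no grep process to match itself. Here is a small Python 3 sketch of just the matching step; `daemon_running` is a hypothetical name, and the sample text is made up for illustration rather than copied from real `ps` output:

```python
def daemon_running(ps_output, name="mailmanctl"):
    """Return True if any line of `ps_output` mentions `name`."""
    return any(name in line for line in ps_output.splitlines())

# Illustrative text only, not real ps output.
SAMPLE_PS = """\
UID   PID  PPID  CMD
root    1     0  /sbin/init
list  412     1  /usr/bin/python /usr/lib/mailman/bin/mailmanctl start
"""

print(daemon_running(SAMPLE_PS))
print(daemon_running(SAMPLE_PS, "slapd"))
```

In a real check, the text would come from something like `subprocess.run(["/usr/bin/ps", "-eaf"], capture_output=True, text=True).stdout`; the pipeline version above remains the better fit when the intermediate commands do real work.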

Hopefully, this sampling of scripts gives you a good starting point for whipping up some more scripts that are helpful in your own environment; if you come up with anything particularly innovative or interesting, I'd love to hear about it!
