Grid Monitoring With Nagios: Aries Hung, Joanna Huang, Felix Lee, Min Tsai (ASGC), WLCG T2 Asia Workshop, TIFR, Dec 2, 2006
Agenda
• Nagios Overview
• Nagios Installation and Configuration
• Plugin Development
• ASGC Plugins
• SMS System
Grid Monitoring
Nagios Overview and Features I
Nagios Requirements
Nagios: Server Installation (1/3)
• Download the latest versions of the following packages from http://www.nagios.org/download/
• nagios-2.6.tar.gz
• nagios-plugins-1.4.5.tar.gz
• Create the necessary directories, permissions and user accounts to run Nagios
[root@nagios ~]# useradd nagios
[root@nagios ~]# mkdir /usr/local/nagios
[root@nagios ~]# mkdir /usr/local/nagios/libexec
[root@nagios ~]# chown -R nagios:nagios /usr/local/nagios
[root@nagios ~]# groupadd nagcmd
[root@nagios ~]# usermod -G nagcmd apache
[root@nagios ~]# usermod -G nagcmd nagios
[root@nagios ~]# chgrp -R nagcmd /usr/local/nagios/var/rw
Nagios: Server Installation (2/3)
Nagios: Server Installation (3/3)
Nagios: Server Configuration (1/5)
• Create a file named /usr/local/nagios/etc/nagios-server.conf and insert the following into it:
ScriptAlias /nagios/cgi-bin "/usr/local/nagios/sbin/"
<Directory "/usr/local/nagios/sbin/">
    Options ExecCGI
    AllowOverride None
    Order allow,deny
    Allow from all
    AuthName "Nagios Access"
    AuthType Basic
    AuthUserFile /usr/local/nagios/etc/htpasswd.users
    Require valid-user
</Directory>
Nagios: Server Configuration (2/5)
• Create a ‘nagiosadmin’ user account, whose credentials are requested when you access the Nagios web page
[root@nagios nagios-plugins-1.4.5]# htpasswd -c /usr/local/nagios/etc/htpasswd.users nagiosadmin
Nagios: Server Configuration (3/5)
• Rename the sample config files so that they become the active configuration files for Nagios
[root@nagios etc]# mv checkcommands.cfg-sample checkcommands.cfg
[root@nagios etc]# mv minimal.cfg-sample minimal.cfg
[root@nagios etc]# mv misccommands.cfg-sample misccommands.cfg
[root@nagios etc]# mv nagios.cfg-sample nagios.cfg
[root@nagios etc]# mv resource.cfg-sample resource.cfg
[root@nagios etc]# rm bigger.cfg-sample
Nagios: Server Configuration (4/5)
• Also change the line below in the above file so that the Total Processes service does not report an UNKNOWN error in the web UI
command_line $USER1$/check_procs -w $ARG1$ -c $ARG2$ -s $ARG3$
to
command_line $USER1$/check_procs -w $ARG1$ -c $ARG2$
• Restart Apache
[root@nagios etc]# service httpd restart
Nagios NRPE: Client Installation (2/2)
Nagios NRPE: Client Configuration
• Create the nagios user account and give it ownership of the directory where you installed the NRPE client
[root@nagiosclient ~]# useradd nagios
[root@nagiosclient ~]# chown -R nagios /usr/local/nagios
Nagios NRPE: Server Configuration (1/2)
• Add the services to the ‘services’ section of the /usr/local/nagios/etc/minimal.cfg file, e.g.
define service{
use generic-service ; service template
host_name nagiosclient
service_description PING
is_volatile 0
check_period 24x7
max_check_attempts 4
normal_check_interval 5
retry_check_interval 1
contact_groups admins
notification_options w,u,c,r
notification_interval 960
notification_period 24x7
check_command check_nrpe!check_local_disk!20%!10%!/
}
• Troubleshooting:
[root@nagios nagios-plugins-1.4.3]# /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
• When Nagios won't start, this check reports which file and which line it has a problem with
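The check_command value in the service definition above packs a command name and its arguments into one string separated by "!". The following is a minimal Python sketch of that splitting, assuming the usual $ARGn$ macro convention; the helper names split_check_command and expand are hypothetical, and this is an illustration rather than Nagios's own implementation.

```python
def split_check_command(check_command):
    """Split 'name!arg1!arg2...' into the command name and a macro table."""
    name, *args = check_command.split("!")
    # Arguments are numbered from 1: $ARG1$, $ARG2$, ...
    macros = {f"$ARG{i}$": value for i, value in enumerate(args, start=1)}
    return name, macros


def expand(command_line, macros):
    """Substitute each $ARGn$ macro found in a command_line template."""
    for macro, value in macros.items():
        command_line = command_line.replace(macro, value)
    return command_line


name, macros = split_check_command("check_nrpe!check_local_disk!20%!10%!/")
# name is "check_nrpe"; macros maps $ARG1$..$ARG4$ to
# "check_local_disk", "20%", "10%" and "/"
```

Macros other than $ARGn$ (for example $USER1$) are left untouched by this sketch.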
Developing Nagios Plugins (1/2)
• Nagios will only grab the first line of text from STDOUT
• Stay within 80 characters
• This will be used for text messages or paging
• All ASGC plugins write their results to a log file to capture additional error messages
• Testing plugin
• Add a -v option for increased verbosity
• Create unit tests that simulate failures when real ones don't exist
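The output rules above (one line on STDOUT, at most 80 characters, details kept in a log file) could be sketched in Python roughly as follows. The function names format_status and report and the log_path parameter are hypothetical; this is not the ASGC implementation.

```python
MAX_STATUS_LEN = 80  # the slide's guidance: stay within 80 characters


def format_status(state, summary, max_len=MAX_STATUS_LEN):
    """Return a single-line status message no longer than max_len."""
    # Collapse newlines and repeated whitespace so only one line is produced.
    line = " ".join(f"{state}: {summary}".split())
    if len(line) > max_len:
        line = line[: max_len - 3] + "..."
    return line


def report(state, summary, details="", log_path=None):
    """Print the short status line; append the full details to a log file."""
    if log_path is not None and details:
        with open(log_path, "a", encoding="utf-8") as log:
            log.write(f"{state}: {summary}\n{details}\n")
    line = format_status(state, summary)
    print(line)
    return line
```

Keeping the long error text in the log rather than on STDOUT means the same short line can be reused for text messages or paging.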
Developing Nagios Plugins (2/2)
• Return Codes:
• 0: OK
• 1: Warning
• 2: Critical
• 3: Unknown – low level internal plugin errors (invalid arguments)
• Standard Options
• List of standard options to give nagios plugins a more consistent interface
• -H hostname, -t timeout, …etc.
• http://nagiosplug.sourceforge.net/developer-guidelines.html#AEN304
• Document Plugin
• List user requirements for plugins
• Tests executed by plugin
• Specify plugin arguments and usage information
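A plugin skeleton that follows the return codes and standard options listed above might look like the sketch below. The run_check body is a placeholder and every function name is hypothetical. One detail worth noting: Python's argparse exits with status 2 on invalid arguments by default, which Nagios would read as Critical, so the sketch overrides error() to return 3 (Unknown) instead.

```python
import argparse
import sys

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


class PluginArgumentParser(argparse.ArgumentParser):
    def error(self, message):
        # Invalid arguments are an internal plugin error: report UNKNOWN (3).
        print(f"UNKNOWN: {message}")
        sys.exit(UNKNOWN)


def parse_args(argv=None):
    parser = PluginArgumentParser(description="Example check (hypothetical)")
    parser.add_argument("-H", "--hostname", required=True, help="host to check")
    parser.add_argument("-t", "--timeout", type=int, default=10,
                        help="seconds before the check gives up")
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="increase verbosity")
    return parser.parse_args(argv)


def run_check(hostname, timeout):
    """Placeholder: a real plugin would test the service here."""
    return OK, f"{hostname} checked (placeholder, timeout {timeout}s)"


def main(argv=None):
    args = parse_args(argv)
    try:
        code, message = run_check(args.hostname, args.timeout)
    except Exception as exc:  # unexpected failure inside the plugin itself
        code, message = UNKNOWN, f"internal plugin error: {exc}"
    print(f"{LABELS[code]}: {message}")
    return code
```

A real plugin file would end by calling sys.exit(main()) so that the return code reaches Nagios as the process exit status.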
Nagios Plugins from ASGC (1/2)
• init_vomsproxy
• Checks voms-proxy-init by creating a proxy on the Nagios host for GRID access
• check_CE
• Checks globus-job-run by issuing a job request to the CE host to test functionality
• check_GridFTP
• Checks functionality of GRID ftp services for the given host by copying a test file and then deleting it
• check_LFC
• Checks GRID Information Provider
• Checks Catalog functionality
• Checks copy-register (lcg-cr) functionality
Nagios Plugins from ASGC (2/2)
• check_SRM
• Checks functionality of SRM services for the specified host by copying a test file and then deleting it
• check_GStatUpdate
• Checks whether GStat is being updated on a timely basis
• check_HostCert
• Checks whether the host public certificate is valid against the trusted CAs
• Checks whether the host certificate is about to expire
NRPE Plugins from ASGC
• check_TimeSync
• Uses the ntpdate program to query the given NTP server for the date and time
• Generates an alert if the time offset exceeds the warning or critical threshold value
• If time is not in sync, then GSI security will fail
• check_CApkg
• Checks to see if CA packages are up-to-date
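The threshold logic described for check_TimeSync can be illustrated with a short Python sketch. The default thresholds (1 and 5 seconds) and the function name classify_offset are assumptions made for illustration, not values taken from the ASGC plugin, and the offset is passed in directly rather than obtained from ntpdate.

```python
OK, WARNING, CRITICAL = 0, 1, 2


def classify_offset(offset_seconds, warning=1.0, critical=5.0):
    """Map a signed clock offset in seconds to a Nagios return code and message."""
    if warning > critical:
        raise ValueError("warning threshold must not exceed critical threshold")
    # A clock that runs behind is as much a problem as one that runs ahead.
    magnitude = abs(offset_seconds)
    if magnitude > critical:
        return CRITICAL, f"CRITICAL: time offset {offset_seconds:+.3f}s"
    if magnitude > warning:
        return WARNING, f"WARNING: time offset {offset_seconds:+.3f}s"
    return OK, f"OK: time offset {offset_seconds:+.3f}s"
```

Comparing the absolute value means the same thresholds apply in both directions.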
Installing ASGC Plugins on Nagios Server
• Installation and Configuration on the Nagios server
• Installation of UI software
Installing ASGC Plugins on NRPE Client
• The following ASGC plugins (implemented in Python) are currently available
check_TimeSync.py check_CApkg.py check_HostCert.py
Plugin Troubleshooting
• Service check timed out
• Nagios plugin:
• Increase the service_check_timeout value, which applies to all service checks that run (nagios.cfg)
• NRPE plugin:
• Increase the check_nrpe -t timeout to see whether the problem goes away (checkcommands.cfg or )
• Incorrect environment variables cause the SRM checks to use the wrong path
• GridFTP service checks failed on TW-FTT DPM hosts, which reported an error message about processing the certificate
• Issue with VOMS proxy
• You can create proxies with long lifetimes
• but the extension information only shows 24 hours
• Keep the proxy lifetime under 24 hours and the problem goes away
• Proxy problems
• The proxy is not valid long enough (3 hours) to run globus jobs for CE checking
• Re-initialize the proxy when its remaining lifetime is 3 hours or less
• System time differs between the checked host and the Nagios host
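The two proxy rules above (renew at 3 hours or less remaining, request less than 24 hours) can be written down as a small Python sketch. The function names and the whole-hour granularity are assumptions for illustration. The sketch only decides what to do and does not call voms-proxy-init itself.

```python
RENEW_AT_HOURS = 3        # from the slide: re-init at 3 hours or less remaining
LIFETIME_CAP_HOURS = 24   # from the slide: keep the lifetime under 24 hours


def needs_renewal(remaining_seconds, renew_at_hours=RENEW_AT_HOURS):
    """True when the proxy has renew_at_hours or less left."""
    return remaining_seconds <= renew_at_hours * 3600


def requested_lifetime(desired_hours, cap_hours=LIFETIME_CAP_HOURS):
    """Clamp a desired lifetime (whole hours) to stay strictly under the cap."""
    return max(1, min(desired_hours, cap_hours - 1))
```

With these defaults a request for 48 hours is reduced to 23, and a proxy with exactly 3 hours left is renewed.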
SMS System
• Short Message Service (SMS) can send and receive short messages through GSM modems or mobile phones
• Nagios can use SMS for contact notifications when service or host problems occur
• Set the notification thresholds carefully when sending SMS from Nagios
• SMS sending is configured in misccommands.cfg: define a command that talks to your SMS notification software, such as sendsms or sms_client
• 24x7 operations centers can use Nagios with SMS to manage grid resources more effectively and efficiently
Thanks for Your Attention
Reference Links
• Download Nagios
• http://www.nagios.org/download/
• Nagios Documentation
• http://www.nagios.org/docs/
• Plug-in development guidelines
• http://nagiosplug.sourceforge.net/developer-guidelines.html
• Nagios Screenshots
• http://www.nagios.org/about/screenshots.php
• Nagios FAQ
• http://www.nagios.org/faqs/
• The 3rd Party Plugin Repository
• http://www.nagiosexchange.org/