Nagios Troubleshooting Guide
This guide addresses common issues related to the Nagios integration. Depending on your integration type, you may run into errors specific to your environment.
General Configuration
If Nagios notifications are not triggering PagerDuty incidents as expected, the following items apply to all integration types.
Your Nagios host or service may not be reaching a HARD down state
Events are only sent to PagerDuty when your service or host changes state to HARD. Typically, a host or service will first enter a SOFT state, and only transition to HARD after it reaches its max_check_attempts limit.
For more information, please see Nagios’ State Types documentation.
To verify whether this is happening:
- Check your logs:
  - Debian/Ubuntu: /var/log/syslog
  - RHEL/CentOS: /var/log/messages
- Run grep pagerduty <log path> to see notifications sent to PagerDuty.
This is an example of a SOFT down, which would not trigger an incident in PagerDuty:
Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE ALERT: localhost;Current Users;WARNING;SOFT;1;USERS WARNING - 2 users currently logged in
This is an example of a HARD down, which should trigger incidents in PagerDuty:
Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: pagerduty;localhost;Current Users;WARNING;notify-service-by-pagerduty;USERS WARNING - 3 users currently logged in
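To separate the two cases quickly, the grep above can be wrapped in small shell helpers. This is a minimal sketch: the function names pd_notifications and soft_alerts are illustrative, and the patterns assume the syslog format shown in the examples and a contact named pagerduty.

```shell
# Print only the notification lines addressed to the "pagerduty" contact.
# Reads log text on stdin.
pd_notifications() {
    grep -E 'NOTIFICATION: pagerduty;'
}

# Print SOFT-state alerts, which are not sent to PagerDuty.
soft_alerts() {
    grep -E '(SERVICE|HOST) ALERT: .*;SOFT;'
}

# Usage (Debian/Ubuntu log path; use /var/log/messages on RHEL/CentOS):
#   pd_notifications < /var/log/syslog
#   soft_alerts < /var/log/syslog
```

If soft_alerts prints lines for a check but pd_notifications does not, the check has most likely not reached its max_check_attempts limit yet.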
Confirm that your PagerDuty contact is configured properly
The pagerduty contact might not be configured to receive notifications.
To check this, run grep NOTIFICATION <log path>.
If, as in the example below, "pagerduty" is not listed in your logs, make sure that the pagerduty contact is included in the contact group that is configured to receive notifications under the service or host template:
Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: root;localhost;Current Users;CRITICAL;notify-service-by-email;USERS CRITICAL - 5 users currently logged in
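To see at a glance which contacts are actually being notified, the log can be summarized per contact. The helper below is a sketch under the assumption that notification lines follow the format in the example above; the name notified_contacts is illustrative.

```shell
# Count notification lines per contact. Reads log text on stdin and prints
# one "contact count" pair per line, sorted by contact name.
notified_contacts() {
    awk -F'NOTIFICATION: ' 'NF > 1 {
        split($2, f, ";")      # the contact is the first ;-separated field
        count[f[1]]++
    }
    END { for (c in count) print c, count[c] }' | sort
}

# Usage:
#   notified_contacts < /var/log/syslog
```

If pagerduty is missing from the output while other contacts such as root appear, the contact group membership is the likely cause.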
Nagios XI vs. Nagios Core file paths
If you use Nagios XI, paths will differ from Nagios Core. Additionally, configuration is managed primarily through the Nagios XI web interface, as opposed to Nagios Core’s configuration files. Please refer to the Nagios XI Integration Guide for further details.
If you use the default configuration, open the file that contains the pagerduty contact to confirm it is included in the correct contact group:
- Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/contacts_nagios2.cfg
- RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/contacts.cfg
If you use the default configuration, open the following file to make sure that the pagerduty contact is defined properly:
- Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/pagerduty_nagios.cfg
- RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/pagerduty_nagios.cfg
If you use the default configuration, open the following files to confirm that the host or service template being used is contacting the correct group:
- Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/generic-service_nagios2.cfg, /etc/nagios3/conf.d/generic-host_nagios2.cfg
- RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/generic-service_nagios2.cfg, /etc/nagios/objects/generic-host_nagios2.cfg
If you change any of the files above, restart Nagios (on Red Hat-derived systems, the service is typically named nagios rather than nagios3):
/etc/init.d/nagios3 restart
or
service nagios3 restart
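Before restarting, it can help to let Nagios verify the configuration so that a typo does not take monitoring down. The sketch below relies on the -v option of the Nagios binary, which checks a main configuration file. The function name and the binary path in the usage line are assumptions; adjust them for your system.

```shell
# Restart Nagios only if its own configuration check passes.
# Arguments: <nagios binary> <main config file> <service name>
verify_then_restart() {
    nagios_bin=$1
    main_cfg=$2
    service_name=$3
    if "$nagios_bin" -v "$main_cfg" >/dev/null; then
        service "$service_name" restart
    else
        echo "configuration check failed; not restarting" >&2
        return 1
    fi
}

# Usage on Debian/Ubuntu (paths assumed from the default nagios3 layout):
#   verify_then_restart /usr/sbin/nagios3 /etc/nagios3/nagios.cfg nagios3
```

When the check fails, run the binary with -v directly to read the full error output.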
[ERROR] NOTIFICATIONTYPE field must be present
The PagerDuty integration accepts the following Nagios notifications:
PROBLEM
ACKNOWLEDGEMENT
RECOVERY
Other event types (e.g., FLAPPINGSTART and FLAPPINGSTOP) are not supported, and will result in a NOTIFICATIONTYPE error.
Please also note that sending a custom notification manually through the Nagios UI will not trigger an incident, as the integration does not support custom notifications.
If you are using the agentless integration and would like to receive FLAPPINGSTART and FLAPPINGSTOP events, you can update the enqueue_event subroutine in the pagerduty_nagios.pl script (below line 235):
# Report the start of flapping as a PROBLEM
if ($event{"NOTIFICATIONTYPE"} eq "FLAPPINGSTART") {
    $event{"NOTIFICATIONTYPE"} = "PROBLEM";
}
# Report the end of flapping as a RECOVERY
if ($event{"NOTIFICATIONTYPE"} eq "FLAPPINGSTOP") {
    $event{"NOTIFICATIONTYPE"} = "RECOVERY";
}
Make sure that you have enabled flapping notifications in your pagerduty_nagios.cfg file under the service_notification_options and/or host_notification_options fields.
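In Nagios contact definitions, flapping notifications are enabled by the f option. The helper below is a rough sketch for checking whether f is present; the function name flapping_enabled is illustrative, and it assumes each option list sits on a single line, as in the default pagerduty_nagios.cfg.

```shell
# For each *_notification_options line in a contact file, print the
# directive name followed by "yes" or "no" depending on whether the
# option list contains "f" (flapping).
flapping_enabled() {
    awk '$1 ~ /^(service|host)_notification_options$/ {
        opts = ""
        for (i = 2; i <= NF; i++) opts = opts $i   # rejoin lists written with spaces
        sub(/;.*/, "", opts)                       # drop a trailing ; comment
        n = split(opts, o, ",")
        found = "no"
        for (i = 1; i <= n; i++) if (o[i] == "f") found = "yes"
        print $1, found
    }' "$1"
}

# Usage (Debian/Ubuntu path from this guide):
#   flapping_enabled /etc/nagios3/conf.d/pagerduty_nagios.cfg
```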
Perl-Based Integration
Tip: Use the Perl integration if you use CentOS 5 or lower.
Trigger a test incident to make sure that the Perl script will run
- Manually trigger a Nagios incident with the Perl script to make sure it runs.
- Run the script as the user that runs Nagios (typically the "nagios" user). If you are not logged in as that user, add sudo -u nagios to your command.
[ERROR] Nagios event in file /tmp/pagerduty_nagios/pd_12334543223_1235.txt DEFERRED due to network/server problems.
If your server is behind a proxy, you will need to specify it when executing the Perl script. Add the following switch to the Nagios command that calls the script, as well as your cron job:
--proxy https://my.proxy.com:<port>
Also, verify that the Perl libraries for SSL are installed (typically step 1 of the integration guide).
- For Debian-based systems (e.g., Ubuntu):
  aptitude install libwww-perl libcrypt-ssleay-perl
- For RHEL-based systems (e.g., CentOS, Fedora):
  yum install perl-libwww-perl perl-Crypt-SSLeay
Then run the following:
sudo -u nagios <path to perl script> flush --verbose
If you get a 500 response of "Can't verify SSL peers without knowing which Certificate Authorities to trust", install the Mozilla::CA module by running the following command:
cpanm Mozilla::CA
[ERROR] May 16 07:12:46 sw-cloud pagerduty_nagios[32356]: open /tmp/pagerduty_nagios/pd_1337123566_32999.txt for write failed: Illegal seek
This error means that the user running Nagios does not have write permission on the /tmp/pagerduty_nagios/ directory. The easiest fix is to delete the directory. Note that this removes any queued alerts:
rm -rf /tmp/pagerduty_nagios
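To confirm the diagnosis before deleting anything, you can test write access as the Nagios user. This is a sketch that assumes sudo is available and that Nagios runs as the user you pass in; the function name check_queue_dir is illustrative.

```shell
# Report whether <user> can write to <dir>. Returns non-zero if not.
check_queue_dir() {
    user=$1
    dir=$2
    if sudo -u "$user" test -w "$dir"; then
        echo "$user can write to $dir"
    else
        echo "$user cannot write to $dir"
        ls -ld "$dir" 2>/dev/null || true   # show current owner and mode, if it exists
        return 1
    fi
}

# Usage:
#   check_queue_dir nagios /tmp/pagerduty_nagios
```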
[ERROR] File was rejected because could not find CONTACTPAGER
If you see this error, enable environment macros by setting enable_environment_macros=1 in your nagios.cfg file:
- Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/nagios.cfg
- RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/nagios.cfg
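A quick way to check the current value is to grep the main configuration file. The helper below is a minimal sketch; the name env_macros_enabled is illustrative, and it ignores commented-out lines.

```shell
# Succeeds only if the file contains an active enable_environment_macros=1 line.
env_macros_enabled() {
    grep -Eq '^[[:space:]]*enable_environment_macros[[:space:]]*=[[:space:]]*1[[:space:]]*$' "$1"
}

# Usage (Debian/Ubuntu path):
#   env_macros_enabled /etc/nagios3/nagios.cfg && echo enabled || echo "not enabled"
```

After changing the setting, restart Nagios so that it takes effect.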
Agent-Based Integration
Below are some issues that may arise with an agent-based integration while using the PagerDuty Agent.
Trigger a test incident to make sure that the agent works
Manually trigger a Nagios incident with the pd-nagios command to make sure the agent is working.
Replace YOUR-INTEGRATION-KEY-HERE with your actual integration key in the commands below:
sudo -u nagios /usr/share/pdagent-integrations/bin/pd-nagios -n service -k YOUR-INTEGRATION-KEY-HERE -t "PROBLEM" -f SERVICEDESC="test_description" -f SERVICESTATE="CRITICAL" -f HOSTNAME="test_host_name" -f SERVICEOUTPUT="test_service_output"
Alternatively, you can use the pd-send command to trigger an incident.
Here is an example event to trigger an incident using pd-send:
~$ export PD_INTEGRATION_KEY=YOUR-INTEGRATION-KEY-HERE
~$ pd-send -k "$PD_INTEGRATION_KEY" -t trigger -d "Server is on fire" -i server.fire
Event processed. Incident Key: server.fire
[ERROR] Error Performing CheckSum
This is an installation error on CentOS 5 and below. The agent supports CentOS 6 and higher. If you are running CentOS 5 or below, please use the Agentless Nagios Integration Guide.
Agent is not running
Check to make sure that the PD agent is running. To do this, run the following command:
service pdagent status
If the status is "not running", then start the PD agent with the following command:
service pdagent start
Outdated agent version
If you see something similar to the following in your logs, then you will need to update to the latest version of the agent:
[1417765072] wproc: stderr line 01: Traceback (most recent call last):
[1417765072] wproc: stderr line 02: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 188, in <module>
[1417765072] wproc: stderr line 03: main()
[1417765072] wproc: stderr line 04: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 117, in main
[1417765072] wproc: stderr line 05: details = parse_fields(args.fields)
[1417765072] wproc: stderr line 06: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 177, in parse_fields
[1417765072] wproc: stderr line 07: return dict(f.split("=", 2) for f in fields)
Bi-Directional Integration
The bidirectional integration uses a CGI script to capture webhooks and translate them into commands that Nagios runs to add the acknowledgment note.
You may wish to capture an incident acknowledgment webhook so that you can resend it manually (for example, via curl or Postman) for iterative testing and log-checking. To do this, create a webhook on your PagerDuty service, point it to a temporary pipedream.com URL to capture the JSON body, and then acknowledge an incident that was raised from Nagios.
Once you have the JSON content of the webhook, you will be able to send the same webhook after each change and troubleshooting step attempted, without having to repeat the full process of raising an alert in Nagios and acknowledging it in PagerDuty. This allows more rapid testing and diagnosis of the CGI script that processes webhooks from PagerDuty.
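Replaying the captured webhook can be scripted with curl. The sketch below assumes you saved the captured JSON body to a local file; the function name, the file name, and the URL in the usage line are placeholders, not values from the integration.

```shell
# POST a saved webhook body to the CGI script and print the HTTP status code.
# Arguments: <json file> <url of pagerduty.cgi>
replay_webhook() {
    curl -sS -o /dev/null -w '%{http_code}\n' \
        -X POST \
        -H 'Content-Type: application/json' \
        --data-binary "@$1" "$2"
}

# Usage (both arguments are hypothetical examples):
#   replay_webhook ack-webhook.json https://nagios.example.com/cgi-bin/pagerduty.cgi
```

Printing only the status code keeps repeated runs easy to compare while you work through the cases below.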
CGI script cannot execute
Once you have put the script in place, try opening it in an HTTP client, for example, curl or a web browser, with a GET request. You should receive a 400 error along with the message: 400 Requests must be POSTs
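The status code returned by that GET request points to the relevant case in this section. The lookup below is only a summary sketch of those cases; the function name explain_status and the URL in the usage lines are placeholders.

```shell
# Map the HTTP status of a GET request to pagerduty.cgi to its likely cause.
explain_status() {
    case "$1" in
        400) echo "script is executing (GET requests are rejected as expected)" ;;
        401) echo "authorization required, e.g. HTTP Basic Auth" ;;
        403) echo "web server cannot read or execute the script" ;;
        500) echo "script exited with an error; check Perl modules and syntax" ;;
        *)   echo "not covered by this guide" ;;
    esac
}

# Usage (replace the URL with the location of your pagerduty.cgi):
#   code=$(curl -s -o /dev/null -w '%{http_code}' https://nagios.example.com/cgi-bin/pagerduty.cgi)
#   explain_status "$code"
```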
Response is 403 Forbidden
The pagerduty.cgi script must be readable and executable by the web server process. If the process cannot read and execute the script file, it will in most cases respond to the request with a 403 status.
Response is 401 Unauthorized
The script, or the directory it is in, may require authorization (e.g., HTTP Basic Auth). Check with your system administrator to see if this is the case. If HTTP Basic Auth is used, retry your GET request with username:password@ prepended to the host name in the URL (i.e., immediately following https://).
Response is 500 Internal Server Error
This indicates that the script itself is exiting prematurely with a non-success status due to an uncaught exception. The following dependencies (i.e., Perl modules) must be installed for the script to run properly:
JSON
LWP::UserAgent
The Nagios Integration Guide outlines how to install these modules using native package management in CentOS and Ubuntu.
If you have verified that dependencies are met and you still receive a 500 status response, try running the script from the command line to see what error results in the output. There may be an issue with the Perl installation on the local machine, or a syntax error in the script caused by an accidental modification that resulted in invalid Perl syntax (e.g., missing a semicolon at the end of a line).
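Both checks can be run from a shell. The sketch below uses perl -M to test whether a module loads and, in the usage lines, perl -c to check syntax without executing the script. The function name missing_perl_modules is illustrative, and the script path is left for you to fill in.

```shell
# Print the name of every listed Perl module that fails to load.
# Prints nothing if all of them are available.
missing_perl_modules() {
    for mod in "$@"; do
        perl -M"$mod" -e 1 2>/dev/null || echo "$mod"
    done
}

# Usage:
#   missing_perl_modules JSON LWP::UserAgent
#   perl -c <path to pagerduty.cgi>    # syntax check only
```

Run these as the same user as the web server if possible, since module search paths can differ between users.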
CGI script executes, but no notes are added to the Nagios alert
The CGI script writes to an "external commands" file that Nagios reads. This is the step when the PagerDuty incident acknowledgment is translated from a webhook into an action that Nagios takes (i.e., adding a note to the alert that it has been acknowledged in PagerDuty).
There are a few issues that could prevent this process from happening properly:
- Permissions on the command file/directory where it resides.
- The command file's path is in a different location than what is configured in the default Nagios installation.
- Nagios might not be configured to execute external commands.
In the Nagios configuration (per the documentation on configuring external commands), the two directives check_external_commands and command_file are particularly helpful when troubleshooting the items above. The latter determines the path where the command file resides.
If you can verify that external commands are enabled in Nagios, per the check_external_commands option, and can obtain the path from the command_file option, you can then check it against the path that is hard-coded in the CGI script, on line 14:
'command_file' => '/var/lib/nagios3/rw/nagios.cmd', # External commands file
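The comparison can be done by extracting both values and checking that they match. This is a sketch, assuming the nagios.cfg directives use the usual name=value form and that the CGI script still contains a line shaped like the one above; the function names are illustrative.

```shell
# Print the value of a name=value directive from nagios.cfg.
# Arguments: <directive name> <nagios.cfg path>
nagios_setting() {
    sed -n "s/^[[:space:]]*$1[[:space:]]*=[[:space:]]*//p" "$2"
}

# Print the command file path hard-coded in the CGI script.
cgi_command_file() {
    sed -n "s/.*'command_file'[[:space:]]*=>[[:space:]]*'\([^']*\)'.*/\1/p" "$1"
}

# Usage (Debian/Ubuntu config path; supply your own CGI script path):
#   nagios_setting check_external_commands /etc/nagios3/nagios.cfg   # expect 1
#   nagios_setting command_file /etc/nagios3/nagios.cfg
#   cgi_command_file <path to pagerduty.cgi>
```

If the two paths differ, update the path in the CGI script to match the command_file value.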
File Permissions
There could also be an issue related to the command file's permissions, in which case you will need to check to see what user ID is running the script, and ensure it has write permission to the command file.