Nagios Troubleshooting Guide

Below are some sample errors that you may run into when integrating Nagios with PagerDuty, and steps for troubleshooting those errors. This article is split into three sections, depending on the type of configuration used to integrate PagerDuty and Nagios: the general configuration, the Perl-based integration, and the agent-based integration.

Potential Issues with the General Configuration

Nagios did not trigger a PagerDuty incident: possible causes

Your Nagios host or service is not reaching a HARD down state

Events are sent to PagerDuty once your service or host has a HARD State Type and the state changes. This happens once the max_check_attempts limit has been reached for the host/service.

In other words, an event would be sent to PagerDuty once your service or host goes up or down HARD. For more information, please see this Nagios documentation on states.

To verify whether this is happening, check your logs:

Debian/Ubuntu: /var/log/syslog

RHEL/CentOS: /var/log/messages

Run grep pagerduty <log path> to see notifications sent to PagerDuty.

This is an example of a SOFT down. This would not trigger an incident in PagerDuty:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE ALERT: localhost;Current Users;WARNING;SOFT;1;USERS WARNING - 2 users currently logged in

This is an example of a HARD down, which should trigger incidents in PagerDuty:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: pagerduty;localhost;Current Users;WARNING;notify-service-by-pagerduty;USERS WARNING - 3 users currently logged in
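When there are many log lines to review, the SOFT/HARD distinction can be checked with a small script. The following is a minimal sketch, not part of the PagerDuty integration: the function name alert_state_type is hypothetical, and it assumes the semicolon-separated ALERT layout seen in the sample line above (host;service;state;state type;attempt;output for services, with the service field absent for hosts).

```python
import re

# Matches "SERVICE ALERT: ..." and "HOST ALERT: ..." log entries.
ALERT_RE = re.compile(r"(SERVICE|HOST) ALERT: (.*)$")


def alert_state_type(line):
    """Return "SOFT" or "HARD" for an ALERT log line, or None for other lines."""
    match = ALERT_RE.search(line)
    if match is None:
        return None
    fields = match.group(2).split(";")
    # Assumed layouts:
    #   SERVICE ALERT: host;service;state;state_type;attempt;output
    #   HOST ALERT:    host;state;state_type;attempt;output
    index = 3 if match.group(1) == "SERVICE" else 2
    return fields[index] if len(fields) > index else None


soft_line = (
    "Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE ALERT: "
    "localhost;Current Users;WARNING;SOFT;1;"
    "USERS WARNING - 2 users currently logged in"
)
print(alert_state_type(soft_line))
```

Only alerts that report HARD are expected to be followed by a notification to PagerDuty.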

Your PagerDuty contact is not configured properly

The PagerDuty contact might not have been configured properly to receive notifications.

To check this, run grep NOTIFICATION <log path>.

If, as in the example below, "pagerduty" is not listed in your logs, check to make sure that the pagerduty contact is included in the contact group which is configured to receive notifications under the service or host template:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: root;localhost;Current Users;CRITICAL;notify-service-by-email;USERS CRITICAL - 5 users currently logged in
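The same check can be scripted. This is a rough sketch under the assumption that the contact name is the first semicolon-separated field after "NOTIFICATION:", as in the sample lines in this guide; the function name notified_contacts is made up for illustration.

```python
import re

# The contact name is assumed to be the first field after "NOTIFICATION: ".
NOTIFICATION_RE = re.compile(r"(?:SERVICE|HOST) NOTIFICATION: ([^;]+);")


def notified_contacts(log_lines):
    """Return the set of contact names found in NOTIFICATION log lines."""
    contacts = set()
    for line in log_lines:
        match = NOTIFICATION_RE.search(line)
        if match:
            contacts.add(match.group(1))
    return contacts


sample = [
    "Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: "
    "root;localhost;Current Users;CRITICAL;notify-service-by-email;"
    "USERS CRITICAL - 5 users currently logged in",
]
contacts = notified_contacts(sample)
if "pagerduty" not in contacts:
    print("pagerduty was never notified; contacts seen:", sorted(contacts))
```

If "pagerduty" never appears in the result for a real log file, the contact group configuration described below is the first thing to check.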

If you’re using the default configuration, open the following to confirm that the pagerduty contact is included in the correct contact group.

Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/contacts_nagios2.cfg
RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/contacts.cfg

If you’re using the default configuration, open the following to make sure that the pagerduty contact itself is defined properly.

Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/pagerduty_nagios.cfg
RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/pagerduty_nagios.cfg

If you’re using the default configuration, open the following to confirm that the host or service template being used is contacting the correct group.

Debian, Ubuntu, and other Debian-derived systems:
/etc/nagios3/conf.d/generic-service_nagios2.cfg
/etc/nagios3/conf.d/generic-host_nagios2.cfg
RHEL, Fedora, CentOS, and other Red Hat-derived systems:
/etc/nagios/objects/generic-service_nagios2.cfg
/etc/nagios/objects/generic-host_nagios2.cfg

If you do make any changes to the templates above, make sure to restart Nagios:

/etc/init.d/nagios3 restart
or
service nagios3 restart

[ERROR] NOTIFICATIONTYPE field must be present

The PagerDuty integration can only accept PROBLEM, ACKNOWLEDGEMENT, and RECOVERY notifications. If you see this error, a different type of event, such as FLAPPINGSTART or FLAPPINGSTOP, is being generated. These event types are not supported by the integration and are ignored.

Also, please note that sending a custom notification manually through the Nagios UI will not trigger an incident, as custom notifications are not supported by the integration.
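As an illustration of the filtering described above, the sketch below checks a notification type against the three accepted values. It is not the integration's own code: the names SUPPORTED_TYPES and is_supported are hypothetical, and the strings are assumed to be the values of the Nagios $NOTIFICATIONTYPE$ macro, which spells the acknowledgement type ACKNOWLEDGEMENT and reports manual notifications as CUSTOM.

```python
# Notification types the integration is described as accepting.
# The spellings follow the Nagios $NOTIFICATIONTYPE$ macro (assumption).
SUPPORTED_TYPES = {"PROBLEM", "ACKNOWLEDGEMENT", "RECOVERY"}


def is_supported(notification_type):
    """Return True if this notification type would be accepted."""
    return notification_type in SUPPORTED_TYPES


for kind in ("PROBLEM", "RECOVERY", "FLAPPINGSTART", "CUSTOM"):
    status = "accepted" if is_supported(kind) else "ignored"
    print(kind, status)
```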

If you are using the Perl integration and would like to receive FLAPPINGSTART and FLAPPINGSTOP events, you can add this script to your integration.

Potential Issues with Perl-based Integration

Note

Use the Perl integration if you are using CentOS 5 or lower.

Trigger a test incident to make sure that the Perl script will run

Run the commands below as the user that runs Nagios (typically the "nagios" user). Either log in as that user, or prefix each command with sudo -u nagios.

Manually trigger a Nagios incident with the Perl script to make sure it will run.

[ERROR] Nagios event in file /tmp/pagerduty_nagios/pd_12334543223_1235.txt DEFERRED due to network/server problems.

Is the server behind a proxy? If so, it needs to be specified when executing the Perl script. Add the following switch to the Nagios command that calls the script, as well as your cron job:

--proxy https://my.proxy.com:<port>

Also, verify that the Perl libraries for SSL are installed (typically step 1 of the integration guide).

For Debian-based systems (e.g. Ubuntu):
aptitude install libwww-perl libcrypt-ssleay-perl

For RHEL-based systems (e.g. CentOS, Fedora):
yum install perl-libwww-perl perl-Crypt-SSLeay

Then, run:
sudo -u nagios <path to perl script> flush --verbose

If you get a 500 response of "Can't verify SSL peers without knowing which Certificate Authorities to trust", install the Mozilla::CA module by running the following command:

cpanm Mozilla::CA

[ERROR] May 16 07:12:46 sw-cloud pagerduty_nagios[32356]: open /tmp/pagerduty_nagios/pd_1337123566_32999.txt for write failed: Illegal seek

This error means that the user running Nagios does not have write permission on the /tmp/pagerduty_nagios/ directory. The easiest fix is to delete the directory. Note that this removes all queued alerts:

rm -rf /tmp/pagerduty_nagios
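Before deleting anything, the permission problem can be confirmed with a quick check. This is a minimal sketch: the function name can_write_queue is hypothetical, and the result is only meaningful when the script runs as the same user that runs Nagios (for example via sudo -u nagios).

```python
import os


def can_write_queue(queue_dir="/tmp/pagerduty_nagios"):
    """Return True if the current user may create files inside queue_dir.

    Creating a file needs both write and search (execute) permission on
    the directory. A directory that does not exist yields False.
    """
    return os.access(queue_dir, os.W_OK | os.X_OK)


if can_write_queue():
    print("queue directory is writable for this user")
else:
    print("queue directory is missing or not writable for this user")
```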

[ERROR] File was rejected because could not find CONTACTPAGER

If you see this error, you will need to enable environment variables by setting enable_environment_variables=1 in your nagios.cfg file:

Debian, Ubuntu, and other Debian-derived systems:
/etc/nagios3/nagios.cfg

RHEL, Fedora, CentOS, and other Red Hat-derived systems:
/etc/nagios/nagios.cfg
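To confirm the setting without reading the whole file, the contents of nagios.cfg can be scanned for the directive. The sketch below is an illustration only: the function name environment_variables_enabled is made up, lines starting with # are treated as comments, and it assumes that the last occurrence of the directive is the one that takes effect.

```python
def environment_variables_enabled(cfg_text):
    """Return True if enable_environment_variables is set to 1 in cfg_text."""
    enabled = False
    for raw_line in cfg_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "enable_environment_variables":
            # Assumption: a later occurrence overrides an earlier one.
            enabled = value.strip() == "1"
    return enabled


example_cfg = """
# sample fragment of a nagios.cfg file
log_file=/var/log/nagios3/nagios.log
enable_environment_variables=0
"""
print(environment_variables_enabled(example_cfg))
```

In practice the text would be read from the nagios.cfg path for your distribution, listed above.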

Potential Issues with Agent-based Integration

Below are some issues that may arise with an agent-based integration while using the PagerDuty agent.

Trigger a test incident to make sure that the agent works

Manually trigger a Nagios incident with the pd-nagios command to make sure the agent is working.

Replace "YOUR-INTEGRATION-KEY-HERE" with your actual integration key in the commands below:
sudo -u nagios /usr/share/pdagent-integrations/bin/pd-nagios -n service -k YOUR-INTEGRATION-KEY-HERE -t "PROBLEM" -f SERVICEDESC="test_description" -f SERVICESTATE="CRITICAL" -f HOSTNAME="test_host_name" -f SERVICEOUTPUT="test_service_output"

Alternatively, you can use the pd-send command to trigger an incident.

Here is an example event to trigger an incident using pd-send:

~$ export PD_INTEGRATION_KEY=YOUR-INTEGRATION-KEY-HERE
~$ pd-send -k "$PD_INTEGRATION_KEY" -t trigger -d "Server is on fire" -i server.fire
Event processed. Incident Key: server.fire

[ERROR] Error Performing CheckSum

This is an installation error on CentOS 5 and below. Only CentOS 6 and above are supported by the agent. If you are running CentOS 5 or below, follow the Perl-based integration guide instead.

Agent is not running

Check to make sure that the PD agent is running. To do this, run the following command:
service pdagent status

If the status is "not running", then start the PD agent:
service pdagent start

[LOGS] wproc: stderr line 05: details = parse_fields(args.fields)

If you see something similar to the following in your logs, update the agent to the latest version.

[1417765072] wproc: stderr line 01: Traceback (most recent call last):
[1417765072] wproc: stderr line 02: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 188, in <module> 
[1417765072] wproc: stderr line 03: main() 
[1417765072] wproc: stderr line 04: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 117, in main 
[1417765072] wproc: stderr line 05: details = parse_fields(args.fields) 
[1417765072] wproc: stderr line 06: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 177, in parse_fields 
[1417765072] wproc: stderr line 07: return dict(f.split("=", 2) for f in fields)
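The traceback above is cut off before the exception itself, so the exact failure is an inference. One plausible cause is visible in the last line: split("=", 2) can return three pieces, and dict() only accepts key/value pairs. The sketch below reproduces that behavior in isolation. Both function names are hypothetical, and parse_fields_fixed shows one possible correction, not necessarily what newer agent versions do.

```python
def parse_fields_buggy(fields):
    # Same expression as in the traceback: maxsplit=2 allows three pieces.
    return dict(f.split("=", 2) for f in fields)


def parse_fields_fixed(fields):
    # maxsplit=1 splits only on the first "=", so the value keeps the rest.
    return dict(f.split("=", 1) for f in fields)


# A field whose value itself contains "=" characters.
fields = ["SERVICEOUTPUT=load=5 threshold=4"]

try:
    parse_fields_buggy(fields)
except ValueError as exc:
    print("buggy version failed:", exc)

print(parse_fields_fixed(fields))
```

If this is indeed the cause, the error would appear only for checks whose output contains more than one "=" in a single -f field, which would explain why it shows up intermittently.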
