Nagios Troubleshooting Guide

This guide addresses common issues with the Nagios integration. Some errors are specific to your integration type, so the sections below are grouped by integration type.

General Configuration

If Nagios notifications are not triggering PagerDuty incidents as you expect, the following items are common causes, regardless of your installation type.

Your Nagios host or service may not be reaching a HARD down state

Events are only sent to PagerDuty when your service or host changes state to HARD. Typically, a host or service will first enter a SOFT state, and only transition to HARD after it reaches its max_check_attempts limit.

For more information, please see Nagios’ State Types documentation.

To verify whether this is happening:

  1. Check your logs.
    • Debian/Ubuntu: /var/log/syslog
    • RHEL/CentOS: /var/log/messages
  2. Run grep pagerduty <log path> to see notifications sent to PagerDuty.

This is an example of a SOFT down, which would not trigger an incident in PagerDuty:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE ALERT: localhost;Current Users;WARNING;SOFT;1;USERS WARNING - 2 users currently logged in

This is an example of a notification sent to the pagerduty contact after a HARD state change, which should trigger an incident in PagerDuty:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: pagerduty;localhost;Current Users;WARNING;notify-service-by-pagerduty;USERS WARNING - 3 users currently logged in
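When the log is large, it can help to label each state change as SOFT or HARD instead of reading lines one by one. The following is a minimal sketch, assuming your log lines follow the SERVICE ALERT / HOST ALERT layout shown above. The function name classify_alerts is hypothetical and is not part of Nagios or the PagerDuty integration.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios or PagerDuty): reads Nagios log
# lines on stdin and prints the state type (SOFT or HARD), a tab, and the
# original line for every SERVICE ALERT / HOST ALERT entry.
classify_alerts() {
    awk '
        / (SERVICE|HOST) ALERT: / {
            if ($0 ~ /;SOFT;/)      print "SOFT\t" $0
            else if ($0 ~ /;HARD;/) print "HARD\t" $0
        }
    '
}

# Example usage (Debian/Ubuntu log path from the list above):
#   classify_alerts < /var/log/syslog | grep '^HARD'
```

If only SOFT lines appear for a host or service, it has not yet reached its max_check_attempts limit, so no event is expected in PagerDuty.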

Confirm that your PagerDuty contact is configured properly

The pagerduty contact might not have been configured to receive notifications properly.

To check this, run grep NOTIFICATION <log path>.

If "pagerduty" does not appear in the notification lines, as in the example below, check that the pagerduty contact is a member of the contact group that the service or host template is configured to notify:

Nov 13 22:34:30 ip-10-182-165-131 nagios3: SERVICE NOTIFICATION: root;localhost;Current Users;CRITICAL;notify-service-by-email;USERS CRITICAL - 5 users currently logged in
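To see at a glance which contacts Nagios has actually notified, the contact name can be pulled out of each NOTIFICATION line. This is a sketch under the assumption that the contact is the first semicolon-separated field after "NOTIFICATION: ", as in the example above. Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads Nagios log lines on stdin and prints the
# distinct contacts that received a SERVICE or HOST NOTIFICATION.
notified_contacts() {
    awk '
        / (SERVICE|HOST) NOTIFICATION: / {
            line = $0
            sub(/^.* NOTIFICATION: /, "", line)  # drop everything before the contact
            sub(/;.*$/, "", line)                # keep only the first field
            print line
        }
    ' | sort -u
}

# Hypothetical helper: succeeds if the pagerduty contact was notified.
pagerduty_was_notified() {
    notified_contacts | grep -qx 'pagerduty'
}

# Example usage:
#   notified_contacts < /var/log/syslog
#   pagerduty_was_notified < /var/log/syslog || echo "pagerduty never notified"
```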

📘

Nagios XI vs. Nagios Core file paths

If you’re using Nagios XI, paths will differ from Nagios Core. Furthermore, configuration is managed primarily through the Nagios XI web interface, as opposed to Nagios Core’s configuration files. Please refer to the Nagios XI Integration Guide for further details.

If you’re using the default configuration, open the file containing the pagerduty contact to confirm it is included in the correct contact group:

  • Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/contacts_nagios2.cfg
  • RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/contacts.cfg
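Group membership can also be checked from the command line. The sketch below assumes the group is declared in a standard "define contactgroup{ ... }" block with a "members" directive. It does not detect membership granted through a "contactgroups" directive inside the contact definition itself. The function name and the group name "admins" are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) if CONTACT is listed in the
# "members" directive of contact group GROUP in a Nagios object file.
# Usage: contactgroup_has_member FILE GROUP CONTACT
contactgroup_has_member() {
    awk -v group="$2" -v contact="$3" '
        /^[ \t]*define[ \t]+contactgroup/ { in_block = 1; name = ""; members = ""; next }
        in_block && $1 == "contactgroup_name" { name = $2 }
        in_block && $1 == "members" {
            members = $0
            sub(/^[ \t]*members[ \t]+/, "", members)  # drop the directive name
            sub(/;.*$/, "", members)                  # drop a trailing comment
            gsub(/[ \t]/, "", members)                # drop stray whitespace
        }
        in_block && /}/ {
            n = split(members, m, ",")
            for (i = 1; i <= n; i++)
                if (name == group && m[i] == contact) found = 1
            in_block = 0
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Example usage ("admins" is an assumed group name):
#   contactgroup_has_member /etc/nagios3/conf.d/contacts_nagios2.cfg admins pagerduty \
#       && echo "pagerduty is in admins"
```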

If you’re using the default configuration, open the following to make sure that the pagerduty contact is defined properly.

  • Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/pagerduty_nagios.cfg
  • RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/pagerduty_nagios.cfg

If you’re using the default configuration, open the following to confirm that the host or service template being used is contacting the correct group.

  • Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/conf.d/generic-service_nagios2.cfg, /etc/nagios3/conf.d/generic-host_nagios2.cfg
  • RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/objects/generic-service_nagios2.cfg, /etc/nagios/objects/generic-host_nagios2.cfg

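If you make any changes to the files above, make sure to restart Nagios. The commands below use the Debian service name nagios3; on Red Hat-derived systems the service is typically named nagios: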

/etc/init.d/nagios3 restart
or
service nagios3 restart
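Before restarting, it is usually worth letting Nagios verify the edited configuration, since a syntax error can prevent it from starting again. The wrapper below is a sketch, not part of the integration: it relies on the Nagios binary's -v (verify configuration) option, and the function name and default paths are assumptions based on the Debian layout used in this guide.

```shell
#!/bin/sh
# Hypothetical wrapper: verify the configuration, then restart only if it
# is valid. Defaults follow the Debian paths and service name used above.
# Usage: restart_nagios_if_valid [NAGIOS_BINARY] [MAIN_CONFIG] [SERVICE_NAME]
restart_nagios_if_valid() {
    nagios_bin=${1:-nagios3}
    main_cfg=${2:-/etc/nagios3/nagios.cfg}
    service_name=${3:-nagios3}

    # "-v" asks Nagios to check the configuration and exit without starting.
    if "$nagios_bin" -v "$main_cfg" >/dev/null; then
        service "$service_name" restart
    else
        echo "Configuration check failed; not restarting $service_name" >&2
        return 1
    fi
}

# Example usage on a Red Hat-derived system (assumed names):
#   restart_nagios_if_valid nagios /etc/nagios/nagios.cfg nagios
```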

[ERROR] NOTIFICATIONTYPE field must be present

The PagerDuty integration only accepts PROBLEM, ACKNOWLEDGEMENT, and RECOVERY notifications. Other event types, such as FLAPPINGSTART and FLAPPINGSTOP, are not supported and will result in a NOTIFICATIONTYPE error.

Please also note that sending a custom notification manually through the Nagios UI will not trigger an incident, as the integration does not support custom notifications.

If you are using the agentless integration and would like to receive FLAPPINGSTART and FLAPPINGSTOP events, you can update the enqueue_event subroutine in the pagerduty_nagios.pl script (below line 235):

if ($event{"NOTIFICATIONTYPE"} eq "FLAPPINGSTART") {
    $event{"NOTIFICATIONTYPE"} = "PROBLEM";
}
if ($event{"NOTIFICATIONTYPE"} eq "FLAPPINGSTOP") {
    $event{"NOTIFICATIONTYPE"} = "RECOVERY";
}

Make sure that you have enabled flapping notifications (the f option) in your pagerduty_nagios.cfg file under the service_notification_options and/or host_notification_options fields.

Perl-Based Integration

📘

Tip

Use the Perl integration if you are using CentOS 5 or lower.

Trigger a test incident to make sure that the Perl script will run

Manually trigger a Nagios incident with the Perl script to make sure it will run. Run the script as the user that runs Nagios (typically the "nagios" user): either log in as that user, or prefix your commands with sudo -u nagios.

[ERROR] Nagios event in file /tmp/pagerduty_nagios/pd_12334543223_1235.txt DEFERRED due to network/server problems.

Is the server behind a proxy? If so, the proxy must be specified when executing the Perl script. Add the following switch to the Nagios command that calls the script, as well as to your cron job:

--proxy https://my.proxy.com:<port>

Also, verify that the Perl libraries for SSL are installed (typically step 1 of the integration guide).

  • For Debian-based systems (e.g., Ubuntu): aptitude install libwww-perl libcrypt-ssleay-perl
  • For RHEL-based systems (e.g., CentOS, Fedora): yum install perl-libwww-perl perl-Crypt-SSLeay

Then, run:
sudo -u nagios <path to perl script> flush --verbose

If you get a 500 response of "Can't verify SSL peers without knowing which Certificate Authorities to trust", install the Mozilla::CA module by running the following command:

cpanm Mozilla::CA

[ERROR] May 16 07:12:46 sw-cloud pagerduty_nagios[32356]: open /tmp/pagerduty_nagios/pd_1337123566_32999.txt for write failed: Illegal seek

This error means that the user running Nagios does not have write permission on the /tmp/pagerduty_nagios/ directory. The simplest fix is to delete the directory. Note that this also removes any queued alerts:

rm -rf /tmp/pagerduty_nagios
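To confirm the diagnosis before or after deleting the directory, you can test whether a given user can write to it. The helper below is a sketch with a hypothetical name. It reports on the user who runs it, so it needs to be executed as the Nagios user to be meaningful. It treats a missing directory as a separate case, on the assumption that the integration script recreates the directory when needed, which is what the fix above relies on.

```shell
#!/bin/sh
# Hypothetical helper: report whether the current user can create files in
# the queue directory. Prints "missing", "writable", or "not-writable".
# Usage: queue_dir_status [DIRECTORY]
queue_dir_status() {
    queue_dir=${1:-/tmp/pagerduty_nagios}
    if [ ! -e "$queue_dir" ]; then
        echo "missing"
    elif [ -d "$queue_dir" ] && [ -w "$queue_dir" ] && [ -x "$queue_dir" ]; then
        echo "writable"
    else
        echo "not-writable"
    fi
}

# Example: inspect ownership and mode directly.
#   ls -ld /tmp/pagerduty_nagios
```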

[ERROR] File was rejected because could not find CONTACTPAGER

If you see this error, you will need to enable environment macros by setting enable_environment_macros=1 in your nagios.cfg file:

  • Debian, Ubuntu, and other Debian-derived systems: /etc/nagios3/nagios.cfg
  • RHEL, Fedora, CentOS, and other Red Hat-derived systems: /etc/nagios/nagios.cfg
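The current value of the setting can be read straight from the file. This is a minimal sketch with a hypothetical function name. It ignores commented-out lines and, as an assumption, takes the last assignment if the directive appears more than once.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the last uncommented
# enable_environment_macros assignment in the given nagios.cfg is 1.
# Usage: env_macros_enabled /etc/nagios3/nagios.cfg
env_macros_enabled() {
    macros_value=$(sed -n 's/^[[:space:]]*enable_environment_macros[[:space:]]*=[[:space:]]*\([0-9]*\).*$/\1/p' "$1" | tail -n 1)
    [ "$macros_value" = "1" ]
}

# Example usage:
#   env_macros_enabled /etc/nagios3/nagios.cfg || echo "environment macros are disabled"
```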

Agent-Based Integration

Below are some issues that may arise with an agent-based integration while using the PagerDuty Agent.

Trigger a test incident to make sure that the agent works

Manually trigger a Nagios incident with the pd-nagios command to make sure the agent is working.

Replace YOUR-INTEGRATION-KEY-HERE with your actual integration key in the commands below:
sudo -u nagios /usr/share/pdagent-integrations/bin/pd-nagios -n service -k YOUR-INTEGRATION-KEY-HERE -t "PROBLEM" -f SERVICEDESC="test_description" -f SERVICESTATE="CRITICAL" -f HOSTNAME="test_host_name" -f SERVICEOUTPUT="test_service_output"

Alternatively, you can use the pd-send command to trigger an incident.

Here is an example event to trigger an incident using pd-send:

~$ export PD_INTEGRATION_KEY=YOUR-INTEGRATION-KEY-HERE
~$ pd-send -k "$PD_INTEGRATION_KEY" -t trigger -d "Server is on fire" -i server.fire
Event processed. Incident Key: server.fire

[ERROR] Error Performing CheckSum

This is an installation error on CentOS 5 and below. Only CentOS 6 and above are supported by the agent. If you are running CentOS 5 or below, please use the Agentless Nagios Integration Guide.

Agent is not running

Check to make sure that the PD agent is running. To do this, run the following command:
service pdagent status

If the status is "not running", then start the PD agent:
service pdagent start

Outdated agent version

If you see something similar to the following in your logs, then you will need to update to the latest version of the agent:

[1417765072] wproc: stderr line 01: Traceback (most recent call last):
[1417765072] wproc: stderr line 02: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 188, in <module> 
[1417765072] wproc: stderr line 03: main() 
[1417765072] wproc: stderr line 04: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 117, in main 
[1417765072] wproc: stderr line 05: details = parse_fields(args.fields) 
[1417765072] wproc: stderr line 06: File "/usr/share/pdagent-integrations/bin/pd-nagios", line 177, in parse_fields 
[1417765072] wproc: stderr line 07: return dict(f.split("=", 2) for f in fields)

Bi-Directional Integration

The bidirectional integration uses a CGI script to capture webhooks and turn them into commands that Nagios runs in order to add the acknowledgment note.

You may wish to capture an incident acknowledgment webhook for iterative testing and log-checking, for example, by sending it manually via curl or Postman. You can do this by creating an extension on your PagerDuty service and pointing it to a temporary hookbin.com URL to capture the JSON body, and acknowledging an incident that was raised from Nagios.

Once you have the JSON content of the webhook, you will be able to send the same webhook after each change and troubleshooting step attempted, without having to repeat the full process of raising an alert in Nagios and acknowledging it in PagerDuty. This allows more rapid testing and diagnosis of the CGI script that processes webhooks from PagerDuty.
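Replaying the captured webhook can be scripted with curl. The following is a sketch under stated assumptions: the function name is hypothetical, the payload is the JSON body you saved to a local file, and the URL in the usage comment is a placeholder for wherever your pagerduty.cgi is served.

```shell
#!/bin/sh
# Hypothetical helper: POST a saved webhook body to the CGI script and
# print the HTTP status code of the response.
# Usage: replay_webhook PAYLOAD_FILE CGI_URL
replay_webhook() {
    payload=$1
    cgi_url=$2
    if [ ! -r "$payload" ]; then
        echo "cannot read payload file: $payload" >&2
        return 2
    fi
    # --data-binary sends the file unmodified and makes the request a POST.
    curl --silent --show-error --output /dev/null \
         --write-out '%{http_code}\n' \
         --header 'Content-Type: application/json' \
         --data-binary "@$payload" \
         "$cgi_url"
}

# Example usage (placeholder file name and URL):
#   replay_webhook ack-webhook.json https://nagios.example.com/cgi-bin/pagerduty.cgi
```

Printing only the status code keeps each test run to a single line, which makes it easy to compare results between troubleshooting steps.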

CGI script cannot execute

Once you have put the script in place, try opening it in an HTTP client, for example, curl or a web browser, with a GET request. You should receive a 400 error along with the message:

400 Requests must be POSTs

Response is 403 Forbidden

The pagerduty.cgi script must be readable and executable by the web server process. If the process cannot read and execute the script file, it will in most cases respond to the request with a 403 status.

Response is 401 Unauthorized

The script, or the directory it is in, may require authorization, e.g., HTTP Basic Auth. Check with the system administrator to see if this is the case. If HTTP Basic Auth is used, retry your GET request with username:password@ prepended to the host name in the URL (immediately following http(s)://).

Response is 500 Internal Server Error

This indicates that the script itself is exiting prematurely with a non-success status due to some uncaught exception. The following dependencies (Perl modules) must be installed for the script to run properly:

  • JSON
  • LWP::UserAgent

The Nagios Integration Guide outlines how to install these modules using native package management in CentOS and Ubuntu.

If you have verified that dependencies are met and you still receive a 500 status response, try running the script from the command line to see what error results in the output. There may be an issue with the Perl installation on the local machine, or a syntax error in the script caused by an accidental modification (e.g., a missing semicolon at the end of a line).
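The status codes discussed above can be condensed into a small lookup for scripted checks. This is an illustrative sketch only. The function name is hypothetical, and the hints simply restate this section rather than anything the CGI script itself reports.

```shell
#!/bin/sh
# Hypothetical helper: map the HTTP status returned by a GET request to
# pagerduty.cgi onto the troubleshooting hints in this section.
# Usage: explain_cgi_status STATUS_CODE
explain_cgi_status() {
    case "$1" in
        400) echo "script executes (a GET request is expected to be rejected with 400)" ;;
        401) echo "authorization required: retry the request with credentials" ;;
        403) echo "web server cannot read or execute the script: check file permissions" ;;
        500) echo "script exited with an error: check the JSON and LWP::UserAgent modules" ;;
        *)   echo "status $1 is not covered by this guide" ;;
    esac
}

# Example usage (CGI_URL is a placeholder you set yourself):
#   explain_cgi_status "$(curl --silent --output /dev/null --write-out '%{http_code}' "$CGI_URL")"
```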

CGI script executes, but no notes are being added to the Nagios alert

The CGI script writes to an "external commands" file that is read by Nagios. This is the step when the PagerDuty incident acknowledgment is translated from a webhook into an action taken by Nagios (adding a note to the alert that it has been acknowledged in PagerDuty).

There are a few things that could prevent this process from happening properly:

  • Permissions on the command file or on the directory where it resides.
  • The command file is at a different path than in the default Nagios installation, which is the path the script uses by default.
  • Nagios might not be configured to execute external commands.

In the Nagios configuration specification (per the documentation on configuring external commands), the two directives check_external_commands and command_file are of particular interest for troubleshooting the above. The latter dictates the path at which the command file resides.

If you can verify that external commands are enabled in Nagios, per the check_external_commands option, and can obtain the path from the command_file option, you can then check that against the path that is hard-coded in the CGI script, on line 14:

'command_file' => '/var/lib/nagios3/rw/nagios.cmd', # External commands file

Lastly, it could be an issue related to the command file's permissions, in which case you will need to check to see what user ID is running the script, and ensure it has write permission to the command file.
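The path comparison described above can be automated. The sketch below assumes nagios.cfg uses plain name=value lines and that the CGI script contains a 'command_file' => '...' assignment like the one shown above. All three function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers for comparing the Nagios command_file setting with
# the path hard-coded in the CGI script.

# Print the value of DIRECTIVE from a nagios.cfg-style "name=value" file.
# Usage: nagios_cfg_value FILE DIRECTIVE
nagios_cfg_value() {
    sed -n "s/^[[:space:]]*$2[[:space:]]*=[[:space:]]*//p" "$1" |
        sed 's/[[:space:]]*$//' | tail -n 1
}

# Print the path assigned to 'command_file' in the CGI script.
# Usage: cgi_command_file FILE
cgi_command_file() {
    sed -n "s/.*'command_file'[[:space:]]*=>[[:space:]]*'\([^']*\)'.*/\1/p" "$1" | head -n 1
}

# Succeeds when external commands are enabled and both paths agree.
# Usage: command_file_matches NAGIOS_CFG CGI_SCRIPT
command_file_matches() {
    [ "$(nagios_cfg_value "$1" check_external_commands)" = "1" ] || return 1
    nagios_path=$(nagios_cfg_value "$1" command_file)
    [ -n "$nagios_path" ] && [ "$nagios_path" = "$(cgi_command_file "$2")" ]
}

# Example usage (the CGI script location is a placeholder):
#   command_file_matches /etc/nagios3/nagios.cfg /path/to/pagerduty.cgi || echo "mismatch"
```

Once the paths agree, running ls -l on the command file shows its owner, group, and mode, which you can compare with the user ID that runs the CGI script.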