PagerDuty Agent Troubleshooting Guide

PagerDuty Agent is an open-source utility for integrating standalone / on-premise monitoring tools with PagerDuty.

First Steps

If you have trouble sending events to PagerDuty using the agent, run the following checks:

  • Verify that the pdagent service is running. If it is not, start it with sudo systemctl start pdagent (on systemd-based systems) or sudo service pdagent start (on SysV init-based systems). Any events that were enqueued while the service was stopped will be sent when the service starts.
  • Check the service integration key: make sure that the key is for the correct service and that it has been entered correctly.
  • Send a test event using pd-send as shown in Sending an Event to PagerDuty. Remember to replace the key used in the example with your own PagerDuty Service Integration Key.
  • Check the logs for any errors, including events that couldn't be submitted to the Events API: /var/log/pdagent/pdagentd.log

Outqueue Troubleshooting

In addition to using pd-queue to manage the queue, the agent's outqueue directories will often hold a wealth of information that is useful for troubleshooting. The queue may also be manually cleared or altered by working directly with the files.

Each file in these directories contains the body of an HTTP POST request that will be made to the PagerDuty Events API, so you can see what your monitoring solutions are submitting to the agent by examining these files. The filenames follow this naming convention:

{UNIX timestamp}_{integration key}.txt
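When scanning a directory full of event files, it can help to turn the filename back into a readable time and key. The following is a minimal sketch that assumes the convention above holds exactly and that the timestamp is in whole seconds; the function name parse_outqueue_filename is hypothetical and not part of the agent.

```python
from datetime import datetime, timezone


def parse_outqueue_filename(filename):
    """Split '{UNIX timestamp}_{integration key}.txt' into (datetime, key).

    Assumption: the timestamp is in whole seconds since the epoch.
    """
    stem, dot, extension = filename.rpartition(".")
    if dot != "." or extension != "txt":
        raise ValueError("not an outqueue event file: %r" % filename)
    # Split on the first underscore only; the rest is treated as the key.
    timestamp, sep, integration_key = stem.partition("_")
    if not sep or not timestamp.isdigit():
        raise ValueError("unexpected filename layout: %r" % filename)
    enqueued_at = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return enqueued_at, integration_key
```

For example, parse_outqueue_filename("86400_abc123.txt") returns midnight UTC on 2 January 1970 together with the key "abc123". If your files do not match the layout, the function raises ValueError rather than guessing.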

The Outqueue Directories

The queue directories, located in /var/lib/pdagent/outqueue, include:

  • err: Events for which delivery to the PagerDuty Events API was attempted but failed with an HTTP 400 (invalid request) response from PagerDuty. Malformed events (e.g. those with no integration key) that cannot be submitted are kept here.
  • suc: Successfully submitted events. If you see an event file cached here, it was sent to PagerDuty. If the event did not trigger an incident in PagerDuty and you cannot determine why, contact PagerDuty Support.
  • tmp: Event files submitted to the agent by external processes, such as integration scripts or pd-send, are stored here.
  • pdq: The queue of events to be sent; contains symbolic links to the files in tmp.
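A quick way to see where events are piling up is to count the entries in each of the four directories. The sketch below assumes the default location given above; count_outqueue_entries is a hypothetical helper, and it will usually need to run as a user that can read the agent's files.

```python
import os

# Directory names as listed in this guide.
OUTQUEUE_SUBDIRS = ("err", "suc", "tmp", "pdq")


def count_outqueue_entries(outqueue_dir="/var/lib/pdagent/outqueue"):
    """Return {subdirectory: entry count}, or None where it cannot be read."""
    counts = {}
    for name in OUTQUEUE_SUBDIRS:
        path = os.path.join(outqueue_dir, name)
        try:
            counts[name] = len(os.listdir(path))
        except OSError:
            # Missing directory or insufficient permissions.
            counts[name] = None
    return counts
```

A growing count in pdq suggests events are not being delivered, while entries in err point at events that PagerDuty rejected.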

Event Expiration and Queue Cleaning

Events are deleted from the outqueue directories at a regular interval once their age exceeds a predefined number of seconds. The default age threshold is one week, and the default cleaning interval is three hours. Both values are defined in the main PagerDuty Agent configuration file, which should be located at /etc/pdagent.conf:

  • cleanup_interval_secs: The interval, in seconds, between queue cleanings.
  • cleanup_threshold_secs: The age, in seconds, beyond which events are deleted.
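To estimate which files the next cleaning pass is likely to remove, you can compare file ages against the threshold yourself. This is only an approximation: it assumes file modification time is a reasonable stand-in for event age, which may not be how the agent measures it, and expired_files is a hypothetical helper. The constants are the defaults stated above, converted to seconds.

```python
import os
import time

CLEANUP_THRESHOLD_SECS = 7 * 24 * 60 * 60  # one week = 604800 seconds
CLEANUP_INTERVAL_SECS = 3 * 60 * 60        # three hours = 10800 seconds


def expired_files(directory, threshold_secs=CLEANUP_THRESHOLD_SECS, now=None):
    """List paths in `directory` whose age exceeds `threshold_secs`.

    Assumption: age is judged by modification time (lstat, so symbolic
    links are measured themselves rather than their targets).
    """
    now = time.time() if now is None else now
    expired = []
    for entry in sorted(os.listdir(directory)):
        path = os.path.join(directory, entry)
        age = now - os.lstat(path).st_mtime
        if age > threshold_secs:
            expired.append(path)
    return expired
```

If you have changed cleanup_threshold_secs in the configuration file, pass that value as threshold_secs instead of relying on the default.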

Manually Purging the Queue

If a long network outage causes a large backlog of pending events to build up in the agent, you can clear the queue to avoid a storm of incidents in PagerDuty when the network connection is restored. Events removed this way are discarded and will not be sent.

To clear the queue, run the following commands:

sudo rm -vf /var/lib/pdagent/outqueue/pdq/*
sudo rm -vf /var/lib/pdagent/outqueue/tmp/*

Network Troubleshooting

The host that is running PagerDuty Agent must be able to make an HTTPS connection to events.pagerduty.com. If the LAN has no DNS service, there is no route to the internet, or the local ACL (if any) does not permit connecting to remote hosts across the internet, PagerDuty Agent will not be able to send data to PagerDuty.

First, verify that you can resolve the Events API:

dig events.pagerduty.com

If you get a SERVFAIL or NXDOMAIN status response, the issue is with DNS.

Next, test reachability. Our events API hosts should respond to ICMP, so you can use the ping program as follows:

ping -q -c 3 events.pagerduty.com

If you see packet loss above 0%, or the program doesn't return after about five seconds, there is an issue with your network configuration, such as routing.

Next, check to see if you can make a TCP connection to the Events API with netcat:

nc -G 5 -w 1 -v events.pagerduty.com 443

You should see the following message:

Connection to events.pagerduty.com port 443 [tcp/https] succeeded!

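The above command will exit one second after establishing a connection (-w 1) and will give up and report a timeout after five seconds of waiting to establish one (-G 5). The -G option as used here is the connection-timeout flag of the netcat shipped with macOS; other netcat variants may not accept it or may give it a different meaning, so check man nc on your system. If the command prints an error indicating a timeout, a refused connection, or a similar failure, check your network's firewall/ACL and refer to Whitelisting IPs.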
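If dig or nc is not installed on the host, the same DNS and TCP checks can be approximated with Python's standard library. This is a sketch, not a replacement for the tools above; check_tcp is a hypothetical helper, and the five-second timeout simply mirrors the value used with nc.

```python
import socket


def check_tcp(host, port=443, timeout=5.0):
    """Resolve `host` and try a TCP connection; return a short status string."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except socket.gaierror as exc:
        # Name resolution failed: look at DNS first.
        return "dns-failure: %s" % exc
    except socket.timeout:
        # No answer in time: often a firewall silently dropping packets.
        return "timeout"
    except OSError as exc:
        # Refused, unreachable, and similar errors.
        return "connect-failure: %s" % exc
```

Calling check_tcp("events.pagerduty.com") from a Python prompt on the affected host separates DNS problems from firewall or routing problems in one step.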

Next, check to see if a TLS connection can be established:

openssl s_client -host events.pagerduty.com -port 443

You should see messages indicating a successful connection / TLS handshake, including the following certificate chain:

Certificate chain
 0 s:/OU=GT12858685/OU=See www.rapidssl.com/resources/cps (c)14/OU=Domain Control Validated - RapidSSL(R)/CN=*.pagerduty.com
   i:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
 1 s:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
   i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
 2 s:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
   i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA

If you get a TLS handshake error, it may be the case that the GeoTrust root CA certificate isn't trusted on your system. If that is the case, you can obtain the certificate here:

https://www.geotrust.com/resources/root-certificates/

In most Linux distributions, the ca-certificates package is used to manage trusted certificates.

Supported Protocols

As of this writing, PagerDuty supports only TLSv1.1 and later. Older protocols (e.g. SSLv3) are widely considered insecure and are not supported.
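To see which protocol version your host actually negotiates, and whether the certificate chain validates against the system trust store, a short probe with Python's ssl module can complement openssl s_client. This is a sketch under the assumption that a reasonably recent Python 3 is available on the host; tls_probe is a hypothetical helper.

```python
import socket
import ssl


def tls_probe(host, port=443, timeout=5.0):
    """Attempt a verified TLS handshake and report the negotiated protocol."""
    # The default context verifies the certificate chain and the hostname.
    context = ssl.create_default_context()
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return {"ok": True, "protocol": tls.version(), "error": None}
    except (OSError, ssl.CertificateError) as exc:
        # Covers connection errors, handshake failures and untrusted chains.
        message = "%s: %s" % (type(exc).__name__, exc)
        return {"ok": False, "protocol": None, "error": message}
```

A certificate verification error in the result suggests a trust store problem on the host, whereas a connection error points back at the network checks in the previous section.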

Proxy Troubleshooting

Proxy configuration can be tricky to debug. If web requests are not going through the proxy, or are not going anywhere, there are two likely causes: the daemon is running without the proper environment variables set, or the host cannot connect to the proxy server (e.g. because of LAN restrictions).

Network Issues Maybe?

Before anything else, check that the proxy server can be reached using standard networking tools. If the hostname is proxy.local and the port is 8118, the commands are similar to the basic troubleshooting steps in the previous section:

  1. ping proxy.local to see if it can be reached
  2. dig proxy.local or host proxy.local to see if the hostname resolves
  3. nc -v proxy.local 8118 to check that the port is open and can be connected to
  4. Finally, try using curl to connect through the proxy to the Events API's agent heartbeat endpoint. How clients authenticate through the proxy may vary, and other settings may be needed, so you may need to specify extended proxy options; see man curl for details:
     curl --proxy proxy.local:8118 [additional proxy options] https://api.pagerduty.com/agent/2014-03-14/heartbeat

If all goes well, you should receive a response that looks like {"heartbeat_interval_secs":86400}
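The same heartbeat request can be made from Python with an explicit proxy, which is closer to what the agent itself does than curl is. The sketch below uses Python 3's urllib.request (the successor to urllib2) and assumes an unauthenticated proxy at the example address; build_proxy_opener and fetch_heartbeat are hypothetical names.

```python
import json
import urllib.request

# Heartbeat endpoint as given in the curl example in this guide.
HEARTBEAT_URL = "https://api.pagerduty.com/agent/2014-03-14/heartbeat"


def build_proxy_opener(proxy_url):
    """Return an opener that routes http and https requests via proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_heartbeat(proxy_url, timeout=10.0):
    """Request the heartbeat endpoint through the proxy and decode the JSON."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(HEARTBEAT_URL, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, fetch_heartbeat("http://proxy.local:8118") should return a dictionary containing heartbeat_interval_secs if the proxy is working. Because the proxy is passed explicitly here, a success combined with a failing agent points at the daemon's environment rather than at the proxy.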

Environment Variables?

The daemon is what actually sends events to PagerDuty, and it needs environment variables in order to use the proxy. The underlying Python library used to make the HTTPS connection is urllib2. Per the documentation on urllib2.urlopen:

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.
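You can check which proxy settings this detection mechanism picks up in a given environment. The sketch below uses urllib.request.getproxies_environment from Python 3, the successor to urllib2, so behavior may differ in detail from the agent's Python 2 runtime; missing_proxy_schemes is a hypothetical helper. Note that it inspects the environment of the process that runs it, not the daemon's.

```python
import urllib.request


def missing_proxy_schemes(required=("http", "https")):
    """Return the schemes in `required` with no *_proxy variable detected."""
    # Maps scheme -> proxy URL, built from *_proxy environment variables.
    detected = urllib.request.getproxies_environment()
    return [scheme for scheme in required if scheme not in detected]
```

An empty list means both http_proxy and https_proxy were found. Running this from your login shell and getting an empty list does not prove the daemon sees the same variables, since services are typically started with a different environment.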

Under "Using a Proxy" in the Agent Install Guide, two ways of configuring the environment variables for the daemon are described: systemd-based versus SysV init-based Linux systems. You can find out which of these your system uses by running sudo stat /proc/1/exe, which will show you the executable associated with the PID=1 process.

Once you've made sure that the daemon's environment variables are configured using the correct method, make sure the changes are applied; how to do this varies by system. On systemd-based systems, you will be prompted to run systemctl daemon-reload when you restart the pdagent service after modifying the service configuration. After running it, restart the service once more for the changes to take effect.

Still Stuck?

As a last resort, if all other troubleshooting fails, you can see the actual environment variables that the daemon is running with by adding a logging statement to the agent's sendevent module. On most Linux systems this will be in /lib/python2.7/site-packages/pdagent/, and the method to alter is SendEventTask.send_event:

def send_event(self, json_event_str, event_id):
    import os; logger.info(str(os.environ)) # add this line

After making the change, restart pdagent and try sending a test event using pd-send. You'll then find a message in the log that looks like this:

2017-05-08 18:29:31,556 INFO    SendEventTask        pdagent.sendevent    {'TERM': 'xterm-256color', 'SHELL': '/bin/bash', 'SHLVL': '2', 'SYSTEMCTL_SKIP_REDIRECT': '', 'PWD': '/', 'LOGNAME': 'pdagent', 'USER': 'pdagent', 'HOME': '/home/pdagent', 'SYSTEMCTL_IGNORE_DEPENDENCIES': '', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'XDG_SESSION_ID': '1', '_': '/usr/share/pdagent/bin/pdagentd.py'}

If the environment variables were set, they would appear in the printout of the os.environ dictionary, e.g. as 'http_proxy': 'http://proxy.local:8118' and 'https_proxy': 'https://proxy.local:8118'. Note that both are missing from the line above.
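On Linux, a less invasive alternative to editing the agent's source is to read the daemon's environment from /proc/PID/environ, which holds the variables the process was started with as NUL-separated NAME=value entries. This is a sketch: the helper names are hypothetical, you need to supply the daemon's PID yourself, and reading another user's process usually requires root.

```python
def parse_environ_blob(blob):
    """Turn the raw bytes of /proc/<pid>/environ into a dict of strings."""
    env = {}
    for entry in blob.split(b"\0"):
        if not entry:
            continue
        name, sep, value = entry.partition(b"=")
        if sep:
            env[name.decode("utf-8", "replace")] = value.decode("utf-8", "replace")
    return env


def read_process_environ(pid):
    """Read the start-up environment of a running process (Linux only)."""
    with open("/proc/%d/environ" % pid, "rb") as handle:
        return parse_environ_blob(handle.read())


def proxy_settings(env):
    """Keep only the *_proxy variables, in any letter case."""
    return {k: v for k, v in env.items() if k.lower().endswith("_proxy")}
```

For example, proxy_settings(read_process_environ(1234)) with the daemon's PID in place of 1234 returns an empty dictionary if no proxy variables reached the daemon, which matches the situation shown in the log line above.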
