PagerDuty Agent Troubleshooting Guide

PagerDuty Agent is an open-source utility for integrating standalone/on-premise monitoring tools with PagerDuty.

Basic Troubleshooting

If you have trouble sending events to PagerDuty using the agent, run the following checks first:

  • Verify that the pdagent service is running. If the service is not running, you can start it with the command systemctl start pdagent (on systemd-based systems) or sudo service pdagent start (on SysV init-based systems). Any events that were enqueued while the service was stopped will be sent when the service starts again.
  • Check the service integration key: Make sure that the key is for the correct service and that it has been entered correctly.
  • Send a test event using pd-send. Remember to replace the key used in the example with your own PagerDuty service integration key.
  • Check the logs for any errors, including events that couldn't be submitted to the Events API. These are typically found under /var/log/pdagent/pdagentd.log.
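The log check in the last item can be scripted. The following is a minimal sketch, assuming every log line starts with a timestamp followed by the level, as in the sample log line shown later in this guide. The function name find_problem_lines and the set of levels treated as problems are illustrative choices, not part of the agent.

```python
import re

# Levels treated as problems; this set is an assumption, adjust as needed.
PROBLEM_LEVELS = {"WARNING", "ERROR", "CRITICAL"}

# Matches "YYYY-MM-DD HH:MM:SS,mmm LEVEL ..." and captures the level.
LINE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+(\w+)\s")


def find_problem_lines(lines):
    """Return the log lines whose level is in PROBLEM_LEVELS."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(1) in PROBLEM_LEVELS:
            hits.append(line.rstrip("\n"))
    return hits
```

To use it, open /var/log/pdagent/pdagentd.log (this may require elevated privileges) and pass the file object to find_problem_lines.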

Outqueue Troubleshooting

In addition to using pd-queue to manage the queue, the agent's outqueue directories will often hold a wealth of information that is useful for troubleshooting. The queue may also be manually cleared by working directly with the files.

Each file in the outqueue directories contains the body of an HTTP POST request that will be made to the PagerDuty Events API. You can examine these files to see what your monitoring tools are submitting to the agent. File names follow the naming convention {UNIX timestamp}_{integration key}.txt.
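As an illustration of the naming convention, the sketch below splits a file name into its timestamp and integration key. It assumes the timestamp is a whole number of seconds and that the name has no other prefix. The function names are hypothetical.

```python
from datetime import datetime, timezone


def parse_queue_filename(name):
    """Split '{UNIX timestamp}_{integration key}.txt' into (timestamp, key)."""
    stem, dot, ext = name.rpartition(".")
    if dot != "." or ext != "txt":
        raise ValueError("not a queue file name: %r" % name)
    timestamp, sep, key = stem.partition("_")
    if not sep or not timestamp.isdigit() or not key:
        raise ValueError("not a queue file name: %r" % name)
    return int(timestamp), key


def enqueue_time_utc(name):
    """Return the timestamp as an aware UTC datetime (assumes seconds)."""
    timestamp, _key = parse_queue_filename(name)
    return datetime.fromtimestamp(timestamp, tz=timezone.utc)
```

Comparing the enqueue time with the current time shows how long an event has been waiting in the queue.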

Outqueue Directories

The queue directories, located in /var/lib/pdagent/outqueue, include:

Directory   Details
err         Events that pdagent attempted to deliver to the PagerDuty Events API, but for which it received a 400 Invalid Request response from PagerDuty, are stored here. Malformed events (e.g., missing an integration key) that cannot be submitted are kept here, too.
suc         Successfully submitted events. If you see an event file cached here, it was sent to PagerDuty. If the event did not trigger an incident in PagerDuty, and you are having difficulty troubleshooting, please refer to Why Incidents Fail to Trigger.
tmp         Event files submitted to the agent by external processes, such as integration scripts or pd-send, are stored here.
pdq         The queue of events to be sent; contains symbolic links to the files in tmp.
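A quick way to see where events are piling up is to count the entries in each directory. This is a rough sketch under the directory layout described above. The function name count_queue_files is made up for illustration, and reading these directories may require elevated privileges.

```python
import os

# Directory names as listed in the table above.
QUEUE_DIRS = ("err", "suc", "tmp", "pdq")


def count_queue_files(root="/var/lib/pdagent/outqueue"):
    """Return {directory: entry count}; None marks a missing or unreadable directory."""
    counts = {}
    for name in QUEUE_DIRS:
        path = os.path.join(root, name)
        try:
            counts[name] = len(os.listdir(path))
        except OSError:
            counts[name] = None
    return counts
```

A growing count in pdq suggests events are not being delivered, while a growing count in err suggests events are being rejected or are malformed.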

🚧

Event Expiration and Queue Cleaning

Events are deleted from the outqueue directories at a regular interval once their age exceeds a predefined length, in seconds. The default age threshold is one week, and the default cleaning interval is three hours. Both the interval and the age threshold are defined in the PagerDuty Agent configuration file, which should be located at /etc/pdagent.conf:

  • cleanup_interval_secs: The interval, in seconds, between queue cleanups.
  • cleanup_threshold_secs: The age, in seconds, after which events will be deleted.
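To check which values are in effect, you could read the two settings from the configuration file. The sketch below assumes the file is INI-style with at least one section header. It searches every section so that it does not depend on a particular section name, and it falls back to the defaults stated above. The function name is hypothetical.

```python
import configparser

# Defaults stated in this guide: three hours and one week, in seconds.
DEFAULT_INTERVAL_SECS = 3 * 60 * 60
DEFAULT_THRESHOLD_SECS = 7 * 24 * 60 * 60


def read_cleanup_settings(config_text):
    """Return (cleanup_interval_secs, cleanup_threshold_secs) from INI-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    interval, threshold = DEFAULT_INTERVAL_SECS, DEFAULT_THRESHOLD_SECS
    # Look in every section so the sketch does not assume a section name.
    for section in parser.sections():
        interval = parser.getint(section, "cleanup_interval_secs", fallback=interval)
        threshold = parser.getint(section, "cleanup_threshold_secs", fallback=threshold)
    return interval, threshold
```

Pass it the contents of /etc/pdagent.conf. Lines that are commented out in the file are ignored, so the defaults are returned for them.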

Manually Purging the Queue

In the event of a long network outage, where a large number of events accumulate in the agent’s outqueue, you can manually clear out the queue to avoid a storm of incidents in PagerDuty when network connectivity is restored.

To clear the queue, run the following commands:

sudo rm -vf /var/lib/pdagent/outqueue/pdq/*
sudo rm -vf /var/lib/pdagent/outqueue/tmp/*
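If you would like to preview what will be removed before clearing the queue, the same operation can be sketched in Python with a dry-run mode. This is an illustrative equivalent of the two rm commands, not a tool shipped with the agent. The function name purge_queue and its dry_run parameter are assumptions.

```python
import os


def purge_queue(root="/var/lib/pdagent/outqueue", dry_run=True):
    """List (dry_run=True) or remove (dry_run=False) every entry in pdq and tmp."""
    affected = []
    for name in ("pdq", "tmp"):
        directory = os.path.join(root, name)
        for entry in sorted(os.listdir(directory)):
            path = os.path.join(directory, entry)
            # Skip real subdirectories, which rm without -r would also leave in place.
            if os.path.isdir(path) and not os.path.islink(path):
                continue
            if not dry_run:
                os.unlink(path)  # removes the file, or the symbolic link itself
            affected.append(path)
    return affected
```

Calling purge_queue() returns the paths without deleting anything; purge_queue(dry_run=False) deletes them and must be run with sufficient privileges.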

Network Troubleshooting

The host that is running PagerDuty Agent should be able to make an HTTPS connection to events.pagerduty.com. With this in mind, if the LAN does not have DNS service, there is no route to the internet, or the local ACL (if any) does not permit connecting to remote hosts, the agent will not be able to send data to PagerDuty.

The following items can help resolve network issues.

DNS

In the host’s command line, run the following command:

dig events.pagerduty.com

If you get a SERVFAIL or NXDOMAIN status response, the issue is most likely with DNS.
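If dig is not installed on the host, a similar check can be made with Python's standard library. This is a minimal sketch; resolve_host is a hypothetical helper, and unlike dig it does not distinguish between SERVFAIL and NXDOMAIN, it only reports whether the system resolver returned any address.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated addresses for hostname, or [] if resolution fails."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})
```

An empty list from resolve_host("events.pagerduty.com") points to a DNS problem on the host.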

Reachability

Our Events API hosts should respond to ICMP. Run the following command:

ping -q -c 3 events.pagerduty.com

If you see an indication of packet loss above 0%, or the program doesn't return after about five seconds, there is an issue with your network configuration, such as a routing problem.

TCP Connection

Check to see if you can make a TCP connection to the Events API with netcat:

nc -v events.pagerduty.com 443

You should see the following message: Connection to events.pagerduty.com port 443 [tcp/https] succeeded!.

If the netcat command takes too long to print out any messages, or produces an error message indicating a timeout, connection refusal, etc., check your network's firewall/ACL and refer to Safelist IPs.

📘

Tip

If you’re using BSD netcat, you can add flags -G 5 -w 1. This will make the utility exit one second after establishing a connection (-w 1) and give up and report timeout after five seconds of waiting to establish one (-G 5).
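Where netcat is unavailable, the TCP check can be approximated in Python. The sketch below assumes a five-second timeout is acceptable; can_connect is an illustrative name.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to (host, port) succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name resolution failures.
        return False
```

For example, can_connect("events.pagerduty.com", 443) returning False is the counterpart of netcat reporting a timeout or a refused connection.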

Testing TLS

PagerDuty Agent comes bundled with its own copy of the public server certificate, which it uses to validate the connection to PagerDuty’s Events API.

🚧

Supported Protocols and the Server Certificate

PagerDuty supports the TLSv1.2 protocol. Older protocols, such as SSLv3 and TLSv1.0, are not supported due to security concerns. Traffic using these older protocols will be dropped.

For the security-conscious, the following command checks whether you’re able to establish a TLS connection:

openssl s_client -host events.pagerduty.com -port 443

You should see messages indicating a successful connection/TLS handshake, including the following certificate chain:

Certificate chain
 0 s:/OU=GT12858685/OU=See www.rapidssl.com/resources/cps (c)14/OU=Domain Control Validated - RapidSSL(R)/CN=*.pagerduty.com
   i:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
 1 s:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
   i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
 2 s:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
   i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA

If you get a TLS handshake error, it may be the case that the GeoTrust root CA certificate isn't trusted on your system. You can find this certificate at DigiCert Trusted Root Authority Certificates.

In most distributions of Linux, the ca-certificates utility can be used for managing trusted certificates.
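The handshake test can also be scripted. The following sketch uses Python's ssl module and validates the server against the system trust store, which is an assumption: it does not use the certificate bundled with the agent, so it tests the host's trust configuration rather than the agent's. The function name is hypothetical.

```python
import socket
import ssl


def tls_handshake_version(host, port=443, timeout=5.0):
    """Return the negotiated protocol name, or None if connecting or the handshake fails."""
    context = ssl.create_default_context()  # verifies against the system trust store
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=host) as tls:
                return tls.version()
    except OSError:
        # ssl.SSLError is a subclass of OSError, so handshake failures land here too.
        return None
```

A None result from tls_handshake_version("events.pagerduty.com") when the plain TCP check succeeds suggests a certificate trust or protocol version problem.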

Proxy Troubleshooting

Issues with proxies can arise if the daemon’s environment variables are not properly set or if there's an issue connecting to the proxy server (e.g., LAN restrictions). For instructions on configuring PagerDuty Agent to use a proxy server, refer to the PagerDuty Agent installation guide.

Network Issues

The following standard networking tools should be the first things to try when it comes to troubleshooting networking issues.

Example
In the following examples, we’ll assume the proxy server’s hostname is proxy.local and the port is 8118.

You can try the following commands to uncover any networking issues:

  1. ping proxy.local to see if it can be reached.
  2. dig proxy.local or host proxy.local to see if the hostname resolves.
  3. nc -v proxy.local 8118 to check that the port is open and accepts connections.
  4. Try using curl to connect through the proxy to the Events API's agent heartbeat endpoint. Client authentication methods vary; please reference man curl for more information about any options specific to your environment. For example:
curl --proxy proxy.local:8118 [additional proxy options] https://api.pagerduty.com/agent/2014-03-14/heartbeat

If you’re able to successfully connect, you should receive a response similar to {"heartbeat_interval_secs":86400}.
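The curl test can be approximated in Python 3 as well. This sketch assumes a plain HTTP proxy at proxy.local:8118 that needs no authentication; build_proxy_opener and fetch_heartbeat are illustrative names, and the URL is the one from the curl example.

```python
import urllib.request

# Heartbeat URL taken from the curl example in this guide.
HEARTBEAT_URL = "https://api.pagerduty.com/agent/2014-03-14/heartbeat"


def build_proxy_opener(proxy="proxy.local:8118"):
    """Return an opener that routes HTTP and HTTPS requests through the given proxy."""
    proxy_url = "http://" + proxy  # assumption: the proxy itself speaks plain HTTP
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


def fetch_heartbeat(opener, timeout=10):
    """Return the decoded response body; raises urllib.error.URLError on failure."""
    with opener.open(HEARTBEAT_URL, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch_heartbeat(build_proxy_opener()) from the agent's host exercises the same path as the curl command.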

Environment Variables

The agent daemon requires access to environment variables in order to use a proxy and send events to PagerDuty. This is because the underlying Python library used to make the HTTPS connection is urllib2. From the documentation on urllib2.urlopen:

In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.
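The detection described in that passage can be observed directly. The sketch below uses urllib.request.getproxies from Python 3, the successor to the Python 2 urllib2 module mentioned above, so treat it as an approximation of the agent's behavior rather than the agent's own code. The helper name proxies_seen_with is hypothetical.

```python
import os
import urllib.request


def proxies_seen_with(env_updates):
    """Temporarily apply env_updates to os.environ and return the proxies urllib detects."""
    saved = dict(os.environ)
    os.environ.update(env_updates)
    try:
        return urllib.request.getproxies()
    finally:
        # Restore the original environment exactly.
        os.environ.clear()
        os.environ.update(saved)
```

For example, proxies_seen_with({"https_proxy": "http://proxy.local:8118"}) includes an "https" entry, whereas a process started without the variable does not see one. This is why the daemon's own environment matters, not that of your login shell.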

Under Using a Proxy in the PagerDuty Agent Integration Guide, two ways of configuring the environment variables for the daemon are described: systemd-based versus SysV init-based Linux systems. You can find out which of these your system uses by running sudo stat /proc/1/exe, which will show you the executable associated with the PID=1 process.

Once you've made sure that the daemon is configured with appropriate environment variables, you will need to make sure the changes are applied, which may vary based on your system. On systemd-based systems, a restart is required for changes to the service configuration to take effect, and you will be prompted to run systemctl daemon-reload when you restart pdagent.

Other Troubleshooting Steps

You may also wish to see the environment variables that the daemon is running with by adding a logging statement to the agent's component module sendevent. On most Linux systems this will be in /lib/python2.7/site-packages/pdagent/. Add the following line to SendEventTask.send_event:

def send_event(self, json_event_str, event_id):
    import os; logger.info(str(os.environ)) # add this line
    ...

After making the change, restart pdagent and try sending a test event using pd-send. After that, you'll find a message in the log that looks like this:

2017-05-08 18:29:31,556 INFO    SendEventTask        pdagent.sendevent    {'TERM': 'xterm-256color', 'SHELL': '/bin/bash', 'SHLVL': '2', 'SYSTEMCTL_SKIP_REDIRECT': '', 'PWD': '/', 'LOGNAME': 'pdagent', 'USER': 'pdagent', 'HOME': '/home/pdagent', 'SYSTEMCTL_IGNORE_DEPENDENCIES': '', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'XDG_SESSION_ID': '1', '_': '/usr/share/pdagent/bin/pdagentd.py'}

If the environment variables were set, they should show up in the print-out of the os.environ dictionary, e.g., 'http_proxy': 'http://proxy.local:8118' and 'https_proxy': 'https://proxy.local:8118', which are missing in the above example.
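Rather than reading the dictionary by eye, you could extract the proxy variables from the logged line. This is a small sketch that assumes the line contains a Python dictionary literal, as in the sample above; the function name is made up for illustration.

```python
import ast

PROXY_VARS = ("http_proxy", "https_proxy")


def proxy_vars_in_log_line(line):
    """Return {name: value} for the proxy variables found in a logged os.environ dump."""
    start = line.find("{")
    end = line.rfind("}")
    if start == -1 or end < start:
        return {}
    # literal_eval parses the dictionary text without executing any code.
    environ = ast.literal_eval(line[start:end + 1])
    return {name: environ[name] for name in PROXY_VARS if name in environ}
```

An empty result for the logged line means the daemon was started without the proxy variables.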

pdagent.service not found Error

Sometimes on a fresh install of the PagerDuty Agent, you might get the error message Failed to start pdagent.service: Unit pdagent.service not found. This is due to an error during installation where the file pdagent.service was not copied to the proper directory.

In order to fix this, you can copy the pdagent.service file into the proper directory with the following steps:

  1. Find the file pdagent.service with the command sudo find / -name pdagent.service.
  2. Copy the location of the file.
  3. Copy the file to the correct directory with sudo cp <PATH_TO_FILE> /etc/systemd/system.
  4. Try starting the agent again with systemctl start pdagent or service pdagent start, depending on whether your operating system uses systemctl or service.
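The search in step 1 can be sketched in Python as follows. It is an illustrative equivalent of the find command; find_unit_files is a hypothetical name, and walking the whole filesystem from / can be slow.

```python
import os


def find_unit_files(root="/", name="pdagent.service"):
    """Return every path under root whose file name equals name."""
    matches = []
    # os.walk silently skips directories it cannot read.
    for directory, _subdirs, files in os.walk(root):
        if name in files:
            matches.append(os.path.join(directory, name))
    return sorted(matches)
```

If the function returns more than one path, compare the candidates before copying one into /etc/systemd/system.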
