PagerDuty Agent Troubleshooting Guide
PagerDuty Agent is an open-source utility for integrating standalone/on-premise monitoring tools with PagerDuty.
Basic Troubleshooting
If you have trouble sending events to PagerDuty using the agent, run the following checks first:
- Verify that the pdagent service is running. If the service is not running, you can start it with the command systemctl start pdagent (on systemd-based systems) or sudo service pdagent start (on SysV init-based systems). Any enqueued events that accumulated while the service was stopped will be sent when the service starts again.
- Check the service integration key: Make sure that the key is for the correct service and that it has been entered correctly.
- Send a test event using pd-send. Remember to replace the key used in the example with your own PagerDuty service integration key.
- Check the logs for any errors, including events that couldn't be submitted to the Events API. The logs are typically found under /var/log/pdagent/pdagentd.log.
Outqueue Troubleshooting
In addition to managing the queue with pd-queue, you can inspect the agent's outqueue directories, which often hold information that is useful for troubleshooting. The queue may also be manually cleared by working directly with the files.
Each file in the outqueue directories contains the body of an HTTP POST request that will be made to the PagerDuty Events API. You can examine these files to see what your monitoring tools are submitting to the agent. File names follow the naming convention {UNIX timestamp}_{integration key}.txt.
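When scanning a directory full of queued events, it can help to split each file name back into its two parts. The following is a minimal sketch, not part of the agent: it assumes the name matches the convention above exactly and that the timestamp is in whole seconds. The names parse_outqueue_name and QueuedEvent are hypothetical.

```python
import re
from datetime import datetime, timezone
from typing import NamedTuple, Optional

# Assumption: "{UNIX timestamp}_{integration key}.txt", timestamp in whole seconds.
_NAME_RE = re.compile(r"^(?P<ts>\d+)_(?P<key>.+)\.txt$")


class QueuedEvent(NamedTuple):
    enqueued_at: datetime   # timestamp from the file name, as UTC
    integration_key: str    # service integration key from the file name


def parse_outqueue_name(name: str) -> Optional[QueuedEvent]:
    """Split an outqueue file name into timestamp and key, or return None."""
    match = _NAME_RE.match(name)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group("ts")), tz=timezone.utc)
    return QueuedEvent(when, match.group("key"))
```

Calling parse_outqueue_name on each entry returned by os.listdir for one of the outqueue directories gives a quick view of which services have events waiting and since when. Names that do not fit the pattern yield None rather than raising.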
Outqueue Directories
The queue directories, located in /var/lib/pdagent/outqueue, include:
Directory | Details |
---|---|
err | Events that pdagent attempted to deliver to the PagerDuty Events API, but that received a 400 Invalid Request response from PagerDuty, are stored here. Malformed events (e.g., missing an integration key) that cannot be submitted are kept here, too. |
suc | Successfully submitted events. If you see an event file cached here, it was sent to PagerDuty. If the event did not trigger an incident in PagerDuty, and you are having difficulty troubleshooting, please refer to Why Incidents Fail to Trigger. |
tmp | Event files submitted to the agent by external processes, such as integration scripts or pd-send, are stored here. |
pdq | The queue of events to be sent; contains symbolic links to the files in tmp. |
Event Expiration and Queue Cleaning
Events are deleted from the outqueue directories at a regular interval once their age exceeds a predefined threshold, in seconds. The default age threshold is one week, and the default cleaning interval is three hours. Both the interval and the age threshold are defined in the PagerDuty Agent configuration file, which should be located at /etc/pdagent.conf:
- cleanup_interval_secs: The interval, in seconds, between queue cleanings.
- cleanup_threshold_secs: The age, in seconds, after which events will be deleted.
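To make the defaults concrete, here is a small sketch of the arithmetic. It is an illustration only: the constant and function names are hypothetical, and how the agent itself measures an event's age is not stated here, so the caller supplies the event's timestamp.

```python
# Defaults described above, expressed in seconds.
DEFAULT_CLEANUP_INTERVAL_SECS = 3 * 60 * 60        # three hours
DEFAULT_CLEANUP_THRESHOLD_SECS = 7 * 24 * 60 * 60  # one week


def is_expired(event_time_secs, now_secs,
               threshold_secs=DEFAULT_CLEANUP_THRESHOLD_SECS):
    """Return True once the event's age exceeds the threshold."""
    return (now_secs - event_time_secs) > threshold_secs
```

Because cleaning only runs once per interval, an event that has just passed the threshold may stay on disk until the next cleaning run.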
Manually Purging the Queue
In the event of a long network outage, where a large number of events accumulate in the agent’s outqueue, you can manually clear out the queue to avoid a storm of incidents in PagerDuty when network connectivity is restored.
To clear the queue, run the following commands:
sudo rm -vf /var/lib/pdagent/outqueue/pdq/*
sudo rm -vf /var/lib/pdagent/outqueue/tmp/*
Network Troubleshooting
The host that is running PagerDuty Agent should be able to make an HTTPS connection to events.pagerduty.com. With this in mind, if the LAN does not have DNS service, there is no route to the internet, or the local ACL (if any) does not permit connecting to remote hosts, the agent will not be able to send data to PagerDuty.
The following items can help resolve network issues.
DNS
In the host’s command line, run the following command:
dig events.pagerduty.com
If you get a SERVFAIL or NXDOMAIN status response, the issue is most likely with DNS.
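If dig is not installed on the host, a rough equivalent can be run from Python. This is a sketch under assumptions: resolve_addresses is a hypothetical helper, and it goes through the system resolver (which also consults /etc/hosts), so it does not report DNS status codes the way dig does.

```python
import socket


def resolve_addresses(hostname, port=443):
    """Return the sorted, de-duplicated IP addresses the system resolver reports.

    An empty list means the name could not be resolved.
    """
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

An empty result for resolve_addresses("events.pagerduty.com") points to the same kind of DNS problem as a failed dig lookup.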
Reachability
Our Events API hosts should respond to ICMP. Run the following command:
ping -q -c 3 events.pagerduty.com
If you see an indication of packet loss above 0%, or the program doesn't return after about five seconds, there is likely an issue with your network configuration, e.g., routing.
TCP Connection
Check to see if you can make a TCP connection to the Events API with netcat:
nc -v events.pagerduty.com 443
You should see the following message: Connection to events.pagerduty.com port 443 [tcp/https] succeeded!
If the netcat command takes too long to print out any messages, or produces an error message indicating a timeout, connection refusal, etc., check your network's firewall/ACL and refer to Safelist IPs.
Tip
If you’re using BSD netcat, you can add the flags -G 5 -w 1. This will make the utility exit one second after establishing a connection (-w 1) and give up and report a timeout after five seconds of waiting to establish one (-G 5).
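Where netcat is unavailable, the same TCP check can be approximated with Python's standard library. This is a minimal sketch: can_connect is a hypothetical helper, and the five-second default simply mirrors the connection timeout suggested in the tip above.

```python
import socket


def can_connect(host, port, timeout_secs=5.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the host and tries each returned address.
        with socket.create_connection((host, port), timeout=timeout_secs):
            return True
    except OSError:
        # Covers timeouts, refused connections, and resolution failures.
        return False
```

A False result from can_connect("events.pagerduty.com", 443) corresponds to the timeout or refusal cases described above, so the firewall/ACL is the next thing to check.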
Testing TLS
PagerDuty Agent comes bundled with its own copy of the public server certificate, which it uses to validate the connection to PagerDuty’s Events API.
Supported Protocols and the Server Certificate
PagerDuty supports the TLSv1.2 protocol. Older protocols (e.g., SSLv3 and TLSv1.0) are not supported due to security concerns. Traffic using older protocols and versions of TLS will be dropped.
For the security-conscious, the following command checks whether you’re able to establish a TLS connection:
openssl s_client -host events.pagerduty.com -port 443
You should see messages indicating a successful connection/TLS handshake, including the following certificate chain:
Certificate chain
0 s:/OU=GT12858685/OU=See www.rapidssl.com/resources/cps (c)14/OU=Domain Control Validated - RapidSSL(R)/CN=*.pagerduty.com
i:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
1 s:/C=US/O=GeoTrust Inc./CN=RapidSSL SHA256 CA - G3
i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
2 s:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
i:/C=US/O=GeoTrust Inc./CN=GeoTrust Global CA
If you get a TLS handshake error, it may be the case that the GeoTrust root CA certificate isn't trusted on your system. You can find this certificate at DigiCert Trusted Root Authority Certificates.
In most distributions of Linux, the ca-certificates package can be used for managing trusted certificates.
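The handshake can also be tried from Python as a cross-check. The sketch below is an assumption-laden illustration, not how the agent connects: it validates against the system trust store rather than the agent's bundled certificate, it needs Python 3.7 or later, and both function names are hypothetical.

```python
import socket
import ssl


def make_tls12_context():
    """Build a certificate-verifying client context that refuses anything below TLSv1.2."""
    context = ssl.create_default_context()  # uses the system's trusted CA certificates
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host, port=443, timeout_secs=5.0):
    """Connect, complete the TLS handshake, and return the protocol version string."""
    context = make_tls12_context()
    with socket.create_connection((host, port), timeout=timeout_secs) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

If negotiated_tls_version("events.pagerduty.com") raises ssl.SSLCertVerificationError, the trust store is the likely culprit. Other ssl.SSLError exceptions suggest a protocol or handshake problem.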
Proxy Troubleshooting
Issues with proxies can arise if the daemon’s environment variables are not properly set or if there's an issue connecting to the proxy server (e.g., LAN restrictions). For instructions on configuring PagerDuty Agent to use a proxy server, refer to the PagerDuty Agent installation guide.
Network Issues
The following standard networking tools should be the first things to try when it comes to troubleshooting networking issues.
Example
In the following examples, we’ll assume the proxy server’s hostname is proxy.local and the port is 8118.
You can try the following commands to uncover any networking issues:
- ping proxy.local to see if it can be reached.
- dig proxy.local or host proxy.local to see if the hostname resolves.
- nc -v proxy.local 8118 to check that the port is open and accepts connections.
- Try using curl to connect through the proxy to the Events API's agent heartbeat endpoint. Client authentication methods vary; please reference man curl for more information about any options specific to your environment. For example:
curl --proxy proxy.local:8118 [additional proxy options] https://api.pagerduty.com/agent/2014-03-14/heartbeat
If you’re able to successfully connect, you should receive a response similar to {"heartbeat_interval_secs":86400}.
Environment Variables
The agent daemon requires access to environment variables in order to use a proxy and send events to PagerDuty. This is because the underlying Python library used to make the HTTPS connection is urllib2. From the documentation on urllib2.urlopen:
In addition, if proxy settings are detected (for example, when a *_proxy environment variable like http_proxy is set), ProxyHandler is default installed and makes sure the requests are handled through the proxy.
Under Using a Proxy in the PagerDuty Agent Integration Guide, two ways of configuring the environment variables for the daemon are described: systemd-based versus SysV init-based Linux systems. You can find out which of these your system uses by running sudo stat /proc/1/exe, which will show you the executable associated with the PID=1 process.
Once you've made sure that the daemon is configured with appropriate environment variables, you will need to make sure the changes are applied, which may vary based on your system. In systemd-based systems, you will be prompted to run systemctl daemon-reload when you restart pdagent, since a restart is required in order for changes to the service configuration to take effect.
Other Troubleshooting Steps
You may also wish to see the environment variables that the daemon is running with by adding a logging statement to the agent's component module sendevent. On most Linux systems this will be in /lib/python2.7/site-packages/pdagent/. Add the following line to SendEventTask.send_event:
def send_event(self, json_event_str, event_id):
    import os; logger.info(str(os.environ))  # add this line
    ...
After making the change, restart pdagent and try sending a test event using pd-send. After that, you'll find a message in the log that looks like this:
2017-05-08 18:29:31,556 INFO SendEventTask pdagent.sendevent {'TERM': 'xterm-256color', 'SHELL': '/bin/bash', 'SHLVL': '2', 'SYSTEMCTL_SKIP_REDIRECT': '', 'PWD': '/', 'LOGNAME': 'pdagent', 'USER': 'pdagent', 'HOME': '/home/pdagent', 'SYSTEMCTL_IGNORE_DEPENDENCIES': '', 'PATH': '/sbin:/usr/sbin:/bin:/usr/bin', 'XDG_SESSION_ID': '1', '_': '/usr/share/pdagent/bin/pdagentd.py'}
If the environment variables were set, they should show up in the print-out of the os.environ dictionary, e.g., 'http_proxy': 'http://proxy.local:8118' and 'https_proxy': 'https://proxy.local:8118', which are missing in the above example.
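Spotting a missing key in a long log line by eye is error-prone, so the check can be scripted. The sketch below assumes the log line ends with a plain dictionary print-out like the one shown above (the Python 2 style). Both function names are hypothetical, and the variable names checked are the two mentioned in this section.

```python
import ast


def environ_from_log_line(line):
    """Extract the environment dictionary printed at the end of a log line."""
    start = line.index("{")  # assumes the dict is the first "{" on the line
    return ast.literal_eval(line[start:])


def missing_proxy_vars(environ, names=("http_proxy", "https_proxy")):
    """Return the proxy variable names that are absent or empty in environ."""
    return [name for name in names if not environ.get(name)]
```

Running missing_proxy_vars on the dictionary recovered from the sample log line above would report both variables as missing, matching what the text describes.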
pdagent.service not found Error
Sometimes on a fresh install of the PagerDuty Agent, you might get the error message Failed to start pdagent.service: Unit pdagent.service not found. This is due to an error during installation where the file pdagent.service was not copied to the proper directory.
In order to fix this, you can copy the pdagent.service file into the proper directory with the following steps:
- Find the file pdagent.service with the command sudo find / -name pdagent.service.
- Copy the location of the file.
- Copy the file to the correct directory with sudo cp <PATH_TO_FILE> /etc/systemd/system.
- Try starting the service again with systemctl start pdagent or service pdagent start, depending on whether your operating system uses systemctl or service.