SolarWinds Orion Troubleshooting Guide
The SolarWinds Orion integration is based on a VBScript that runs locally on the SolarWinds host. The script is designed to work as follows:
- When an alert triggers in SolarWinds Orion, a text file containing JSON-encoded data is created, based on an alert template, through a Log Alert to File alert action. This file is intended to be sent to PagerDuty and is stored temporarily in C:\PagerDuty\Queue.
- The script is executed to deliver that event (and any others that are still in that directory) to PagerDuty through the Events API v1.
- The script removes the file once delivery is successful.
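The flow above can be sketched in Python for illustration. This is not the integration's VBScript; it is a minimal sketch that assumes the standard Events API v1 endpoint URL (not stated in this guide) and uses hypothetical function names. The sender is passed in as a parameter so the delete-only-on-success logic can be exercised without network access.

```python
import urllib.request
from pathlib import Path

# Assumption: the generic Events API v1 endpoint; this guide does not state the URL.
EVENTS_API_V1_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"
QUEUE_DIR = Path(r"C:\PagerDuty\Queue")  # queue folder named in this guide


def post_event(body: bytes) -> bool:
    """Send one event body to the Events API v1; True means HTTP 200."""
    request = urllib.request.Request(
        EVENTS_API_V1_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except OSError:
        # URLError and HTTPError are both OSError subclasses.
        return False


def drain_queue(queue_dir, send=post_event):
    """Try to deliver every .txt file in queue_dir; delete a file only on success."""
    delivered, stuck = [], []
    for path in sorted(Path(queue_dir).glob("*.txt")):
        if send(path.read_bytes()):
            path.unlink()
            delivered.append(path.name)
        else:
            # The file stays in the queue and is retried on the next run.
            stuck.append(path.name)
    return delivered, stuck
```

Because a failed delivery leaves the file in place, any file that can never be delivered (for example, one with invalid JSON) stays in the queue indefinitely, which is why the queue folder is the first place to look.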
If the integration is not working as expected, the cause can usually be traced to one of the steps outlined above.
Errors Importing Sample Alerts
If an alert cannot be imported, you may be presented with an error such as the following:
There was an error while importing alert from ... Please check OrionWeb.log for more details.
To find the path to the error’s log file, refer to Orion Core Logs. In the case of OrionWeb.log, the path will be Logs\Orion\OrionWeb.log, and the Logs folder will be in C:\Documents and Settings\All Users\Application Data\SolarWinds\ or C:\Documents and Settings\All Users\SolarWinds\.
The error in the log will provide more information on why the import did not succeed.
Sometimes, the error is related to one or more software components that are not present in the SolarWinds Orion installation. For example, APM (Application Performance Monitor) alerts, which have "Application" in their names, require that the APM module be installed and activated.
PagerDuty Incidents Not Triggered
The first place to look is in C:\PagerDuty\Queue for .txt files that are "stuck" in the event queue. A stuck file has a creation timestamp that is more than a few seconds old.
If there are any stuck event files, move them out of the folder so that the script will stop trying to send them to PagerDuty while you troubleshoot the integration.
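As a rough aid for spotting stuck files, the following Python sketch lists .txt files older than a threshold. The function name and the 10-second default are hypothetical, and it uses the last-modified time as a stand-in for the creation time, on the assumption that queue files are written once and never edited.

```python
import time
from pathlib import Path


def find_stuck_files(queue_dir, max_age_seconds=10.0, now=None):
    """Return (name, age_in_seconds) for .txt files older than max_age_seconds."""
    now = time.time() if now is None else now
    stuck = []
    for path in sorted(Path(queue_dir).glob("*.txt")):
        # Assumption: modification time approximates creation time for
        # files that are written once and then left alone.
        age = now - path.stat().st_mtime
        if age > max_age_seconds:
            stuck.append((path.name, age))
    return stuck
```

For example, `find_stuck_files(r"C:\PagerDuty\Queue")` would return an empty list when the queue is draining normally.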
Malformed Events
Open some of the text files to review their contents. Each file must contain valid JSON, including a valid integration key. If a file either contains invalid JSON or is missing a valid integration key, PagerDuty’s Events API will be unable to process the event.
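Rather than inspecting each file by eye, a short Python sketch like the one below can report where parsing fails. The function name is hypothetical, and the sketch assumes the queue files are UTF-8 encoded (with or without a byte order mark); a file in another encoding would raise a decoding error instead.

```python
import json
from pathlib import Path


def check_json_file(path):
    """Return None if the file parses as JSON, else a short error description."""
    path = Path(path)
    # Assumption: files are UTF-8; "utf-8-sig" also tolerates a leading BOM.
    text = path.read_text(encoding="utf-8-sig")
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"{path.name}: line {error.lineno}, column {error.colno}: {error.msg}"
    return None
```

The line and column in the message point at the first character the parser could not accept, which is usually just after the real problem (for instance, just after a stray quote).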
If you notice anything that needs adjustment, you can address this in SolarWinds Orion by going to Main Settings & Administration > Manage Alerts or Manage Custom Properties.
Stray Quote Characters
If property names or values contain escaped double quote characters (for instance, "Last Time \"Up\""), this will cause the alert sent to PagerDuty to contain invalid JSON. To address this:
- From the main menu, go to Manage Alerts.
- Find the alert that created the file and select the alert to edit it.
- Navigate to the Trigger Actions step.
- Under the Trigger Actions section on this page, click Edit on the Log Alert to PagerDuty Queue action, remove the errant characters from the message template, and save the alert.
- To test the updated alert action, click Simulate, select a node, and open the file to verify it is now valid JSON. You can also click Simulate on the Execute PagerDuty VBScript action to send a mock event and trigger a PagerDuty incident.
Blank or Invalid Integration Key
Next, look for the service_key property in the JSON data, and verify that it has a valid integration key. This should be the 32-character key generated for the integration in the setup process. If the property is blank or does not exist, or if the integration key is not associated with any of your PagerDuty services, you should modify the node in question for the alert and set a value for the integration key as described in the SolarWinds Integration Guide.
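The local part of this check can be sketched in Python as follows. The function name is hypothetical. The sketch only covers what can be verified offline (the property exists, is not blank, and is 32 characters long); it cannot tell whether the key is associated with one of your PagerDuty services.

```python
import json


def integration_key_problem(event_text):
    """Return None if service_key looks plausible, else a description of the problem."""
    event = json.loads(event_text)
    if not isinstance(event, dict):
        return "event is not a JSON object"
    key = event.get("service_key")
    if key is None:
        return "service_key is missing"
    if not isinstance(key, str) or not key.strip():
        return "service_key is blank"
    if len(key.strip()) != 32:
        return "service_key is not 32 characters long"
    # Whether the key belongs to a real service can only be checked in PagerDuty.
    return None
```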
Skip Alerts if the Integration Key is Blank
If you see a line in the event file that looks like the following, then there is a node that lacks a value in the PDIntegrationKey field:
"service_key": "",
One option to resolve this is to set a value for the PDIntegrationKey field on the node(s) that produced the invalid events.
However, there is another approach involving the Trigger Condition that will help you avoid this problem in the future. This alternate approach has the following advantages:
- It will allow you to create nodes that optionally don't send alerts to PagerDuty.
- It will avoid creating invalid alert events that collect in the queue folder.
You can implement this with the following steps:
- From the main menu, go to Settings > All Settings.
- Under Alerts & Reports, go to Manage Alerts.
- Click an alert to edit it and go to the Trigger Condition step.
- Make sure that conditions are combined with AND.
- Click to add a trigger condition and select Add Single Value Comparison.
- Set the condition to Node :: PDIntegrationKey :: is not empty.
Review Application Log Errors
The script’s error output can reveal useful troubleshooting information. Review this information with the following steps:
- Go to Control Panel > System and Security > Administrative Tools > Event Viewer.
- Select Windows Logs > Application.
Connection to PagerDuty
If the VBScript cannot connect to PagerDuty’s Events API v1, you will see related errors in the Windows Event Logs. If you run the script manually, errors will appear in the command line output.
- Look for an error with the source WSH:
  - If it is from msxml3.dll (or msxml6.dll) and the message is "Couldn't connect or send data to PagerDuty", it indicates a network connection issue.
  - If the error is "File could not be opened" or similar, the issue is with the permissions and ownership of the file.
Outbound HTTPS Connectivity
If the problem appears to be network-related, perform the following diagnostic steps to determine the nature of the connection issue:
- Open a command prompt (cmd.exe) and run ping events.pagerduty.com.
  - If ping fails, there are two things to check for:
    - If you receive an "unknown host" error, this indicates a localized DNS resolution issue.
    - If you receive a timeout or packet rejection, this could be a network or routing issue. This may also indicate that the local network's ACL is rejecting or dropping outbound ICMP traffic; contact your network administrator for clarification.
  - If ping succeeds, try establishing a TCP connection via telnet.exe events.pagerduty.com 443. Note the following:
    - If you don't have the telnet client enabled on the SolarWinds server, you can enable it by following Microsoft TechNet's guide to enabling telnet on Windows 10. If for some reason you cannot enable telnet on the SolarWinds server, skip to the next step.
    - If you get a "Connect failed" error, it is due to firewall/ACL rules that prevent outbound TCP connections.
    - Otherwise, type GET / and press Enter. If you see an HTTP 400 response, that means the TCP connection succeeded and telnet received an HTTP response from the Events API server.
- Open the Edge internet browser and navigate to https://events.pagerduty.com.
  - You should receive an HTTP response (status 404) and a "webpage cannot be found" message, which indicates that both establishing a TCP connection and the TLS handshake succeeded.
  - If you were able to connect via telnet but not Edge, this may indicate that the certificate authority behind the Events API's TLS certificate is not trusted locally. To resolve this, try adding the GeoTrust Root CA certificates to your trusted certificates. See Microsoft TechNet: Manage Trusted Root Certificates for further details.
Once you have gathered the necessary information about the nature of the connection issue, contact your local network or IT administrator for further assistance. Ask them to enable outbound HTTPS connections to remote hosts. It is also helpful to confirm there are no local firewall rules (e.g., on the SolarWinds server) that prevent connections.
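If Python happens to be available on the SolarWinds server, the DNS, TCP, and TLS checks above can be approximated in one pass. The sketch below is an illustration with a hypothetical function name, not part of the integration; it reports the first stage that fails. Unlike ping, it does not depend on ICMP being allowed, and it verifies the certificate against Python's default trust store, which may differ from the store that the VBScript or Edge uses.

```python
import socket
import ssl


def check_https_reachability(host="events.pagerduty.com", port=443, timeout=5.0):
    """Return (stage, detail), where stage is "dns", "tcp", "tls", or "ok"."""
    try:
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as error:
        return "dns", str(error)  # name resolution failed
    try:
        raw = socket.create_connection((host, port), timeout=timeout)
    except OSError as error:
        return "tcp", str(error)  # refused, filtered, or timed out
    try:
        context = ssl.create_default_context()
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return "ok", tls.version()
    except OSError as error:
        # ssl.SSLError (including certificate errors) is an OSError subclass.
        return "tls", str(error)
    finally:
        raw.close()
```

A "dns" result corresponds to the "unknown host" case, "tcp" to the "Connect failed" case, and "tls" to the case where telnet connects but the browser does not.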
Alert Recovery Does Not Resolve PagerDuty Incident
- Go to Settings > All Settings > Manage Alerts.
- Select the alert that did not automatically resolve and go to the Reset Actions step.
- Ensure that it has the Log Recovery to PagerDuty Queue action.
- Edit the log-to-queue action, and make sure all of the following apply:
  - The event_type property in the JSON is resolve.
  - The incident_key is the same value as it is in the template for the log-to-queue action in the Trigger Actions step.
- Simulate the recovery action and review the content of the file that it creates. Do the same with the alert action. Compare the incident_key values in both files.
When a resolve event is sent to PagerDuty and its deduplication key does not match an existing open incident or alert, the event will be dropped because there is nothing for the resolve to act upon.
If any template variables (e.g., ${...}) are present in the template, you will need to ensure that they are the right kind of variables, i.e., they must have the same values during the reset action as they have in the trigger action. For instance, node state is not an optimal choice of template variable to use in an incident deduplication key because the node state will be "Up" during the reset action, and likely "Down" for the alert action.
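The comparison of the two simulated files can be sketched in Python as below. The function name and the sample key values are hypothetical; the property names event_type and incident_key are the ones described in this section.

```python
import json


def resolve_matches_trigger(trigger_text, resolve_text):
    """Return a list of problems; an empty list means the pair looks consistent."""
    trigger = json.loads(trigger_text)
    resolve = json.loads(resolve_text)
    problems = []
    if resolve.get("event_type") != "resolve":
        problems.append("recovery event_type is not 'resolve'")
    trigger_key = trigger.get("incident_key")
    if not trigger_key:
        problems.append("trigger event has no incident_key")
    elif trigger_key != resolve.get("incident_key"):
        problems.append(
            f"incident_key differs: {trigger_key!r} vs {resolve.get('incident_key')!r}"
        )
    return problems
```

For example, a key built from the node name and node state would render differently in the two files (one ending in "Down", the other in "Up") and would be reported as a mismatch, whereas a key built only from values that stay constant across trigger and reset would pass.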