SolarWinds Orion Troubleshooting Guide

The SolarWinds Orion integration is based on a VBScript that runs locally on the SolarWinds host. The script is designed to work as follows:

  1. When an alert triggers in SolarWinds Orion, a Log Alert to File alert action creates a text file containing JSON-encoded data, based on an alert template. This file is intended to be sent to PagerDuty and is stored temporarily in C:\PagerDuty\Queue.
  2. The script is executed to deliver that event (and any others that are still in that directory) to PagerDuty through the Events API v1.
  3. The script removes the file once delivery is successful.
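The three steps above can be sketched in Python to make the queue behavior easier to reason about. This is an illustration only, not the integration's actual VBScript: the function names are hypothetical, and the endpoint URL is assumed to be the Events API v1 address and should be confirmed against PagerDuty's documentation.

```python
import urllib.request
from pathlib import Path
from typing import Callable

# Assumption: the Events API v1 endpoint; confirm against PagerDuty's documentation.
EVENTS_API_V1_URL = "https://events.pagerduty.com/generic/2010-04-15/create_event.json"


def post_event(payload: bytes) -> bool:
    """Send one event body to PagerDuty; return True on an HTTP 2xx response."""
    request = urllib.request.Request(
        EVENTS_API_V1_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and HTTPError both derive from OSError.
        return False


def flush_queue(queue_dir: Path, send: Callable[[bytes], bool] = post_event) -> list[str]:
    """Deliver every .txt file in queue_dir and delete only those that were sent.

    Returns the names of the files that remain in the queue.
    """
    remaining = []
    for event_file in sorted(queue_dir.glob("*.txt")):
        if send(event_file.read_bytes()):
            event_file.unlink()  # step 3: remove the file after successful delivery
        else:
            remaining.append(event_file.name)
    return remaining
```

Because a file is deleted only after a successful send, any event that fails validation or delivery stays in the folder and is retried on the next run, which is why stuck files accumulate there.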

If the integration is not working as expected, the cause can usually be traced to one of the steps outlined above.

Errors Importing Sample Alerts

If an alert cannot be imported, you may be presented with an error such as the following:

There was an error while importing alert from ... Please check OrionWeb.log for more details.

To find the path to the error’s log file, refer to Orion Core Logs. In the case of OrionWeb.log, the path will be Logs\Orion\OrionWeb.log, and the Logs folder will be in C:\Documents and Settings\All Users\Application Data\SolarWinds\ or C:\Documents and Settings\All Users\SolarWinds\.

The error in the log will provide more information on why the import did not succeed.

Sometimes, the error is related to one or more software components that are not present in the SolarWinds Orion installation. For example, APM (Application Performance Monitor) alerts, which have "Application" in their names, require that the APM module be installed and activated.

PagerDuty Incidents Not Triggered

The first place to look is C:\PagerDuty\Queue, for .txt files that are "stuck" in the event queue. Stuck files have creation timestamps more than a few seconds old.

If there are any stuck event files, move them out of the folder so that the script will stop trying to send them to PagerDuty while you troubleshoot the integration.
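Finding and moving stuck files can be scripted. The following is a minimal sketch, assuming a hypothetical holding folder and a 60-second age threshold (neither comes from the integration itself). It uses the modification time as a portable stand-in for the creation time.

```python
import shutil
import time
from pathlib import Path


def quarantine_stuck_events(
    queue_dir: Path, holding_dir: Path, max_age_seconds: float = 60.0
) -> list[Path]:
    """Move .txt files older than max_age_seconds out of the queue folder.

    Returns the new paths of the files that were moved.
    """
    holding_dir.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_seconds
    moved = []
    for event_file in sorted(queue_dir.glob("*.txt")):
        # Modification time approximates creation time for files written once.
        if event_file.stat().st_mtime < cutoff:
            destination = holding_dir / event_file.name
            shutil.move(str(event_file), str(destination))
            moved.append(destination)
    return moved
```

For example, `quarantine_stuck_events(Path(r"C:\PagerDuty\Queue"), Path(r"C:\PagerDuty\Stuck"))` would move the stuck files into a hypothetical C:\PagerDuty\Stuck folder, where they can be inspected without being resent.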

Malformed Events

Open some of the text files to review their contents. Each file must contain valid JSON, and the integration key must be a valid value. If a file contains invalid JSON or lacks a valid integration key, PagerDuty’s Events API will be unable to process the event.
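A JSON parser can pinpoint where a file stops being valid, which is faster than reading it by eye. The helper below is a sketch with a hypothetical name; it checks syntax only and says nothing about whether PagerDuty would accept the event.

```python
import json
from typing import Optional


def find_json_error(text: str) -> Optional[str]:
    """Return a description of the first JSON syntax error, or None if the text parses."""
    try:
        json.loads(text)
    except json.JSONDecodeError as error:
        return f"line {error.lineno}, column {error.colno}: {error.msg}"
    return None
```

Calling `find_json_error(Path(name).read_text())` on each stuck file gives the line and column to look at in the alert's message template.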

If you notice anything that needs adjustment, you can address this in SolarWinds Orion by going to Main Settings & Administration > Manage Alerts or Manage Custom Properties.

A screenshot of the SolarWinds Orion UI indicating how to navigate to Main Settings & Administration

Main Settings & Administration

Stray Quote Characters

If property names or values contain escaped double quote characters (for instance, "Last Time \"Up\""), this will cause the alert sent to PagerDuty to contain invalid JSON. To address this:

  1. From the main menu, go to Manage Alerts.
  2. Find the alert that created the file and select the alert to edit it.
  3. Navigate to the Trigger Actions step.
  4. Under the Trigger Actions section on this page, click Edit on the Log Alert to PagerDuty Queue action, remove the errant characters from the message template, and save the alert.
  5. To test the updated alert action, click Simulate, select a node, and open the file to verify it is now valid JSON. You can also click Simulate on the Execute PagerDuty VBScript action to send a mock event and trigger a PagerDuty incident.
A screenshot of the SolarWinds Orion UI detailing an alert's trigger actions

Trigger Actions

Blank or Invalid Integration Key

Next, look for the service_key property in the JSON data, and verify that it has a valid integration key. This should be the 32-character key generated for the integration in the setup process. If the property is blank or does not exist, or if the integration key is not associated with any of your PagerDuty services, you should modify the node in question for the alert and set a value for the integration key as described in the SolarWinds Integration Guide.
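The local part of this check can be automated. The sketch below (a hypothetical helper) reports whether service_key is missing, blank, or not 32 characters long. It cannot tell whether the key is actually associated with one of your PagerDuty services; that still has to be verified in PagerDuty.

```python
import json


def check_service_key(event_text: str) -> str:
    """Classify the service_key in one event file.

    Returns 'invalid_json', 'missing', 'blank', 'wrong_length', or 'ok'.
    """
    try:
        event = json.loads(event_text)
    except json.JSONDecodeError:
        return "invalid_json"
    if not isinstance(event, dict) or "service_key" not in event:
        return "missing"
    key = event["service_key"]
    if not isinstance(key, str) or not key.strip():
        return "blank"
    # The guide describes the integration key as a 32-character value.
    if len(key.strip()) != 32:
        return "wrong_length"
    return "ok"
```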

Skip Alerts if the Integration Key is Blank

If you see a line in the event file that looks like the following, then there is a node that lacks a value in the PDIntegrationKey field:

   "service_key": "",

One option to resolve this is to set a value for the PDIntegrationKey field on the node(s) that produced the invalid events.

However, there is another approach involving the Trigger Condition that will help you avoid this problem in the future. This alternate approach has the following advantages:

  • It will allow you to create nodes that optionally don't send alerts to PagerDuty.
  • It will avoid creating invalid alert events that collect in the queue folder.

You can implement this with the following steps:

  1. From the main menu, go to Settings > All Settings.
  2. Under Alerts & Reports, go to Manage Alerts.
  3. Click an alert to edit it and go to the Trigger Condition step.
  4. Make sure that conditions are combined with AND.
  5. Click to add a trigger condition and select Add Single Value Comparison.
A screenshot of the SolarWinds Orion UI detailing an alert's trigger conditions

Trigger conditions

  6. Set the condition to Node :: PDIntegrationKey :: is not empty.
A screenshot of the SolarWinds Orion UI detailing that a node's "PDIntegrationKey" field is not empty

Node "PDIntegrationKey" field is not empty

Review Application Log Errors

The script’s error output can reveal useful troubleshooting information. Review this information with the following steps:

  1. Go to Control Panel > System and Security > Administrative Tools > Event Viewer.
A screenshot of the Windows UI detailing how to access application log errors

Application log errors

  2. Select Windows Logs > Application.
A screenshot of the Windows UI showing how to select "Windows Logs > Application"

Select Windows Logs > Application

Connection to PagerDuty

If the VBScript cannot connect to PagerDuty’s Events API v1, you will see related errors in the Windows Event Logs. If you run the script manually, errors will appear in the command line output.

  1. Look for errors whose source is WSH:
  • If it is from msxml3.dll (or msxml6.dll) and the message is Couldn't connect or send data to PagerDuty, it indicates a network connection issue.
  • If the error is File could not be opened or similar, the issue is with the permissions and ownership of the file.
A screenshot of the Windows UI detailing Windows Event Logs

Windows Event Logs

A screenshot of the Windows UI detailing an error in Windows Event Logs: "Couldn't read alert file."

Windows Event Logs error

Outbound HTTPS Connectivity

If the problem appears to be network-related, perform the following diagnostic steps to determine the nature of the connection issue:

  1. Open command prompt (cmd.exe) and run ping events.pagerduty.com.
A screenshot of the command line showing successful ping results

Successful ping results

  2. If ping fails, check for the following:
  • If you receive an "unknown host" error, this indicates a localized DNS resolution issue.
  • If you receive a timeout or packet rejection, this could be a network or routing issue. This may also indicate that the local network's ACL is rejecting or dropping outbound ICMP traffic; contact your network administrator for clarification.
  3. If ping succeeds, try establishing a TCP connection via telnet.exe events.pagerduty.com 443. Note the following:
  • If you don't have the telnet client enabled on the SolarWinds server, you can enable it by following Microsoft TechNet's guide to enabling telnet on Windows 10. If for some reason you cannot enable telnet on the SolarWinds server, skip to the next step.
  • If you get a Connect failed error, firewall or ACL rules are preventing outbound TCP connections.
  • Otherwise, type GET / and press Enter. If you see an HTTP 400 response, the TCP connection succeeded and telnet received an HTTP response from the Events API server.
A screenshot of the command line indicating that the TCP connection to the Events API failed

TCP connection to the Events API failed

A screenshot of the command line showing the expected response when the TCP connection succeeded

Expected response when the TCP connection succeeded

  4. Open the Edge internet browser and navigate to https://events.pagerduty.com.
  • You should receive an HTTP response (status 404) and a "webpage cannot be found" message, which indicates that both the TCP connection and the TLS handshake succeeded.
  • If you were able to connect via telnet but not Edge, this may indicate that the certificate authority behind the Events API's TLS certificate is not trusted locally. To resolve this, try adding the GeoTrust Root CA certificates to your trusted certificates. See Microsoft TechNet: Manage Trusted Root Certificates for further details.
A screenshot of the Edge internet browser indicating a successful test

Successful test

A screenshot of the Edge internet browser indicating a possible certificate issue

Possible certificate issue

Once you have gathered the necessary information about the nature of the connection issue, contact your local network or IT administrator for further assistance. Ask them to enable outbound HTTPS connections to remote hosts. It is also helpful to confirm there are no local firewall rules (e.g., on the SolarWinds server) that prevent connections.
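If Python is available on the server, the DNS, TCP, and TLS checks described above can be approximated in one pass. This is a rough sketch with a hypothetical function name. It does not cover the ICMP ping test, and it validates certificates against Python's default trust store, which may differ from the store the VBScript relies on.

```python
import socket
import ssl


def diagnose_connection(
    host: str, port: int = 443, timeout: float = 5.0, check_tls: bool = True
) -> str:
    """Return the first stage that fails ('dns', 'tcp', or 'tls'), or 'ok'."""
    try:
        # Stage 1: name resolution (compare with the "unknown host" ping error).
        socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"
    try:
        # Stage 2: plain TCP connection (compare with the telnet test).
        connection = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "tcp"
    with connection:
        if not check_tls:
            return "ok"
        context = ssl.create_default_context()
        try:
            # Stage 3: TLS handshake with certificate verification (compare with the browser test).
            with context.wrap_socket(connection, server_hostname=host):
                return "ok"
        except OSError:
            # ssl.SSLError, certificate verification errors, and timeouts are OSError subclasses.
            return "tls"
```

Running `diagnose_connection("events.pagerduty.com")` and reporting the returned stage gives your network administrator a more specific starting point than "it cannot connect".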

Alert Recovery Does Not Resolve PagerDuty Incident

  1. Go to Settings > All Settings > Manage Alerts.
  2. Select the alert that did not automatically resolve and go to the Reset Actions step.
A screenshot of the SolarWinds Orion UI showing how to edit an alert

Edit Alert

  3. Ensure that it has the Log Recovery to PagerDuty Queue action.
  4. Edit the log-to-queue action, and make sure all of the following apply:
  • The event_type property in the JSON is resolve.
  • The incident_key is the same value as it is in the template for the log-to-queue action in the Trigger Actions step.
  5. Simulate the recovery action and review the content of the file that it creates. Do the same with the alert action. Compare the incident_key values in both files.

When a resolve event is sent to PagerDuty and its deduplication key does not match an existing open incident or alert, the event will be dropped because there is nothing for the resolve to act upon.

If any template variables (e.g., ${...}) are present in the template, ensure that they are the right kind of variable: they must have the same values during the reset action as they have during the trigger action. For instance, node state is a poor choice of template variable for an incident deduplication key, because the node state will be "Up" during the reset action and likely "Down" during the trigger action.
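The comparison of the two simulated files can be sketched as follows. The function name is hypothetical, and the sample values in the usage lines are invented to illustrate the node-state pitfall described above.

```python
import json


def check_resolve_pairing(trigger_text: str, resolve_text: str) -> list[str]:
    """Return the problems that would stop a resolve event from closing the incident."""
    trigger = json.loads(trigger_text)
    resolve = json.loads(resolve_text)
    problems = []
    if resolve.get("event_type") != "resolve":
        problems.append("recovery file's event_type is not 'resolve'")
    trigger_key = trigger.get("incident_key")
    if not trigger_key:
        problems.append("trigger file has no incident_key")
    elif trigger_key != resolve.get("incident_key"):
        problems.append(
            f"incident_key differs: {trigger_key!r} vs {resolve.get('incident_key')!r}"
        )
    return problems


# Invented sample data: a key built from node state changes between trigger and reset.
trigger_file = '{"event_type": "trigger", "incident_key": "node1-Down"}'
resolve_file = '{"event_type": "resolve", "incident_key": "node1-Up"}'
print(check_resolve_pairing(trigger_file, resolve_file))
```

An empty list means the resolve event carries the same incident_key as the trigger event and the expected event_type.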