Reporting: Analyzing common incidents, trends, and patterns

Follow

PagerDuty's advanced CSV export is a powerful new tool that enables managers and analysts to gain insights into their operations performance. Accounts with Advanced Analytics will be able to use our Advanced CSV exports to answer questions like 

Let's get started

To get started,

  1. create a copy of our Google Spreadsheet template, and
  2. download a CSV file of your incidents.

We don't offer an Excel version because Excel's Pivot Tables don't handle significant source data changes well enough to be useful as a template.

The easiest way to download a CSV file of your incidents is to visit the Incidents Report and click Download CSV to the right of any date:



You can download up to a month's worth of data this way. If you'd like more data, click View Online, then change the date range in the upper-right-hand corner. Then download the CSV.



 

This will download up to 10,000 incidents in our new CSV format. You can also download the same CSV format from our System and Incidents reports.

Click here to see what fields would be included in this export. 

Next, follow the instructions on our Google Spreadsheet template. When you are done, you should have a number of new Pivot Table reports based on your exported CSV data.

Let's make use of this and answer a few common reporting questions.

What are our most common incidents?

Check out the "Most Common Incidents" tab. This shows you the most common incidents in descending order of total incident count. You can also see how each incident was resolved.

More Ideas: Add a report "Filter" to limit this to a particular time range, user, escalation policy, or service. You can even select your non-work hours to see what incidents are waking you up.

How many times did I get woken up last month?

Check out the "Incident Count by Hour/Day" tab. This shows you your incident count, by hour and day of week. Consider copying and pasting this to Excel and applying conditional formatting to get a heatmap.

More Ideas: Add a report "Filter" to limit this to a particular time range, user, escalation policy, or service. You can also add a "Resolved by" filter to see the count of incidents resolved by "API," indicating incidents that were most likely auto-resolved by a flapping monitoring service.

How long does it take to resolve incidents?

Check out the "Outage Hours by Day" tab. This will show you the total elapsed time to resolve by day, giving you a sense of how long you have open incidents for.

More Ideas: Try filtering this by escalation policy or user. You could also filter out incidents resolved by timeout.

Which of our services are the noisiest?

Check out the "How are incidents resolved" tab. This will show you the number of incidents for each service resolved by API, indicating incidents that were most likely auto-resolved by a flapping monitoring service.

How many incidents did each user resolve?

Check out the "Number of incidents resolved by user" tab.

What incidents are getting missed by my users?

Go to the "Input" tab, then sort "auto_escalation_count" from Z–A (descending). On top, you'll find the incidents that got auto-escalated the most. Want to see what happened? Visit https://[subdomain].pagerduty.com/incidents/[id], where [subdomain] is your PagerDuty subdomain and [id] is the alpha-numeric ID from Column A.

More Ideas: Try sorting by seconds_to_first_ack or seconds_to_resolve to see incidents that took a long time to acknowledge and resolve. You might also want to filter out auto_resolved = 1 to exclude the incidents that were resolved by timeout.

How often do we meet our TTA and TTR SLAs?

Check out the "TTA and TTR SLA scorecard" tab. This shows you the percentage of incidents that met a 5-m TTA and 60-minute TTR SLA by escalation policy and service.

More Ideas:

  • Filter the pivot table to exclude auto-resolved incidents.
  • You can adjust the SLA by editing the respective columns on the "Input" tab. Just look for "Met 5m TTA SLA?" and "Met 60m TTR SLA?". The SLAs are entered in seconds. Consider changing the column headers, too.
  • You can specify a different TTA and TTR SLA for each service or escalation policy, too. To do this, first create a new tab and list your service or escalation policy names in column A. Then, list your TTA SLA (in seconds) in column B and your TTR SLA (in seconds) in column C. Finally, use the VLOOKUP function in conjunction with the existing IF function to compare the incident's TTA and TTR against the respective TTA and TTR in this sheet.

What is the distribution of our TTAs?

Check out the "TTA Histogram" tab. This will show you the count of incidents broken down by TTA in minutes. The horizontal axis of the sample chart is on a log scale.

How do I group/classify/tag incidents?

On the Inputs sheet, add another column with the function

=REGEXREPLACE(C2, "[0-9]+","")

Next, create a Pivot Table similar to our "Most Common Incidents" sheet using this column instead of column C. This will remove all of the numbers from the incident description, making it easier to see which classes incidents are causing you troubles. Edit the regular expression as you see fit to remove other common terms. For instance, you could remove various host names to focus on what types of incidents are being created across all hosts.

It's also possible to do the above in Excel, but it's a bit more challenging because Excel doesn't offer an out-of-the-box regular expressions function. Thus, even the simple removal of numbers will require ten nested REPLACE statements.

You can also use regular expressions, too.

On the Inputs sheet, add a column with regex and nested IF statements, such as:

=IF(REGEXMATCH(C2, "DB"),"DB",IF(REGEXMATCH(C2, "WWW"),"WWW",IF(REGEXMATCH(C2, "load"),"load","other")))

Here's a more sophisticated example based on the categorization rules one of our customers put together:

=IF(REGEXMATCH(C2, "Monkey Exceptions"),"Exceptions",
IF(REGEXMATCH(C2, "staging"),"Staging",
IF(REGEXMATCH(C2, "nfs"), "NFS",
IF(REGEXMATCH(C2, "ftp"), "FTP",
IF(REGEXMATCH(C2, "db1|db2|db0[1-9]\.mp"),"DB",
IF(REGEXMATCH(C2,"dj"),"DJ",
IF(REGEXMATCH(C2,"splunk"),"Splunk",
IF(REGEXMATCH(C2,"phoenix"),"phoenix",
IF(REGEXMATCH(C2,"reporting-db"),"Reporting DB",
IF(REGEXMATCH(C2,"elastic|es-util|es0|en-util|search-index|search-master"),"Elastic Search",
IF(REGEXMATCH(C2,"web"),"Web",
IF(REGEXMATCH(C2,"cron"),"Cron",""))))))))))))

Note: Remove the line breaks from above if you'd like to leverage this example.

So, an incident description like "[Critical] db04.mp is DOWN" would return simply "DB", making it simple to see the total number of DB-related issues your team faces.

Edit the regular expressions and labels as you see fit, then create a Pivot Table using this column. You may have an easier time editing the nested IF statements in a text editor. This will allow you to build a simple classification model by categorizing various matches against the incident description. You can even build a Pivot Table chart that will show you the incidence of various classes of incidents over time.

Have more questions? Submit a request

Comments