PagerDuty AIOps Quickstart Guide
Set up PagerDuty AIOps in 90 minutes
For many teams, adopting PagerDuty AIOps is key to seeing fewer incidents and faster resolution. By leveraging Machine Learning (ML) and automation, our customers see 14% faster resolution and 87% fewer incidents, and they adopt automation 9x faster than non-AIOps customers. This guide walks you through configuring PagerDuty AIOps in 90 minutes, which is enough time to start seeing less noise and reduced toil across your services.
Create Better Services
A service is defined as a discrete piece of functionality owned by a single team. Creating better services helps our ML improve accuracy of triage context and Intelligent Alert Grouping. Refer to our Service Standards guide to learn more about how to correctly create services in PagerDuty.
Operations Console (5 minutes)
Many organizations look for a single pane of glass during incident response. The Operations Console is the best place to understand your operations' current state. It offers live visibility into incidents, and users can create customized views to triage and take immediate action on issues.
The Operations Console offers:
- Filters and Searches: Need to focus on a specific group of incidents? Leverage configurable tabular and filter components (such as Services, Priority, Escalation Policy) to zero in on your refined view.
- Customizable and Shareable Views: Want to see additional columns? Add, remove, and resize columns in the Operations Console to suit your needs. You can also share your view with your team members so everyone is working from a single source of truth in one centralized location.
- Incident Actions: Need to run an incident workflow and add bulk notes? Take a variety of actions directly within the Operations Console to minimize your MTTA and MTTR.
You can take a guided 5-minute product tour. We recommend experimenting with the filters and columns and starting to build your views.
Automation and Orchestration (60 mins)
PagerDuty AIOps uses Event Orchestration, an automation engine that allows you to create rules that normalize and enrich your data. There are four components to Event Orchestration that you can configure:
- Integrations: Use Global Integration keys in your monitoring tools to port data to an Event Orchestration.
- Global Orchestrations: Allow multiple services to enrich data according to global sets of rules. Global Orchestrations are usually created and managed by a centralized ITOps or SRE team.
- Service Routes: Ensure that when certain conditions are met, incidents are routed to the right team.
- Service Event Orchestrations: Create per-service Orchestrations to enrich data according to the service owner’s criteria. Service Event Orchestrations are usually self-served and managed by distributed service-owning teams.
You will use these components to build the recommended Event Orchestration types below.
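As a minimal sketch of how a monitoring tool might hand data to an Event Orchestration, the Python below builds a trigger event in the shape accepted by the PagerDuty Events API v2. It assumes that a Global Integration key is passed as the event's routing_key. The key, hostname, and custom detail names are hypothetical placeholders, and the sketch only builds the payload rather than sending it.

```python
import json

# Public endpoint of the PagerDuty Events API v2 (shown for reference; not called here).
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by the Events API v2.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity, custom_details=None):
    """Build an Events API v2 trigger event as a plain dict."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if custom_details:
        # Extra fields that orchestration rules could later match on or enrich.
        event["payload"]["custom_details"] = dict(custom_details)
    return event


# Hypothetical placeholder key, host, and details, for illustration only.
example_event = build_trigger_event(
    routing_key="YOUR_GLOBAL_INTEGRATION_KEY",
    summary="High CPU on web-01",
    source="web-01.example.com",
    severity="warning",
    custom_details={"team": "payments", "environment": "production"},
)
print(json.dumps(example_event, indent=2))
```

To actually send the event, you would POST the JSON body to EVENTS_API_URL with any HTTP client. Consistent fields such as the hypothetical team detail above give routing and enrichment rules something stable to match on.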
Incoming Data
By putting PagerDuty AIOps at the front of your event stream, PagerDuty acts as a single pane of glass across monitoring tools and is a force multiplier for automation. To get the most value and best results for noise reduction and automation, all your events should flow through PagerDuty AIOps. To learn about how we integrate with your tools, see our integrations page.
For teams just getting started, here are five types of common Event Orchestrations that we recommend creating within your account:
- Service Routing: With Global Integration keys, you can use a single key to ingest all events for all services relevant to your team. After events are ingested, you can configure Service Routing Rules to define how events are distributed to downstream teams. (5-60 mins, depending on organization size)
- Rules-based Noise Reduction: Event Orchestration has powerful noise reduction actions based on different conditions or scenarios. These actions include suppression, notification pausing, and incident dropping. Create broad conditions that describe classes of events that don't provide value to your teams. Then select the Suppress action to stop those events from generating notifications. (15 mins)
- Standardized Incident Response Procedures: For well understood incidents, add notes to incidents upon creation. These notes should define the standard response procedures for the incident. This is especially helpful to have in place for junior responders who need more context. (5-60 mins depending on existing documentation)
- Major Incident Management: With Event Orchestration variables, you can quickly identify when a major incident is occurring based on historical event data and adjust how rules are applied in those situations. This can help facilitate incident response during a major incident and ensure the right responders are being pulled in. (15 mins)
- Automated Incident Triage: Event Orchestration can automatically assign priorities and severities to events when they are ingested, ensuring incidents are treated correctly when they are created. To do this, define all the conditions that describe a high, medium, or low priority incident in your environment and specify the appropriate priority. (15 mins)
Once you’ve created these Orchestrations, you will only receive incidents relevant to your team, and you’ll be able to resolve them faster with less toil and better data. Most of our customers see a 14% reduction in Mean Time to Resolve (MTTR) and are able to create automation 9x faster.
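To make the routing, suppression, response-note, and triage ideas above concrete, here is a small Python model of condition-and-action rules. It is an illustrative sketch only, not PagerDuty's rule engine or API. The Rule class, the evaluate function, the service names, and the priority labels are all hypothetical, and the "first matching rule wins for routing and priority" behavior is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Rule:
    """A hypothetical rule: a condition plus the actions to apply on a match."""
    condition: Callable[[dict], bool]
    route_to: Optional[str] = None   # service routing
    suppress: bool = False           # rules-based noise reduction
    priority: Optional[str] = None   # automated triage
    note: Optional[str] = None       # standardized response procedure


def evaluate(event, rules, default_service="unrouted"):
    """Apply every matching rule in order; first match wins for route and priority."""
    outcome = {"service": None, "suppressed": False, "priority": None, "notes": []}
    for rule in rules:
        if not rule.condition(event):
            continue
        if rule.route_to and outcome["service"] is None:
            outcome["service"] = rule.route_to
        if rule.priority and outcome["priority"] is None:
            outcome["priority"] = rule.priority
        if rule.suppress:
            outcome["suppressed"] = True
        if rule.note:
            outcome["notes"].append(rule.note)
    if outcome["service"] is None:
        outcome["service"] = default_service
    return outcome


# Hypothetical rules for illustration.
RULES = [
    Rule(condition=lambda e: e.get("team") == "payments", route_to="payments-api"),
    Rule(condition=lambda e: e.get("severity") == "info", suppress=True),
    Rule(
        condition=lambda e: e.get("severity") == "critical",
        priority="P1",
        note="Follow the standard response procedure for critical alerts.",
    ),
    Rule(condition=lambda e: e.get("severity") == "warning", priority="P3"),
]

critical = evaluate({"team": "payments", "severity": "critical"}, RULES)
info = evaluate({"team": "payments", "severity": "info"}, RULES)
unknown = evaluate({"team": "search", "severity": "warning"}, RULES)
print(critical)
print(info)
print(unknown)
```

Writing rules down in this form before building them in the product can help you spot overlapping conditions and events that no rule routes anywhere.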
Noise reduction (20 mins)
Noise reduction is a top priority for all teams. It’s a significant quality-of-life improvement for incident responders, whether they’re a NOC operator or a DevOps engineer.
There are six ways you can set your noise level threshold:
- Intelligent Alert Grouping: Leverages ML to group related alerts based on previous incident data and human interaction. Flexible time windows let grouping extend over a rolling window of anywhere from five minutes to an hour, depending on configuration.
- Content-Based Alert Grouping: Allows you to create custom alert grouping based on known fields between alerts.
- Unified Alert Grouping: Allows you to combine Content-Based Alert Grouping and Intelligent Alert Grouping with a flexible time window for increased precision and correlation control.
- Global Alert Grouping: Allows you to reduce noise by using Content-Based Alert Grouping to group alerts across multiple technical services.
- Time-Based Alert Grouping: Allows you to group alerts based on a static time increment of your choice, ranging from 2 minutes to 24 hours, or even until the incident is resolved.
- Auto-Pause Incident Notifications: Leverages ML to identify alerts that typically auto-resolve and pauses incident creation for them. Select a duration ranging from 2 to 15 minutes.
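As a rough illustration of Time-Based Alert Grouping, the Python sketch below groups alert timestamps for a single service into incidents using a fixed window. It is a simplified model, not PagerDuty's implementation. The group_by_time function is hypothetical, and it assumes the window starts when an incident's first alert arrives and that an alert arriving exactly at the window boundary opens a new incident.

```python
from datetime import datetime, timedelta


def group_by_time(alert_times, window):
    """Group timestamps so each incident collects alerts within `window` of its first alert."""
    incidents = []
    for ts in sorted(alert_times):
        # incidents[-1][0] is the first alert of the most recent incident.
        if incidents and ts - incidents[-1][0] < window:
            incidents[-1].append(ts)
        else:
            incidents.append([ts])
    return incidents


# Hypothetical alert arrival times, in minutes after noon.
base = datetime(2024, 1, 1, 12, 0)
alert_times = [base + timedelta(minutes=m) for m in (0, 1, 3, 6, 7, 20)]

incidents = group_by_time(alert_times, window=timedelta(minutes=5))
group_sizes = [len(group) for group in incidents]
print(group_sizes)
```

With a 5-minute window, the six alerts collapse into three incidents. Running the same data through a longer window shows how the window size trades noise reduction against the risk of merging unrelated alerts.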
PagerDuty AIOps ML Model
PagerDuty AIOps’ unique ML model is always analyzing data signals, so when you turn it on, there is minimal-to-zero training necessary. For longer-term PagerDuty customers who are new to AIOps, the ML will start working immediately, as the platform is continuously learning from responder actions. For new users, the ML will begin learning as soon as you begin using PagerDuty. The ML will become more accurate as you send more events to PagerDuty, and resolve more incidents.
We recommend that new users configure their noise reduction settings as follows:
- Enable Intelligent Alert Grouping for Your Service(s): We recommend starting with 2-5 minutes. If Intelligent Alert Grouping groups an alert incorrectly while it learns, you can regroup the alert. This will train the model to group alerts like this in the future.
- Enable Auto-Pause Incident Notifications for Your Service(s): Toggle Auto-Pause Incident Notifications on to eliminate transient noise.
Intelligent Alert Grouping is the best way to see immediate noise reduction in the environment. For teams with specific alert grouping requirements, we also recommend the following:
- Enable Content-Based Alert Grouping for your service(s): With Content-Based Alert Grouping, alerts that share an exact match on a set of chosen fields will be grouped together into the most recent open incident. You can even leverage Event Orchestration to add custom details that are then used for grouping.
- Enable Time-Based Alert Grouping for your service(s): Time-Based Alert Grouping will automatically add alerts to an open incident for a predetermined period, which can be helpful for services that generate many alerts. You can set the time window you’d like to group by. We recommend 2-5 minutes.
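The exact-match behavior of Content-Based Alert Grouping can be sketched in a few lines of Python. This is a simplified model, not PagerDuty's implementation. The group_by_content function and the alert field names are hypothetical, every incident is assumed to stay open for the duration of the example, and alerts that lack a chosen field are treated as sharing the value None for it.

```python
def group_by_content(alerts, fields):
    """Group alerts whose values match exactly on every chosen field."""
    open_incidents = {}  # grouping key -> alerts in that incident
    for alert in alerts:
        key = tuple(alert.get(name) for name in fields)
        open_incidents.setdefault(key, []).append(alert)
    # Dicts preserve insertion order, so incidents come back in creation order.
    return list(open_incidents.values())


# Hypothetical alerts for illustration.
alerts = [
    {"source": "db-01", "class": "disk_full", "summary": "Disk at 95%"},
    {"source": "db-01", "class": "disk_full", "summary": "Disk at 97%"},
    {"source": "db-02", "class": "disk_full", "summary": "Disk at 91%"},
    {"source": "db-01", "class": "high_cpu", "summary": "CPU at 99%"},
]

by_source_and_class = group_by_content(alerts, ["source", "class"])
by_class = group_by_content(alerts, ["class"])
print([len(group) for group in by_source_and_class])
print([len(group) for group in by_class])
```

Grouping on both fields yields three incidents, while grouping on class alone yields two. Choosing fewer fields groups more aggressively, so it is worth checking that the chosen fields really identify the same underlying problem.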
Triage and Root Cause Analysis (1 minute)
PagerDuty AIOps comes with several ML triage features to provide you with more context during incident response. This also removes the toil of digging through documentation and postmortems to find the key information you’re looking for.
Triage and Root Cause Analysis (RCA) features require zero configuration. When you turn on AIOps, the ML algorithms begin processing data. Similar to other ML-based features, the more data you add to PagerDuty, the more effective these tools will be for you and the faster the ML will learn.
Our ML-based triage and RCA features include:
- Outlier Incident: Tells you whether an incident is frequent, rare, or an anomaly, which helps responders understand how novel it is. The more novel the incident, the more likely responders are to need help.
- Past Incidents: Shows whether an incident like this has occurred in the past, along with how often it occurred over the last 6 months. If it has, you can click into the past incident and view its metadata, such as who was involved or what remediation efforts were used.
- Related Incidents: Shows other current incidents within the system that may be related. This helps you understand dependencies and cascading impact of an incident, and it allows you to provide feedback to improve recommendations over time.
- Probable Origin: Determines the most likely service origin of the incident. This scopes the incident and tells you which team to coordinate with for more insights.
- Change Correlation: Uses ML to correlate incidents with changes. The incident shows recent changes on that service or related services. As most incidents are change-related, this gives you a jump start on triage.
Resources
PagerDuty AIOps Support
If you’re a current PagerDuty AIOps customer and you have further questions, please read our PagerDuty AIOps article, or reach out to your account team. If you’re interested in PagerDuty AIOps, you can sign up for a trial or take our interactive product tour.
Deeper Learning
If you'd like to go deeper into the information presented in this article, you can enroll in our PagerDuty University AIOps Knowledge Series, and work towards becoming a PagerDuty-certified AIOps specialist.
Updated 2 months ago