For many teams, adopting PagerDuty AIOps is key to seeing fewer incidents and faster resolution. By leveraging Machine Learning (ML) and automation, our customers see 14% faster resolution, 87% fewer incidents, and are able to adopt automation 9x faster than non-AIOps customers. This guide provides instructions to configure PagerDuty AIOps in 90 minutes, and in that short amount of time you can see less noise and reduce toil across your services.
Create Better Services
A service is defined as a discrete piece of functionality owned by a single team. Creating better services helps our ML improve the accuracy of triage context and Intelligent Alert Grouping. To learn more about how to correctly create services within PagerDuty, see our Service Standards guide.
PagerDuty AIOps uses Event Orchestration, an automation engine that allows you to create rules that normalize and enrich your data. There are four components to Event Orchestration that you can configure:
- Integrations: Use Global Integration keys in your monitoring tools to port data to an Event Orchestration.
- Global Orchestrations: Allow multiple services to enrich data according to global sets of rules. Global Orchestrations are usually created and managed by a centralized ITOps or SRE team.
- Service Routes: Ensure that when certain conditions are met, incidents are routed to the right team.
- Service Event Orchestrations: Create per-service Orchestrations to enrich data according to the service owner’s criteria. Service Event Orchestrations are usually self-served and managed by distributed service-owning teams.
You will use these components to build the recommended Event Orchestration types below.
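As a rough illustration of the Integrations component, the sketch below builds an event in the shape accepted by the PagerDuty Events API v2: a `routing_key`, an `event_action`, and a `payload` with `summary`, `source`, and `severity`. The integration key, host name, and `custom_details` values are placeholders, and the helper name `build_event` is hypothetical. Treat it as a starting point and check the Events API v2 reference before relying on any field.

```python
import json

# Public Events API v2 endpoint; events are sent here with an HTTP POST.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by Events API v2.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_event(routing_key, summary, source, severity, custom_details=None):
    """Return a trigger event as a dict ready to be JSON-encoded (hypothetical helper)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"unsupported severity: {severity!r}")
    event = {
        "routing_key": routing_key,  # integration key; a placeholder is used below
        "event_action": "trigger",
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if custom_details:
        # Extra context that orchestration rules could later match on or enrich.
        event["payload"]["custom_details"] = custom_details
    return event


event = build_event(
    routing_key="YOUR_GLOBAL_INTEGRATION_KEY",  # placeholder, not a real key
    summary="Disk usage above 90% on db-01",
    source="db-01.example.com",
    severity="warning",
    custom_details={"environment": "production"},
)

# POST this body to EVENTS_API_URL with Content-Type: application/json.
body = json.dumps(event)
```

The block deliberately stops short of sending the request, so it can be run without credentials; any HTTP client can post `body` to the endpoint.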
By putting PagerDuty AIOps at the front of your event stream, PagerDuty acts as a single pane of glass across monitoring tools and is a force multiplier for automation. To get the most value and best results for noise reduction and automation, all your events should flow through PagerDuty AIOps. To learn about how we integrate with your tools, see our integrations page.
For teams just getting started, here are five types of common Event Orchestrations that we recommend creating within your account:
- Service Routing: With Global Integration keys, you can use a single key to ingest all events for all services relevant to your team. After events are ingested, you can configure Service Routing Rules to define how events are distributed to downstream teams. (5-60 mins, depending on organization size)
- Rules-based Noise Reduction: Event Orchestration has powerful noise reduction actions based on different conditions or scenarios. These actions include suppression, notification pausing, and incident dropping. Create broad conditions that describe classes of events that don't provide value to your teams. Then select the Suppress action to stop those events from generating notifications. (15 mins)
- Standardized Incident Response Procedures: For well understood incidents, add notes to incidents upon creation. These notes should define the standard response procedures for the incident. This is especially helpful to have in place for junior responders who need more context. (5-60 mins depending on existing documentation)
- Major Incident Management: With Event Orchestration contextual conditions like thresholds, you can quickly identify when a major incident is occurring and adjust how rules are applied in those situations. This can help facilitate incident response during a major incident and ensure the right responders are being pulled in. (15 mins)
- Automated Incident Triage: Event Orchestration can automatically assign priorities and severities to events when they are ingested, so that the resulting incidents are handled appropriately from the start. To do this, define the conditions that describe a high, medium, or low priority incident in your environment and specify the appropriate priority for each. (15 mins)
Once you’ve created these Orchestrations, you will only receive incidents relevant to your team and you’ll be able to resolve them faster with less toil and better data. Most of our customers see a 14% reduction in Mean Time to Resolve (MTTR) and are able to create automation 9x faster.
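To make the five Orchestration types more concrete, here is a minimal conceptual sketch of condition-and-action rules covering routing, suppression, priority assignment, and response notes. It is not PagerDuty's rule syntax or evaluation engine: the `Rule` and `Outcome` types, the action helpers, and the choice to apply every matching rule in order are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Outcome:
    """What happens to one event after the rules run (hypothetical model)."""
    service: Optional[str] = None
    suppressed: bool = False
    priority: Optional[str] = None
    notes: List[str] = field(default_factory=list)


@dataclass
class Rule:
    condition: Callable[[dict], bool]  # predicate over the incoming event
    action: Callable[[Outcome], None]  # mutates the outcome when the rule matches


def route_to(service):
    def apply(outcome):
        outcome.service = service
    return apply


def suppress(outcome):
    outcome.suppressed = True


def set_priority(priority):
    def apply(outcome):
        outcome.priority = priority
    return apply


def add_note(text):
    def apply(outcome):
        outcome.notes.append(text)
    return apply


def evaluate(event, rules):
    """Apply every matching rule in order (an assumption of this sketch)."""
    outcome = Outcome()
    for rule in rules:
        if rule.condition(event):
            rule.action(outcome)
    return outcome


RULES = [
    Rule(lambda e: e.get("component") == "database", route_to("Database Service")),
    Rule(lambda e: e.get("severity") == "info", suppress),
    Rule(lambda e: e.get("severity") == "critical", set_priority("P1")),
    Rule(lambda e: "disk" in e.get("summary", "").lower(),
         add_note("Runbook: free disk space, then verify replication.")),
]

result = evaluate(
    {"component": "database", "severity": "critical", "summary": "Disk full on db-01"},
    RULES,
)
```

In this example the event is routed to the hypothetical "Database Service", assigned priority "P1", annotated with a runbook note, and not suppressed.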
Noise reduction is a top priority for all teams. It’s a significant quality-of-life improvement for incident responders, whether they’re a NOC operator or a DevOps engineer.
There are four ways you can set your noise level threshold:
- Intelligent Alert Grouping: Leverages ML to group related alerts based on previous incident data and human interaction. Depending on configuration, the rolling grouping window can extend anywhere from five minutes to an hour.
- Content-Based Alert Grouping: Allows you to create custom alert grouping based on known fields between alerts.
- Time-Based Alert Grouping: Allows you to create grouping based on a static time increment of your choice ranging from 2 minutes to 24 hours, or even until the incident is resolved.
- Auto-Pause Incident Notifications: Leverages ML to identify alerts that typically resolve on their own and pauses incident creation for them. Select a pause duration of 2 to 15 minutes.
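The idea behind Time-Based Alert Grouping can be sketched in a few lines. The example below assumes a fixed window that starts with the first alert of each group; the function name `group_by_time` and the sample timestamps are hypothetical, and the product's actual windowing behavior may differ.

```python
from datetime import datetime, timedelta


def group_by_time(alert_times, window):
    """Group alert timestamps into fixed windows.

    A new group starts whenever an alert arrives after the current
    group's window (measured from its first alert) has elapsed.
    """
    groups = []
    window_start = None
    for ts in sorted(alert_times):
        if window_start is None or ts - window_start >= window:
            groups.append([ts])
            window_start = ts
        else:
            groups[-1].append(ts)
    return groups


base = datetime(2024, 1, 1, 12, 0)
# Alerts arriving 0, 1, 4, 6, 7, and 20 minutes after noon.
alerts = [base + timedelta(minutes=m) for m in (0, 1, 4, 6, 7, 20)]
groups = group_by_time(alerts, window=timedelta(minutes=5))
sizes = [len(g) for g in groups]
```

With a 5-minute window, the six alerts collapse into three groups, so responders would see three incidents instead of six.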
PagerDuty AIOps ML Model
PagerDuty AIOps’ unique ML model is always analyzing data signals, so when you turn it on, there is minimal-to-zero training necessary. For longer-term PagerDuty customers who are new to AIOps, the ML will start working immediately, as the platform is continuously learning from responder actions. For new users, the ML will begin learning as soon as you begin using PagerDuty. The ML will become more accurate as you send more events to PagerDuty, and resolve more incidents.
We recommend that new users configure their noise reduction settings as follows:
- Enable Intelligent Alert Grouping for Your Service(s): We recommend starting with 2-5 minutes. If Intelligent Alert Grouping groups an alert incorrectly while it learns, you can regroup the alert. Regrouping trains the model to handle similar alerts correctly in the future.
- Enable Auto-Pause Incident Notifications for Your Service(s): Toggle Auto-Pause Incident Notifications on to eliminate transient noise.
Intelligent Alert Grouping is the best way to see immediate noise reduction in the environment. For teams with specific alert grouping requirements, we also recommend the following:
- Enable Content-Based Alert Grouping for your service(s): With Content-Based Alert Grouping, alerts that share an exact match on a set of chosen fields will be grouped together into the most recent open incident. You can even leverage Event Orchestration to add custom details that then are grouped by content.
- Enable Time-Based Alert Grouping for your service(s): Time-Based Alert Grouping will automatically add alerts to an open incident for a predetermined period, which can be helpful for services that generate many alerts. You can set the time window you’d like to group by. We recommend 2-5 minutes.
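Content-Based Alert Grouping, as described above, comes down to exact matching on a chosen set of fields. The sketch below illustrates only that matching step; the function name `group_by_content`, the field names, and the sample alerts are hypothetical, and it ignores whether an incident is still open.

```python
def group_by_content(alerts, fields):
    """Group alerts whose values match exactly on every chosen field."""
    incidents = {}  # maps a tuple of field values to the alerts sharing them
    for alert in alerts:
        key = tuple(alert.get(name) for name in fields)
        incidents.setdefault(key, []).append(alert)
    return incidents


alerts = [
    {"source": "db-01", "class": "disk", "summary": "Disk 91% full"},
    {"source": "db-01", "class": "disk", "summary": "Disk 95% full"},
    {"source": "db-02", "class": "disk", "summary": "Disk 92% full"},
    {"source": "db-01", "class": "cpu", "summary": "CPU at 99%"},
]

# Group on source and class; the differing summaries do not split a group.
incidents = group_by_content(alerts, fields=("source", "class"))
```

Here the two disk alerts from `db-01` land in the same group even though their summaries differ, because only the chosen fields are compared.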
PagerDuty AIOps comes with several ML triage features to provide you with more context during incident response. This also removes the toil of digging through documentation and postmortems to find the key information you’re looking for.
Triage and Root Cause Analysis (RCA) features require zero configuration. When you turn on AIOps, the ML algorithms begin processing data. Similar to other ML-based features, the more data you add to PagerDuty, the more effective these tools will be for you and the faster the ML will learn.
Our ML-based triage and RCA features include:
- Outlier Incident: Tells you whether an incident is frequent, rare, or an anomaly, which helps responders understand how novel it is. The more novel the incident, the more likely responders are to need help.
- Past Incidents: Shows whether a similar incident has occurred in the past and how frequently it has occurred over the last 6 months. If it has, you can open the past incident and view its metadata, such as who was involved or what remediation efforts were used.
- Related Incidents: Shows other current incidents within the system that may be related. This helps you understand dependencies and cascading impact of an incident, and it allows you to provide feedback to improve recommendations over time.
- Probable Origin: Determines the most likely service origin of the incident. This scopes the incident and tells you which team to coordinate with for more insights.
- Change Correlation: Uses ML to correlate incidents with changes. The incident shows recent changes on that service or related services. As most incidents are change-related, this gives you a jump start on triage.
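The "recent changes on that service or related services" idea from Change Correlation can be approximated with a simple time-window filter. The sketch below is a plain heuristic, not PagerDuty's ML model; the function name `recent_changes`, the one-hour lookback, the dictionary fields, and the sample data are all assumptions.

```python
from datetime import datetime, timedelta


def recent_changes(incident, changes, related_services=(), lookback=timedelta(hours=1)):
    """Return changes on the incident's service (or related ones) that
    landed within `lookback` before the incident, newest first."""
    services = {incident["service"], *related_services}
    start = incident["created_at"] - lookback
    matches = [
        change for change in changes
        if change["service"] in services
        and start <= change["timestamp"] <= incident["created_at"]
    ]
    return sorted(matches, key=lambda change: change["timestamp"], reverse=True)


noon = datetime(2024, 1, 1, 12, 0)
incident = {"service": "checkout", "created_at": noon}
changes = [
    {"service": "checkout", "summary": "Deploy v42", "timestamp": noon - timedelta(minutes=10)},
    {"service": "payments", "summary": "Config update", "timestamp": noon - timedelta(minutes=30)},
    {"service": "checkout", "summary": "Deploy v41", "timestamp": noon - timedelta(hours=3)},
    {"service": "search", "summary": "Index rebuild", "timestamp": noon - timedelta(minutes=5)},
]

candidates = recent_changes(incident, changes, related_services=("payments",))
summaries = [change["summary"] for change in candidates]
```

The older checkout deploy falls outside the lookback window and the search change is on an unrelated service, so only the two recent, relevant changes remain as triage candidates.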
Many organizations look for a single pane of glass during incident response. The Visibility Console is the best place to understand the current state of the system. Especially for ITOps teams, this is command central. With its modular design, it’s also quick and intuitive to configure to your organization’s needs. Just select the module you want and the data will populate. Modules include:
- Markdown: Allows you to add any text or notes to your console. This could be used to display instructions, add hyperlinks to frequently-used documentation, link to runbooks, or take general notes during an on-call shift or investigation.
- Incidents: Displays a real-time list, based on the filters selected, of the most recent open incidents that have been triggered.
- Services: Displays the associated team (if there is one configured), the number of open incidents, the time of the last incident, and the current state of the service.
- Custom URL: Allows you to embed external web pages into an iframe, in order to add external monitoring to the Visibility Console. Any PagerDuty webpage is embeddable, as well as many external status pages and public dashboard URLs.
- Status Dashboard: Displays service health information from your Status Dashboard so you can monitor your business services in one place.
- Service Activity: Allows you to see incident activity across many of your services at once, so you can quickly understand your digital operations health, and identify if there’s a widespread major incident.
- On-Call Responders: Provides a quick and easy way to search which users are on call across all escalation policies.
- Incidents and Changes Timeline: Displays a timeline of incidents and change events across all services. The module can be filtered by the last 7 days, 24 hours, 12 hours, 6 hours or 1 hour.
We recommend starting with Status Dashboard, Service Activity, and Incidents.
If you’re a current PagerDuty AIOps customer and you have further questions, please read our PagerDuty AIOps article, or reach out to your account team. If you’re interested in PagerDuty AIOps, you can sign up for a trial or take our interactive product tour.
If you'd like to go deeper into the information presented in this article, you can enroll in our PagerDuty University AIOps Knowledge Series, and work towards becoming a PagerDuty-certified AIOps specialist.