SRE Agent

A continuously learning agent to help rapidly diagnose, troubleshoot, and remediate issues

PagerDuty's SRE Agent transforms incident response in the Operations Console and Slack by automatically analyzing incidents, providing key context, and recommending remediation actions. It accelerates triage to reduce risk, cost, and cognitive load, and it continuously learns to prevent repeat issues.

📘

Availability

To access the SRE Agent within Operations Console, you must have AIOps and PagerDuty Advance. To access SRE Agent in Slack, you must have PagerDuty Advance.

PagerDuty Advance is available through one-time credits or as an add-on with the following pricing plans:

  • Enterprise
  • Business
  • Professional

Please contact our Sales Team to upgrade to a pricing plan with PagerDuty Advance or AIOps. If you do not currently have PagerDuty Advance or AIOps, we will begin a trial in order to give you access to the SRE Agent.

🚧

PagerDuty Advance Disclosure

Please read our PagerDuty Advance AI Disclosure for more information about how we designed, built and assessed PagerDuty Advance with mission critical work in mind.

Overview

The SRE Agent works side by side with users, collecting information and learning how each incident was ultimately resolved. It always provides a summary and collects user feedback, so the system improves continuously. Once users triage an incident and provide information, that information is available for future incidents.

Key Features

  • Ingest and analyze user runbooks, SOPs, and diagnostics (e.g., error logs)
  • Generate and save playbooks for recurring issues
  • Prioritize actions by urgency and impact
  • Surface likely root causes from available data
  • Recommend diagnostic and remediation steps
  • Detect patterns; recall similar incidents and past resolutions
  • Provide structured troubleshooting for incidents, services, and infrastructure
  • Organize context into actionable views with interactive nudges/buttons
  • Summarize conversations and inputs, mark resolved, and save learnings

📘

PagerDuty Advance Credits

Accounts with PagerDuty Advance have an allotment of credits at their disposal. The SRE Agent uses four credits whenever you:

  • Submit a request to the SRE Agent via Slack or Operations Console (e.g., “what are my past incidents,” “what is the likely root cause”)
  • Click a nudge button (e.g., Update a Runbook, Analyze Past Incidents)

Please refer to How many credits does each action cost for more information.
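
As a rough illustration of the credit arithmetic described above, the following minimal sketch estimates how many credits a triage session might consume. It assumes only the four-credit cost stated above; the function name `estimate_credits` and its parameters are hypothetical and are not part of any PagerDuty API. Refer to How many credits does each action cost for the authoritative figures.

```python
# Assumption taken from the text above: each chat request or nudge click
# made to the SRE Agent costs four PagerDuty Advance credits.
CREDITS_PER_ACTION = 4


def estimate_credits(requests: int, nudge_clicks: int) -> int:
    """Estimate credits used by a number of chat requests and nudge clicks.

    Hypothetical helper for planning purposes only.
    """
    if requests < 0 or nudge_clicks < 0:
        raise ValueError("counts must be non-negative")
    return (requests + nudge_clicks) * CREDITS_PER_ACTION


if __name__ == "__main__":
    # Five questions plus two nudge clicks: (5 + 2) * 4 = 28 credits.
    print(estimate_credits(requests=5, nudge_clicks=2))  # prints 28
```

Multiplying a typical number of interactions per incident by your expected incident volume gives a rough sense of how quickly an allotment is used.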

Use SRE Agent

After configuring PagerDuty Advance, you can access the SRE Agent in either the Operations Console or Slack.

In the Operations Console

  1. In the PagerDuty web app, navigate to AIOps > Operations Console.

    1. Optional: Add the SRE Agent column to the Operations Console as a default column for faster incident triage.

      SRE Agent Column in the Operations Console

  2. Select an incident by clicking on the incident Title.

  3. Select the SRE Agent tab and wait for the agent to load your incident summary and suggest next steps.

  4. Begin troubleshooting the incident with the agent by asking questions and providing information.

  5. Use the SRE Agent’s “nudges” (buttons) to take recommended actions during the conversation, for example, Upload a service runbook.

  6. Upload additional files by adding an attachment.

In Slack

📘

Before you begin

You will need to configure Slack with the SRE Agent. Please see our instructions to Configure the Slack Integration and Connect PagerDuty Advance to Slack if you have not already done so.

👍

Note

If you do not have PagerDuty AIOps, you may still access certain PagerDuty AIOps information, such as related incidents, past incidents, change events and outlier incidents, but only within the Slack chat interface. This information will not be available outside of the agent’s Slack chat interface (e.g. in the Incident Management web app).

  1. Access the SRE Agent in the following channel types:
    1. Dedicated incident channels
    2. Team or service based Slack channels
  2. Start the SRE Agent by:
    1. Selecting the SRE Agent Triage button
    2. Asking questions in the chat. Use @pagerduty with a question related to the incident. See the list of example questions.
  3. Upload a file to the SRE Agent in Slack:
    1. Click the upload runbook or update runbook button.
    2. Follow the prompts and select the runbook to upload.
    3. Press Submit.
Slack Interface with SRE Agent

Integrations

SRE Agent can retrieve log data from observability platforms such as Grafana, Datadog, and AWS CloudWatch, and runbooks from sources such as Confluence and GitHub. By analyzing these logs and runbooks, SRE Agent guides responders through investigation, triage, and resolution, which helps reduce MTTR and escalations. When setting up one of these workflow integrations, select Allow SRE Agent access to use this connection.

📘

Setting up an Integration

For more on setting up SRE Agent integrations, please view the article on Agent Tooling Configuration.

Supported Actions

The SRE Agent uses nudges to recommend supported actions such as:

  • Upload Runbook: For first-time setup on a service
  • Update Runbook: When a runbook already exists
  • Analyze Past Incidents: Review history and patterns
  • Analyze Related Incidents: Identify correlations and impact
  • Generate a Playbook: Create repeatable response steps
  • Check Change Events: Verify recent changes for possible cause

Example Questions

The questions below are representative of what the SRE Agent can answer. Because the SRE Agent uses large language models, your questions do not need to match the exact wording shown.

| Example Question to SRE Agent | Description |
| --- | --- |
| Can you analyze past incidents to see how this was resolved before? | See which similar incidents previously occurred on that service |
| Can you provide a list of related incidents? | See which active incidents on other services might be related |
| How do I check [insert service, infrastructure information] for this specific error? | Ask questions related to the type of incident you are troubleshooting |
| What information should I gather for this incident to help troubleshoot? | Understand what information is needed to troubleshoot the incident |
| Should I do step X or step Y first to troubleshoot this incident? | Recommends which of the suggested next steps to take first, with additional context on its reasoning |
| How urgent is this incident based on the data? | Understand incident urgency |
| What steps should I take to troubleshoot this issue? | Suggested remediation steps |
| Can you generate a playbook for resolving this error? | Develops a playbook based on the agent’s understanding of the incident |
| How do I check the logs for this specific error? | Instructions for checking logs, which may include a sample query |
| How can I prevent this error from recurring? | Suggestions for preventing recurrence, such as service or infrastructure improvements |
| Is there a pattern to when these errors occur? | Analysis of incident patterns |
| What's the impact of this error on our systems? | Potential impact on other related services or infrastructure, based on the incident context |
| What are some likely root causes for this incident? | Suggested root causes for the incident |

SRE Agent Memory Definitions

The SRE Agent maintains several types of memory to provide increasingly relevant and personalized assistance over time. Understanding these memory types helps you leverage the agent's full capabilities:

SRE Incident Playbook ("Scratchpad")

  • Continuously learns from your organization's historical incident data.
  • Automatically generates prioritized resolution steps tailored to your environment by analyzing patterns from past incidents.
  • Provides AI-powered baseline recommendations, even for novel incidents.
  • Delivers context-aware troubleshooting guidance that improves with each resolved incident.

Customer Service Runbook

  • Stores and references your runbooks, SOPs and documentation (e.g., Confluence or GitHub pages).
  • Remembers manually-provided documentation for future conversations.
  • Recommends steps that align with your organization's established procedures and standards.

Incident Summarization

  • Automatically creates comprehensive summaries when incidents are marked complete.
  • Captures key learnings, resolution details and troubleshooting paths without manual documentation effort.
  • Builds an ever-growing repository of institutional knowledge that benefits future incident response.

Best Practices

Use these quick tips to get the most from the SRE Agent—share context fast, collaborate in triage, and improve outcomes over time.

Provide Relevant Documentation and Context

Upload any runbooks, SOPs, or knowledge base articles related to the affected service or architecture so everyone has the right context.

Interact with the Agent

Treat the SRE Agent like your triage buddy and ask questions whenever you get stuck. Share any critical findings or remediation steps you’ve taken during the incident so the agent stays informed and learns over time.

Provide Performance Feedback

Report whether each suggested troubleshooting step was a success or a failure. If a step fails, tell the SRE Agent so it can suggest alternative actions and keep the triage moving forward.

Generate Incident Playbooks

At the end of each incident, request a summary playbook. Review it for accuracy and completeness, then copy the approved version to your knowledge base for future use.

Rate AI Responses

Use the “Rate AI Response” option next to each suggestion or recommendation in the conversation. Provide feedback to help improve the product and future recommendations. Your input helps the system learn and deliver better assistance over time.

Rate AI Response Button

FAQ

What incident data can the SRE Agent analyze?

The SRE Agent analyzes:

  • Event and alert payload information
  • Historical and related incidents
  • Change events
  • User-provided data (runbooks, logs, documentation)

Current limitations: The SRE Agent has limited access to incident timeline details, incident workflows, and alert grouping data. Support for these will be added in future releases to enhance SRE Agent capabilities.

What happens if a recommendation or incident summary is not correct?

If a suggestion is not helpful, tell the agent, explain why, and ask for an alternative recommendation. You should also rate the response; ratings are analyzed to improve future recommendations.

What file types and limits exist for file upload?

We currently support .txt and .md files. A conversation can include up to 25 files, each no larger than 100 KB.
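
The limits above can be checked locally before uploading. The following is a minimal sketch, not an official tool: the function name `check_uploads` is hypothetical, the limit is assumed to mean 102,400 bytes per file, and the extension check is assumed to be case-insensitive. None of these details are confirmed by this page.

```python
from pathlib import Path

# Limits taken from the FAQ answer above.
ALLOWED_EXTENSIONS = {".txt", ".md"}
MAX_FILES_PER_CONVERSATION = 25
# Assumption: the per-file limit is interpreted as 100 * 1024 bytes.
MAX_FILE_BYTES = 100 * 1024


def check_uploads(files: dict[str, int]) -> list[str]:
    """Return a list of problems for a planned set of uploads.

    `files` maps each file name to its size in bytes. An empty list
    means the set fits within the documented limits.
    """
    problems = []
    if len(files) > MAX_FILES_PER_CONVERSATION:
        problems.append(
            f"{len(files)} files exceeds the limit of "
            f"{MAX_FILES_PER_CONVERSATION} per conversation"
        )
    for name, size in files.items():
        # Assumption: extensions are compared case-insensitively.
        extension = Path(name).suffix.lower()
        if extension not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_FILE_BYTES}")
    return problems


if __name__ == "__main__":
    planned = {"checkout-runbook.md": 20_000, "heap-dump.pdf": 5_000}
    for problem in check_uploads(planned):
        print(problem)
```

Files that fail the check can be converted to .txt or .md, or split into smaller files, before you upload them to the conversation.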