SRE Agent
A continuously learning agent to help rapidly diagnose, troubleshoot, and remediate issues
PagerDuty's SRE Agent transforms incident response in the Operations Console by automatically analyzing incidents, providing key context, and recommending remediation actions. It reduces the risk and cost of operational failures by accelerating triage, decreases cognitive load during critical incidents, and helps prevent recurring issues by continuously learning from past remediation paths.
Users can provide the SRE Agent with diagnostic information, such as error logs, as well as runbook and SOP documentation.
Release Status and Availability
SRE Agent is in Early Access with features and documentation subject to change.
SRE Agent requires the AIOps add-on (or an AIOps trial) and PagerDuty Advance (or a PagerDuty Advance trial). PagerDuty Advance is available through one-time credits or as an add-on with the following pricing plans:
- Enterprise
- Business
- Professional
Please contact our Sales Team to upgrade to a pricing plan with PagerDuty Advance. If you do not currently have a PagerDuty Advance trial, we will begin a trial in order to give you access to the SRE Agent.
Note: While SRE Agent is in Early Access, usage credits will not be consumed. Credits will be consumed once the SRE Agent is released for General Availability.
Please read our PagerDuty Advance AI Disclosure for more information about how we designed, built, and assessed PagerDuty Advance with mission-critical work in mind.
Overview
The goal of the SRE Agent is to work side by side with users, collecting information and learning how each incident is ultimately resolved. The SRE Agent always provides a summary and collects feedback from the user, so the system improves continuously. Once users triage an incident and provide information, that information is available for future incidents.
Use SRE Agent
- In the PagerDuty web app, navigate to AIOps Operations Console.
- Optional: Add the SRE Agent column to the Operations Console for quicker access to the agent.

- Select your preferred incident's title.
- Select the SRE Agent tab and wait a moment for the agent to load your incident summary and suggested next steps.
- Begin troubleshooting the incident with the agent by asking questions and providing information.
- Use “nudges” or contextual buttons provided by the SRE Agent to take the prescribed action during the conversation, for example, Upload a service runbook.
SRE Agent Scope
Currently, the SRE Agent only analyzes incidents that originate from the Events API. Incidents created from email events and manually created incidents are not included during Early Access. These sources may be added during the Early Access period; please contact the product team directly for further information.
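For reference, an incident that falls within this scope typically starts as a trigger event sent to the Events API. The following is a minimal sketch, assuming the public Events API v2 payload format (`routing_key`, `event_action`, and a `payload` with `summary`, `source`, and `severity`). The helper names `build_trigger_event` and `send_event` are hypothetical and exist only for illustration; the routing key is your own integration key.

```python
import json
import urllib.request

# Public Events API v2 endpoint.
EVENTS_API_URL = "https://events.pagerduty.com/v2/enqueue"

# Severity values accepted by Events API v2.
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source, severity="error", details=None):
    """Build a trigger event body (hypothetical helper, for illustration only)."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    event = {
        "routing_key": routing_key,   # integration key of the target service
        "event_action": "trigger",    # opens (or re-triggers) an alert
        "payload": {
            "summary": summary,
            "source": source,
            "severity": severity,
        },
    }
    if details:
        # Extra context for responders, carried in the event payload.
        event["payload"]["custom_details"] = details
    return event


def send_event(event):
    """POST the event to the Events API and return the parsed JSON response."""
    request = urllib.request.Request(
        EVENTS_API_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Calling `send_event(build_trigger_event("YOUR_ROUTING_KEY", "Disk usage above 90%", "db-01", "critical"))` would submit the event. Richer payloads, such as those with `custom_details`, give responders and tooling more context to work with, since event and alert payload information is among the data the agent analyzes.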
Key Features
- Onboarding and Agreement
  - Greets users with a welcome message and outlines the terms of use.
- Knowledge Ingestion
  - Allows users to upload runbooks and Standard Operating Procedures (SOPs) for analysis.
  - Supports uploading and analysis of diagnostics and error logs.
- Automated Playbook Generation
  - Generates playbooks for recurring issues, which can be saved and reused for future incidents.
- Incident Analysis and Resolution
  - Analyzes incident data to identify patterns and frequency of issues.
  - Recalls and reviews past similar incidents to surface previous resolutions.
  - Identifies potential root causes based on the information provided.
  - Recommends specific actions to diagnose and resolve incidents.
  - Helps prioritize actions based on incident urgency and impact.
- Guided Troubleshooting
  - Provides structured troubleshooting steps for incidents, service disruptions, and infrastructure errors.
  - Organizes all information in a clear, actionable format.
- Conversation Management
  - Offers interactive “nudges” or action buttons to guide users through the resolution process.
  - Marks conversations as resolved once the issue is fixed.
  - Summarizes the incident conversation, including user-provided information (such as error logs and runbooks), and saves it for future reference.
Example Questions
The following questions are representative of the types of questions the SRE Agent can answer. The SRE Agent leverages large language models, so questions do not have to be phrased exactly as shown below.
Example Question | Description |
---|---|
Can you analyze past incidents to see how this was resolved before? | Ability to see what similar incidents occurred previously for that service |
Can you provide a list of related incidents? | Ability to see active incidents on other services that might be related |
How do I check [insert service, infrastructure information] for this specific error? | Ability to ask questions related to the type of incident you are troubleshooting |
What information should I gather for this incident to help troubleshoot? | Ability to understand what type of information is needed to troubleshoot the incident |
How urgent is this incident based on the data? | Ability to understand incident urgency |
What steps should I take to troubleshoot this issue? | Suggested remediation steps |
Can you generate a playbook for resolving this error? | SRE Agent develops a playbook based on the agent’s understanding of the incident |
How do I check the logs for this specific error? | Instructions for checking logs, which may include sample queries |
How can I prevent this error from recurring? | Suggestions for preventing similar incidents in the future, such as service or infrastructure improvements |
Is there a pattern to when these errors occur? | Analysis of incident patterns |
What's the impact of this error on our systems? | Potential impact on other related services or infrastructure based on the incident context |
What are some likely root causes for this incident? | Suggested root causes for the incident |
SRE Agent Memory Definitions
- SRE Incident Playbook ("Scratchpad")
  - Dynamic knowledge base that learns from historical incident data
  - Generates prioritized resolution steps based on patterns from past incidents
  - Uses AI to provide baseline recommendations even when no historical data is available
- Customer Service Runbook
  - Accepts runbooks or Standard Operating Procedures (SOPs) from knowledge bases (such as Confluence) that users provide manually to the agent, and references them in future incidents
- Incident Summarization
  - Automatically generates comprehensive summaries upon incident completion
  - Captures key learnings and resolution details for future use
Best Practices
- Provide Relevant Documentation and Context
  - Upload runbooks, SOPs, or knowledge base articles related to the affected service or architecture.
- Interact with the Agent
  - Treat the SRE Agent as your "triage buddy" - ask questions whenever you get stuck during troubleshooting.
  - Share critical findings or remediation steps you've taken during the incident to keep the agent informed and help the agent learn over time.
- Provide Performance Feedback
  - Report the outcome of each suggested troubleshooting step (i.e., success or failure).
  - When steps fail, inform the SRE Agent so it can offer alternative recommendations to keep the triage process moving forward.
- Generate Incident Playbooks
  - Request an incident playbook summary at the end of each incident.
  - Review the generated playbook for accuracy and completeness.
  - Copy approved playbooks to your knowledge base for future reference.
- Rate AI Responses
  - Use the "Rate AI Response" option available with each suggestion or recommendation in the conversation window.
  - Provide feedback to help improve both the product and future recommendations.
  - Your input helps the system learn and deliver better assistance over time.
FAQ
What incident data can the SRE Agent analyze?
During Early Access, the SRE Agent analyzes:
- Event and alert payload information
- Historical and related incidents
- Change events
- User-provided data (runbooks, logs, documentation)
Current limitations: Limited access to incident timeline details, incident workflows, and alert grouping data. These features will be added in future releases to enhance SRE Agent capabilities.
What happens if a recommendation or incident summary is not correct?
Tell the agent that the suggestion was not helpful, explain why, and ask for an alternative recommendation. You should also rate the response; ratings are analyzed to improve future recommendations.
What file types and limits exist for file upload?
We currently support .txt and .md files. A single conversation can have up to 25 files in total, and each file can be at most 100 KB.
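If you prepare files in bulk before a conversation, a quick local check against these limits can save a failed upload. The following is a minimal sketch, not part of the product: the function and constant names are hypothetical, and it assumes the per-file limit means 100 * 1024 bytes.

```python
from pathlib import PurePath

# Limits as stated in the FAQ above; all names here are illustrative.
ALLOWED_EXTENSIONS = {".txt", ".md"}
MAX_FILES_PER_CONVERSATION = 25
MAX_FILE_BYTES = 100 * 1024  # assumption: the 100 KB limit is 100 * 1024 bytes


def check_uploads(files):
    """Check (filename, size_in_bytes) pairs against the documented limits.

    Returns a list of human-readable problems; an empty list means the
    batch fits within the limits.
    """
    problems = []
    if len(files) > MAX_FILES_PER_CONVERSATION:
        problems.append(
            f"{len(files)} files exceeds the limit of {MAX_FILES_PER_CONVERSATION}"
        )
    for name, size in files:
        ext = PurePath(name).suffix.lower()
        if ext not in ALLOWED_EXTENSIONS:
            problems.append(f"{name}: unsupported file type {ext or '(none)'}")
        if size > MAX_FILE_BYTES:
            problems.append(f"{name}: {size} bytes exceeds {MAX_FILE_BYTES} bytes")
    return problems
```

For example, `check_uploads([("runbook.md", 2048), ("errors.log", 512)])` reports the `.log` file as unsupported, which suggests renaming or exporting logs as `.txt` before uploading them.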
Updated 3 days ago