There are 5 steps in the Incident Response Cycle.
Addressing each step in the incident response cycle will ultimately help your team drive down incident resolution times by (1) focusing on actionable alerts, (2) getting the right people to review those alerts, and (3) understanding the full impact of the problem so that (4) the appropriate remediation steps can be taken and (5) reviewed to make sure the same problem does not happen again. In this article, we will discuss Step 2: Notify.
- What does "Notify" mean?
- What kinds of common business or operational challenges come up when we do not optimize our alerting?
Notifying means informing the right people at the right time about an alert. When you use tools to notify your team properly, the right people can come together quickly to start working on an incident, reducing your response and resolution times and minimizing the overall business impact of the problem.
What kinds of common business or operational challenges come up when we do not optimize our alerting?
- We are notifying too many people about a single event.
- We are taking too long to get the right people to investigate an issue.
- We are missing or having trouble staying compliant with our SLAs.
- We don't have an automated process for escalating alerts.
When too many people are notified at the same time, one or both of the following may start to happen:
- There is confusion about who should actually be working on and resolving the alert.
- Alert fatigue kicks in when teams are constantly bombarded with notifications about problems that other users can resolve.
Feature: Escalation policies
With PagerDuty’s escalation policies, you can control exactly who is notified, and how many people are notified, when an incident is triggered. This allows you to bring in your key resources to work on an incident right away without having to bother others working on important projects.
When you target your notifications to the right people who will be able to get the job done, you drive down your resolution times by spending less time firefighting and figuring out who owns an alert.
Feature: On-call schedules
When too many people are being notified at the same time, you may need to break out who gets notified and when based on the time of the day and day of the week. PagerDuty schedules allow you to do just that by creating customized schedules based on varying rotation types.
Being able to control what hours of the day your team is notified can allow you to distribute work across teams in multiple timezones, giving each regional team a break from getting notified outside of business hours.
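As a rough illustration of distributing on-call hours across regions, the sketch below picks a regional team from the current UTC hour. The region names, hour ranges, and the `on_call_region` function are hypothetical and are not PagerDuty objects or API calls; in practice this logic is configured through schedules rather than written as code.

```python
# Hypothetical follow-the-sun layout: each region covers a block of UTC hours.
# Names and hour ranges are assumptions made for this example only.
REGIONS = [
    ("apac", 0, 8),    # covers 00:00-07:59 UTC
    ("emea", 8, 16),   # covers 08:00-15:59 UTC
    ("amer", 16, 24),  # covers 16:00-23:59 UTC
]


def on_call_region(utc_hour: int) -> str:
    """Return the region whose coverage window contains the given UTC hour."""
    for name, start, end in REGIONS:
        if start <= utc_hour < end:
            return name
    raise ValueError(f"hour out of range: {utc_hour}")
```

With this layout, an incident triggered at 09:00 UTC would go to the `emea` team, and no region is notified outside its own window.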
When people with the wrong skill set are involved in an incident, keeping your incident resolution time low becomes a struggle as teams try to figure out who should investigate the issue and how to get in contact with them.
Feature: Services and escalation policies
If you are taking too long to get the right people to look at an issue, then you may be targeting your alerts to the wrong team or person. With PagerDuty, you can route your alerts to the right group of people immediately after an incident is triggered.
When incidents are triggered on a service, the incident is immediately assigned to the person(s) on-call in the escalation policy that is tied to that service. By directing your incidents to the right service, you can target exactly which escalation policy or team should be responsible for each alert so that the right people are notified about the problems that they can immediately solve.
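The service-to-policy relationship described above can be pictured as two lookups. This is a minimal sketch with made-up service names, policy names, and responders; it models the idea only and does not call or mirror the PagerDuty API.

```python
# Hypothetical data: which escalation policy each service is tied to,
# and who is on call at each level of that policy (level 1 first).
SERVICE_TO_POLICY = {
    "checkout-api": "payments-escalation",
    "search": "search-escalation",
}

ON_CALL_BY_POLICY = {
    "payments-escalation": ["alice", "bob"],
    "search-escalation": ["carol", "dave"],
}


def first_responder(service: str) -> str:
    """Return the level-1 on-call person for the policy tied to the service."""
    policy = SERVICE_TO_POLICY[service]
    return ON_CALL_BY_POLICY[policy][0]
```

Routing an alert to the right service is what selects the right people: `first_responder("search")` resolves to `carol` without anyone deciding manually who owns the alert.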
Feature: Reassign incidents
If you begin work on an incident and realize that the incident needs to be resolved by a different team or subject matter expert, you can reassign the incident to a different level of your escalation policy, a different escalation policy, or a specific user. When incidents are reassigned, PagerDuty triggers notifications to the person(s) newly assigned to the incident based on the notification rules configured in their user profiles. If an incident is reassigned to an escalation policy, PagerDuty will automatically follow that escalation policy’s rules, eliminating any manual steps you previously needed to get the right person from the right team on an incident.
With this feature, you won’t need to chase down the right person to review an incident, as PagerDuty will be able to automate the notification and escalation process for you, ultimately giving you more time to focus on the parts of your infrastructure that you can maintain.
If you have internal or external SLAs, then you have a commitment to respond to (and resolve) problems within a certain amount of time. Meeting this SLA may mean keeping your customers happy or preventing a small issue from compounding into a larger one by holding your team accountable for supporting the systems and microservices that they own.
Feature: Escalation policies
Escalation policies are designed to automatically escalate incidents to the next person on-call when the primary on-call person does not respond within the escalation timeout period. By customizing your escalation timeout periods between levels, you can determine how fast an incident should automatically escalate in order to:
- Emphasize the importance of an incident to your team. For example, if your escalation timeout period is set to 5 minutes, this creates a sense of urgency for the primary on-call person to respond to the problem before it escalates to the next level (which could be a backup on-call or a team manager).
- Bring in reinforcements when the primary on-call is not able to respond right away. The primary on-call may be driving or working on a separate problem and is not able to immediately respond. Instead of letting the alert sit in queue, PagerDuty can automatically escalate it to the next available on-call responder.
When you can automate and control your escalation process, you drive down the time it takes for teams to begin acknowledging and working on a problem, helping your team meet business-critical SLAs.
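To make the effect of escalation timeouts concrete, here is a small sketch that works out which level holds an unacknowledged incident after a given number of minutes. The `current_level` function and the example policy are hypothetical, and the sketch assumes the incident simply stays with the final level once it is reached.

```python
from typing import List, Tuple


def current_level(minutes_unacknowledged: float,
                  levels: List[Tuple[str, int]]) -> str:
    """Return who holds the incident after it has gone unacknowledged.

    levels is an ordered list of (responder, timeout_minutes), where the
    timeout is how long that level has before the incident escalates.
    Assumption for this sketch: the incident stays with the last level.
    """
    if not levels:
        raise ValueError("an escalation policy needs at least one level")
    elapsed = 0
    for responder, timeout in levels[:-1]:
        elapsed += timeout
        if minutes_unacknowledged < elapsed:
            return responder
    return levels[-1][0]


# Example policy: 5 minutes for the primary, then 10 for the backup.
policy = [("primary", 5), ("backup", 10), ("manager", 0)]
```

With this policy, the primary holds the incident for the first 5 minutes, the backup from minute 5 to minute 15, and the manager after that. Shortening the first timeout brings in reinforcements sooner, at the cost of paging the backup more often.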
Feature: Notification rules
Meeting SLAs can be problematic when the people who should be working on an issue are not notified via the appropriate channels. PagerDuty allows users to create and customize their own notification rules, which determine exactly when a notification should be sent to them through a contact method of their choice - phone call, SMS, email, or push notification. Each user can configure minute-by-minute notifications to ensure that none of their assigned incidents slip through the cracks, reducing the time it takes for them to acknowledge and resolve an incident without breaching an SLA.
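One way to think about minute-by-minute notification rules is as a list of (delay, contact method) pairs. The sketch below is illustrative only: the `Rule` type, the `notifications_due` function, and the example delays are assumptions, not PagerDuty settings or API calls.

```python
from typing import List, Tuple

# A rule is (minutes after the incident is assigned, contact method).
Rule = Tuple[int, str]


def notifications_due(rules: List[Rule],
                      minutes_since_assignment: int) -> List[str]:
    """Return the contact methods that should have fired by now, in order."""
    due = [rule for rule in rules if rule[0] <= minutes_since_assignment]
    return [method for _, method in sorted(due)]


# Example: push immediately, SMS after 2 minutes, phone call after 5.
my_rules = [(0, "push"), (2, "sms"), (5, "phone")]
```

Three minutes after assignment, `notifications_due(my_rules, 3)` lists the push notification and the SMS; the phone call follows only if the incident is still unacknowledged at minute 5.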
Escalation policies and the ability to reassign incidents to different policies or users are designed to create an automated process for escalating alerts. However, there is one more feature that can serve as a safety net when an incident takes too long to resolve.
Feature: Incident acknowledgement timeout
Incident acknowledgement timeouts determine when an incident should re-notify the person(s) on-call (plus whoever acknowledged the incident) if the incident stays acknowledged for too long. When an incident stays acknowledged for too long, it may indicate that the responder has forgotten to work on the incident (e.g. somebody acknowledged it with their eyes closed while in bed) or that they are taking too long to resolve it.
When an incident’s acknowledgement has timed out, the person(s) on-call is re-notified, and the incident will continue to escalate if it is not acknowledged within the escalation timeout period.
This automated re-notification and re-escalation process can help ensure that incidents do not fall through the cracks when they need to be escalated.
The incident acknowledgement timeout is a service-level setting. The default is 30 minutes; however, we recommend configuring your acknowledgement timeout period to give the responder enough time to resolve an incident after it has been acknowledged. For example, if incidents on your service generally take 45 minutes to resolve, then set your ack timeout setting to at least 50 minutes. Note that users can snooze an incident to delay the incident acknowledgement timeout period for incidents that take a little longer to resolve.
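The acknowledgement timeout check can be sketched as a single comparison. The `status_after_ack` function and its return labels are hypothetical and exist only for this example; the 30-minute default mirrors the default mentioned above.

```python
def status_after_ack(minutes_since_ack: float,
                     ack_timeout_minutes: float = 30) -> str:
    """Say whether an acknowledged incident should re-notify its responders.

    Returns "re-notify" once the acknowledgement has timed out, otherwise
    "acknowledged". Both labels are made up for this sketch.
    """
    if minutes_since_ack >= ack_timeout_minutes:
        return "re-notify"
    return "acknowledged"
```

For a service whose incidents usually take 45 minutes to resolve, `status_after_ack(45)` with the 30-minute default would already re-notify, while `status_after_ack(45, 50)` leaves the responder undisturbed, which is why the timeout should be sized to the typical resolution time.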