
Reducing Operational Alert Fatigue

  • Writer: David Peček
  • Jun 10, 2019
  • 3 min read

Updated: Sep 11, 2020


If you have ever worked in operations, you know how many sources compete to alert you about problems with the applications and infrastructure components you monitor. Alerts arrive through email, instant messages, and people contacting you directly about production issues. At first you try to stay on top of them, but over time you go numb because there are simply too many to deal with. What can be done to solve this conundrum?

Make a concerted effort to consolidate all of your alerting into one platform that can ingest the various types of alerts you should be paying attention to and schedule who sees them when.
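
One way to picture that consolidation is a thin ingestion layer that maps every source onto a common alert shape. The sketch below is a minimal illustration in Python; the field names and payload shapes are assumptions, not a reference to any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical normalized alert record; the field names are illustrative only.
@dataclass
class Alert:
    source: str        # e.g. "email", "chat", "monitoring"
    service: str       # which application or component raised it
    severity: str      # "critical", "warning", "info"
    message: str
    received_at: datetime

def from_monitoring_webhook(payload: dict) -> Alert:
    """Map a (hypothetical) monitoring webhook payload onto the common shape."""
    return Alert(
        source="monitoring",
        service=payload.get("service", "unknown"),
        severity=payload.get("severity", "warning"),
        message=payload.get("summary", ""),
        received_at=datetime.now(timezone.utc),
    )

def from_email(subject: str, body: str) -> Alert:
    """Very rough email mapping: treat anything mentioning 'down' as critical."""
    severity = "critical" if "down" in subject.lower() else "info"
    return Alert("email", "unknown", severity, f"{subject}: {body}",
                 datetime.now(timezone.utc))
```

Once everything lands in one shape, routing and scheduling decisions can be made in a single place instead of in each person's inbox.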

Quantify What Matters

A good first step is to categorize all of the sources that are alerting you, so you know which problems you are trying to solve from the data collected throughout the organization.


System downs. Likely the most important notification you can get, and it should be prioritized over all other alerts. Which systems are critical, and which alerts should you pay attention to first? Is there a secondary set of systems that are not as critical but should still be looked at quickly?


Performance issues. These are usually a good indicator of a future system crash. Whether or not your system auto-scales, if it is slowing down there is a bottleneck your applications cannot overcome on their own, and it will likely need manual intervention.
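
As a rough illustration of when a slowdown is worth alerting on, the sketch below compares recent response times against a per-service baseline; the baseline value and multiplier are hypothetical knobs you would tune for your own systems.

```python
from statistics import median

def degraded(latencies_ms: list[float], baseline_ms: float, factor: float = 2.0) -> bool:
    """Flag a sustained slowdown: recent median latency is well above baseline."""
    if not latencies_ms:
        return False
    return median(latencies_ms) > factor * baseline_ms

# Example: the last few minutes of response times vs. a 120 ms baseline.
recent = [180.0, 260.0, 310.0, 295.0, 240.0]
print(degraded(recent, baseline_ms=120.0))  # True -> raise a performance alert
```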


Repeating errors / exceptions. These are issues occurring in significant volume. At what threshold should a recurring exception from an application trigger an alert?
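
One way to express such a threshold is a sliding window: only alert when the same exception type repeats some number of times within a few minutes. The sketch below is a minimal Python illustration; the limit and window are placeholder values.

```python
from collections import deque
from time import time

class ExceptionThreshold:
    """Alert only when the same exception type repeats N times within a window."""

    def __init__(self, limit: int = 50, window_seconds: int = 300):
        self.limit = limit            # illustrative default: 50 occurrences...
        self.window = window_seconds  # ...within 5 minutes
        self.seen: dict[str, deque] = {}

    def record(self, exception_type: str) -> bool:
        """Return True when this occurrence pushes the type over the threshold."""
        now = time()
        times = self.seen.setdefault(exception_type, deque())
        times.append(now)
        # Drop occurrences that have aged out of the window.
        while times and now - times[0] > self.window:
            times.popleft()
        return len(times) >= self.limit
```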


Data inconsistencies or corruption. You likely have monitoring tools in place that already look for problems with your data. Can they report into a tool or publish a message?


From the list you have generated, which of these needs immediate attention? Ensure those go out to the on-call team right away, and let the rest go into a queue that is reviewed later.
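
A small routing rule can capture that split. The sketch below is illustrative only; the category names and the decision of what pages on-call versus what sits in the queue are assumptions you would adapt to your own list.

```python
# Hypothetical routing rules based on the categories above; names are illustrative.
IMMEDIATE_CATEGORIES = {"system_down", "performance"}

def route(category: str, severity: str) -> str:
    """Send critical, time-sensitive alerts to on-call; queue the rest for review."""
    if category in IMMEDIATE_CATEGORIES and severity == "critical":
        return "page_on_call"
    return "review_queue"

print(route("system_down", "critical"))        # page_on_call
print(route("repeating_exception", "warning"))  # review_queue
```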


Scheduling Alerts

Something that will make your ops users happiest is knowing they are not always expected to be on call. If you are coming from a small or mid-size company, the operational departments have likely grown from a small mom-and-pop style organization into their current form, and they have probably been solely responsible for holding things together since the beginning. Some concepts to consider when figuring out how to schedule people based on their skills and roles:


NOC engineers: usually the first line of defense, these users should have runbooks ready to go that guide them through correcting any alerts that come up. They should mainly receive alerts about system and application outages, as that is their specialty.


DevOps / SREs: your second line of defense. This team should get escalations from the NOC, such as outages the NOC cannot correct, as well as software exceptions and data problems. Off hours, this team should only be alerted for system outages; exceptions and corruption can wait in their queue for when they next log in.


Managers: the last tier of notifications. Manager alerts should come from DevOps, as a last-resort measure, when they are unable to solve the issue.
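
Put together, the tiers above form an escalation chain. The sketch below is a hypothetical illustration of that chain in Python; the timeouts and tier names are assumptions, not a prescription.

```python
from dataclasses import dataclass

# Illustrative escalation chain; tiers and delays are assumptions to tune per team.
@dataclass
class Tier:
    name: str
    notify_after_minutes: int  # how long an alert can sit unacknowledged before this tier is paged

ESCALATION = [
    Tier("NOC engineer on call", 0),   # first line: paged immediately
    Tier("DevOps / SRE on call", 15),  # escalate if unacknowledged after 15 minutes
    Tier("Engineering manager", 30),   # last resort
]

def who_to_notify(minutes_unacknowledged: int) -> list[str]:
    """Everyone whose escalation delay has elapsed for an unacknowledged alert."""
    return [t.name for t in ESCALATION if minutes_unacknowledged >= t.notify_after_minutes]

print(who_to_notify(20))  # ['NOC engineer on call', 'DevOps / SRE on call']
```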


Categorizing escalations between teams to spread the load of notifications and on-call schedules should help eliminate operational alert fatigue. You will hopefully see improved response times as well, since alerts will carry more meaning and arrive less often.



