Proactively Finding Problems Before Customers Notice

David Peček
Feb 13, 2020
4 min read

Updated: Sep 13, 2020

Sounds simple, right? We should know about problems with our product before our customers do. However, if you are working with a legacy application stack, or your organization does not include operations as a development stakeholder, applications may not have been engineered to report issues or self-heal.

Discover the health metrics you need from your applications and implement monitors around them to proactively alert support teams of potential issues.

Areas to Look

There are many facets to look at when considering the health of an application. These are some of the more common areas where I have seen issues.

Synthetic monitors. A great place to start, these applications pretend to be a customer by logging into your system and creating transactions. When these alerts trigger, it's usually an incident, as real customers won't have any more luck than your monitor.
Application response times. Here you are looking for your applications taking X% longer to load. A common symptom ahead of a crash is average load times running anywhere from 25-100% longer over a span of 1-10 minutes.
Exceptions being logged. Through log aggregators you are able to tell when applications are starting to throw a significant number of exceptions within a given timeframe. This is a key indicator your application is not healthy. One thing to note here is that not all exceptions mean bad things, so while alerts will come from this system, they do not always mean you have an imminent crash.
Memory and disk usage. A classic indicator of problems: when applications start running up against the limit of their configured heap space or have no more disk space, you will start to see degraded performance. Once you have tuned your applications for the correct amounts of disk and memory, any alerts about these issues should be taken seriously, as the application is likely to become unhealthy soon.
500 level errors being thrown. If any of your gateways are throwing errors in the 500 range, something is unhealthy in the application itself or in the network. This should trigger an immediate notification and action for an operations team.
Broken pipes. These are not always a problem, as customers can press stop on their browsers or refresh, and APIs will see disconnects. Still, it's worth noting that many broken pipes can mean customers cannot get to your applications or the applications are not responding.
Applications crashed. Has an application crashed and is no longer running on a host? Is it unable to restart itself? You should be alerted to this so you can go in and manually fix it or create a new node.
Data corruption scenarios. From the known healthy state of the data hierarchy in your system, where do parts of that data not fit into the expected structure or layout?
Data consistency issues. Many applications have data flowing through systems and therefore databases. Where is data missing or inconsistent between these systems?
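
As a rough illustration of the response time check described above, here is a minimal Python sketch. It assumes you already collect load times in milliseconds from somewhere; the function name and the 25% default threshold are illustrative choices taken from the low end of the range mentioned, not something a particular tool prescribes.

```python
from statistics import mean


def response_time_regression(baseline_ms, recent_ms, threshold_pct=25.0):
    """Compare the recent average load time against a baseline average.

    baseline_ms: load times (ms) from a known healthy period.
    recent_ms:   load times (ms) from the last few minutes.
    Returns (should_alert, pct_change).
    """
    if not baseline_ms or not recent_ms:
        raise ValueError("both sample lists must be non-empty")

    baseline_avg = mean(baseline_ms)
    if baseline_avg <= 0:
        raise ValueError("baseline average must be positive")

    recent_avg = mean(recent_ms)
    # Percentage by which the recent average exceeds the baseline.
    pct_change = (recent_avg - baseline_avg) / baseline_avg * 100.0
    return pct_change >= threshold_pct, pct_change


if __name__ == "__main__":
    alert, change = response_time_regression([200, 210, 190], [320, 300, 310])
    print(f"alert={alert} change={change:.1f}%")
```

In practice the window length and threshold would need tuning per application, since normal traffic patterns can swing averages on their own.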

How to Monitor

Next up, let's go over each of these types of issues and what kind of proactive monitoring you can implement to make sure you are alerted to them.

Synthetic monitors. Popular vendors for application monitoring are Pingdom and Datadog. From a simple web interface, they allow you to easily set up a monitor that logs in and performs a transaction in the system as though it were a customer.
Application response times / memory and disk usage / broken pipes. There are many commercial products on the market now, like Dynatrace or SolarWinds, that watch for variations from standard response times.
Exceptions being logged. Commercially available log aggregators such as Sentry, Raygun or Rollbar can alert you to these kinds of clustered exception events.
500 level errors being thrown. If you want the most accurate errors from your gateway, going to the source where it is hosted can be your best bet. In that case, depending on your platform, AWS CloudWatch or Google Cloud Monitoring can tell you when customers are getting these errors.
Data corruption scenarios and data consistency issues. For this type of monitoring you will need to use a data visualization tool like Redash or some Python code in an AWS Lambda function to look for scenarios in the database and then proactively alert via SNS, webhooks, or email.
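
To make the data consistency idea more concrete, here is a small Python sketch of the kind of check such a function might run. Everything specific in it is an assumption for illustration: the `orders` and `invoices` tables, the rule that every order should have an invoice, and the in-memory SQLite database standing in for your real one. The `notify` argument is any callable that sends a message, so in a real setup it could wrap an SNS publish, a webhook call, or an email.

```python
import sqlite3

# Hypothetical rule: every row in `orders` should have a matching invoice.
MISSING_INVOICE_SQL = """
SELECT o.id
FROM orders AS o
LEFT JOIN invoices AS i ON i.order_id = o.id
WHERE i.order_id IS NULL
ORDER BY o.id
"""


def find_orders_missing_invoices(conn):
    """Return the ids of orders that have no invoice row."""
    return [row[0] for row in conn.execute(MISSING_INVOICE_SQL)]


def check_and_alert(conn, notify):
    """Run the check and call notify(message) only when something is wrong."""
    missing = find_orders_missing_invoices(conn)
    if missing:
        notify(f"{len(missing)} order(s) have no invoice: {missing}")
    return missing


def build_demo_db():
    """Create a throwaway in-memory database with one inconsistent order."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders (id INTEGER PRIMARY KEY);
        CREATE TABLE invoices (id INTEGER PRIMARY KEY, order_id INTEGER);
        INSERT INTO orders (id) VALUES (1), (2), (3);
        INSERT INTO invoices (order_id) VALUES (1), (3);
        """
    )
    return conn


if __name__ == "__main__":
    # print stands in for the real alert channel here.
    check_and_alert(build_demo_db(), print)
```

Keeping the query separate from the notification step makes it easier to add more checks later without touching how alerts are delivered.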

How to Fix

Ok great! Now you are getting immediate alerts on these issues. What is the fastest way you can now overcome them before that first chat, email or call from a customer?

Application response times / memory and disk usage / broken pipes. Depending on the platform your applications are running on, remove the unhealthy instance from the load balancer, in an automated or manual fashion, and create a new one to replace it. Removing the instance rather than destroying it gives you the ability to debug further what went wrong, so you can learn and prevent this instability in the future.
Exceptions being logged. If you have engineers on your team who are knowledgeable about the code base, they can probably figure out the issue quickly. If not, they can escalate to development for an answer on the root cause.
500 level errors being thrown. In this case you probably want to try to reproduce the error yourself and then, based on the specific code, either review the networking situation with IT or restart the applications behind the gateway.
Data corruption scenarios and data consistency issues. The database monitors you set up should alert into an operations monitoring system, which then routes each alert to the correct team to address it.
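
The routing step can start out very simple. The sketch below is one hypothetical way to express it in Python: the alert kinds and team names are made up for illustration and would be replaced by whatever your monitoring system and organization actually use.

```python
# Hypothetical mapping from alert kind to the team that should respond first.
ROUTES = {
    "response_time": "operations",
    "memory_disk": "operations",
    "broken_pipes": "operations",
    "exceptions": "development",
    "http_5xx": "operations",
    "data_consistency": "data",
}

# Unknown alert kinds still need an owner, so fall back to a default team.
DEFAULT_TEAM = "operations"


def route_alert(alert):
    """Return the team name for an alert dict such as {"kind": "exceptions"}."""
    return ROUTES.get(alert.get("kind"), DEFAULT_TEAM)


if __name__ == "__main__":
    print(route_alert({"kind": "data_consistency", "detail": "order 2 has no invoice"}))
    print(route_alert({"kind": "something_new"}))
```

A table like this is easy to adjust as you learn which team actually resolves each kind of alert fastest.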

As these alerts come up, look at the process of how people address them. Make incremental improvements each time to ensure the right people get alerted so you can have the fastest recovery from these issues. And don't forget: each time something like this comes up, always be thinking about how you can solve the root cause of the issue so you don't have to deal with it again.
