Remove Humans from Your Software

David Peček
Dec 10, 2018

Updated: Sep 6, 2020

Software at larger organizations can be laborious and complex to keep running. Systems often stem from older designs and technologies with inherent flaws, both in how they were designed and in the frameworks they were built on. Over the years they accumulate quirks that operations staff have had to work around and maintain, because there was never a clear-cut, easy path to fixing the systemic issues.

It is vital to watch for the first signs that a software application needs manual intervention, then rigorously track and remove them. Otherwise the software will come to depend on humans for part of its functionality.

Notice Early Warning Signs

Here are the most common things I have seen lead to software being run in part by humans. Take note and quantify the impact each time one of these comes up. Don't let them get away from you: they tend to be slippery, work their way in, and quickly become part of the daily routine.

An application requires a restart at set intervals each day to remain stable.
Monitoring programs watch for issues and regularly alert the NOC to corrective actions that must be taken.
Data cleanup scripts must be run each day to clean up corruption caused by an application.
Applications or infrastructure must be watched at all times for usage spikes, then killed / restarted to avoid crashes.
Emailed reports or status web pages are reviewed regularly for irregularities.
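
One way to quantify the impact is to keep a simple log of every manual intervention and total it up per application. The sketch below is only an illustration: the Intervention record, its fields, and the summarize helper are hypothetical names, not part of any existing tool.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Intervention:
    """One manual action taken to keep an application running (hypothetical record)."""
    day: date
    application: str
    kind: str            # e.g. "restart" or "data-cleanup"
    minutes_spent: int


def summarize(interventions):
    """Return {(application, kind): (count, total_minutes)} for the given log."""
    totals = defaultdict(lambda: [0, 0])
    for item in interventions:
        entry = totals[(item.application, item.kind)]
        entry[0] += 1                    # how often a human had to step in
        entry[1] += item.minutes_spent   # how much time it cost
    return {key: tuple(value) for key, value in totals.items()}


if __name__ == "__main__":
    # Made-up sample entries, for illustration only.
    log = [
        Intervention(date(2020, 9, 1), "billing", "restart", 10),
        Intervention(date(2020, 9, 2), "billing", "restart", 15),
        Intervention(date(2020, 9, 2), "billing", "data-cleanup", 30),
    ]
    for (application, kind), (count, minutes) in summarize(log).items():
        print(f"{application} / {kind}: {count} interventions, {minutes} minutes")
```

Even a tally this small makes it harder for a recurring manual step to slip quietly into the daily routine.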

How Not to Solve

I liken bad fixes for these problems to "scotch tape": they seem to hold things together for a little while so you don't have to worry. With time, however, the tape fails, and usually pretty dramatically. Patches / workarounds for these problems can add more complexity and fragility to existing software. If you use them, you might find yourself in a worse place than when you started.

Automatic data cleanup scripts put in place to work around known data corruption scenarios.
Scripted automatic restarts of crashing applications that do not run inside containers designed to handle crashes / restarts.
Having the NOC watch for crashes and then restart applications.
Using people to read reports and work out from them whether there is corrupted data or other issues.

De-Humanizing Solutions

Here are some of the approaches I have used successfully in the past to remove problems like these. Note that these are longer-term solutions which might require more effort, but in the long run they will reduce the entire company's support costs.

Set up automated monitoring systems for data corruption scenarios which alert Tier 3 teams to issues. These monitors look for specific patterns of corruption which have been problems in the past. Plot trends for each data corruption scenario to see if the problem is worsening. Work with product teams to assess the impact and get the fixes prioritized.
Come up with crash analysis debug processes. Each time a particular type of application crashes, have the team follow the debug instructions to dive into why it happened. See if you can find a trend among the crashes and report it to development to be fixed.
Learn from every issue you encounter, and set up monitoring for that issue once it is resolved. This way you can be proactive about issues you have seen in the past. This is key to maintaining trust with your customers, who want to know you are doing something to stay on top of complaints they have raised before. Count how many times the issue comes up and provide that feedback to your product teams so they can prioritize a fix.
Break apart applications and / or databases into microservices. The smaller chunks are easier to debug, and it is easier to understand where the issues lie.
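
The trend tracking described in the first item can start very small. The following is a minimal sketch, assuming you already collect a daily count of detected occurrences for each corruption scenario. The scenario names, the trend_slope and scenarios_to_escalate helpers, and the threshold are hypothetical and only for illustration. It fits a least-squares line through the counts and flags scenarios whose counts are rising.

```python
from statistics import mean


def trend_slope(daily_counts):
    """Least-squares slope of the counts over the day index.

    A positive slope suggests the scenario is getting worse.
    """
    n = len(daily_counts)
    if n < 2:
        return 0.0  # not enough data to speak of a trend
    xs = list(range(n))
    x_mean = mean(xs)
    y_mean = mean(daily_counts)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_counts))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    return numerator / denominator


def scenarios_to_escalate(history, threshold=0.0):
    """Return the names of scenarios whose slope exceeds the threshold, sorted."""
    return sorted(
        name for name, counts in history.items() if trend_slope(counts) > threshold
    )


if __name__ == "__main__":
    # Hypothetical daily counts per corruption scenario (made-up numbers).
    history = {
        "orphaned_rows": [1, 2, 3, 4],
        "duplicate_keys": [4, 3, 2, 1],
    }
    for name in scenarios_to_escalate(history):
        print(f"Escalate to Tier 3 / product team: {name}")
```

The slope is a deliberately crude signal. Its purpose is to give product teams a number to prioritize against, not to replace a proper look at the plotted trend.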
