A Heuristic Approach to IT Failures

The net entropy (degree of disorder) of a closed system tends to increase over time. The same applies to IT systems and tech platforms. At some point, this leads to a failure or breakdown.

A top-quartile tech firm would probably have well-engineered observability components, so the system itself tells you exactly what the problem is and where. If you are not so lucky (and most teams are not), you jump into a temporary space called a War Room or Hotline to diagnose the issue.

Looking at past examples, it is possible to use a heuristic approach to make the search for the root cause more methodical. Visualize the system as a glass box with a machine in it.

The Four Heuristics

1. What was added to the core/machine

When we add a new component (irrespective of its size), disorder is bound to increase. This should almost always be the first thing checked before moving to the other categories. Deploying a piece of code is the most common example.
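As a minimal sketch of this check, assuming you can export a list of recent changes with timestamps, the snippet below lists everything that landed shortly before the incident, newest first. The change records, the 24-hour window, and the function name are all hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

def changes_before(changes, incident_start, window=timedelta(hours=24)):
    """Return changes that landed within `window` before the incident, newest first."""
    earliest = incident_start - window
    suspects = [c for c in changes if earliest <= c["at"] <= incident_start]
    return sorted(suspects, key=lambda c: c["at"], reverse=True)

# Hypothetical change log: deployments, flags, schema changes.
changes = [
    {"name": "billing-service v2.3 deploy", "at": datetime(2024, 5, 1, 9, 0)},
    {"name": "feature flag: new-checkout", "at": datetime(2024, 5, 2, 13, 40)},
    {"name": "db index added", "at": datetime(2024, 5, 2, 11, 15)},
]
incident_start = datetime(2024, 5, 2, 14, 0)

suspects = changes_before(changes, incident_start)
for change in suspects:
    print(change["at"], change["name"])
```

The most recent change is listed first because it is usually the cheapest one to rule in or out.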

2. Compare good vs bad

When you cannot pinpoint what changed, you can compare the 'good' (situations where the system was/is working) against the 'bad' (when the system isn't working as expected). There are two ways of defining good and bad:

On a time scale, i.e., before and after a point in time. This also lets you pin down when something went wrong.
On any other parameter, i.e., compare different populations. Typical examples are comparing different users, workflows, or locations.
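Both comparisons come down to grouping the same events by a different key and looking at the failure rate per group. The sketch below assumes each event records whether it succeeded; the field names (hour, location, ok) and the sample data are invented for illustration.

```python
from collections import defaultdict

def failure_rate_by(events, key):
    """Group events by `key` and return the share of failed events per group."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for event in events:
        totals[event[key]] += 1
        if not event["ok"]:
            failures[event[key]] += 1
    return {group: failures[group] / totals[group] for group in totals}

# Hypothetical request log.
events = [
    {"hour": 13, "location": "eu", "ok": True},
    {"hour": 13, "location": "us", "ok": True},
    {"hour": 14, "location": "eu", "ok": False},
    {"hour": 14, "location": "eu", "ok": False},
    {"hour": 14, "location": "us", "ok": True},
    {"hour": 14, "location": "us", "ok": True},
]

by_hour = failure_rate_by(events, "hour")          # time scale: before vs after
by_location = failure_rate_by(events, "location")  # population: eu vs us
print(by_hour)
print(by_location)
```

In this made-up data, failures start at hour 14 and affect only the "eu" population, which narrows the search on both axes at once.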

3. Failure mode analysis / first principles

This is a linear way of searching for an issue. In this method, you check every action and its expected reaction. Say button A is supposed to trigger 2 actions simultaneously; did that happen? If yes, then move to the next step.

Step forward: start from the first possible step and work towards the last action the system was supposed to perform.
Step backward: start from the final expected outcome and work back one step at a time until you reach the failure point.

Choose whichever approach is likely to get you to the root cause faster. In most scenarios, stepping backward is quicker.
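The two directions can be sketched as follows, assuming the workflow is a known ordered list of steps and you can tell for each step whether it produced its expected result. The step names are hypothetical. The backward walk also assumes that a failure cascades, so that every step after the failure point fails too.

```python
def find_failure_forward(steps, succeeded):
    """Walk from the first step; return the first step that did not succeed."""
    for step in steps:
        if not succeeded[step]:
            return step
    return None

def find_failure_backward(steps, succeeded):
    """Walk from the final outcome; return the earliest step in the failing tail."""
    failure_point = None
    for step in reversed(steps):
        if succeeded[step]:
            break  # everything before this step is assumed to have worked
        failure_point = step
    return failure_point

# Hypothetical workflow and observations.
steps = ["button_clicked", "request_sent", "order_saved", "email_queued", "email_delivered"]
succeeded = {
    "button_clicked": True,
    "request_sent": True,
    "order_saved": True,
    "email_queued": False,
    "email_delivered": False,
}

print(find_failure_forward(steps, succeeded))
print(find_failure_backward(steps, succeeded))
```

Both walks land on the same step here, but the backward walk inspects three steps while the forward walk inspects four. The closer the failure sits to the final outcome, the larger that saving becomes.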

4. Change in operating environment

Heuristics 1 to 3 were all about the machine within the box. Now we focus on the relationship between the machine and the box.

Constraints or limits: If the machine becomes too hot and the box is not designed to accommodate that, the system will fail. Check if certain constraints or limits are being hit at the environment level.
Instability of the environment: There might be an issue with the box itself (like a crack). If you have not found any issue in heuristics 1-3, it's a good idea to check this.
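For the constraints check, one simple approach is to compare current usage against each known limit and flag anything close to the ceiling. The sketch below assumes you can collect (used, limit) pairs from the environment; the metric names, values, and the 90% threshold are hypothetical.

```python
def limits_near_exhaustion(metrics, threshold=0.9):
    """Return the names of metrics whose usage is at or above `threshold` of the limit."""
    return sorted(
        name
        for name, (used, limit) in metrics.items()
        if limit > 0 and used / limit >= threshold
    )

# Hypothetical environment-level readings: (used, limit).
metrics = {
    "disk_gb": (470, 500),
    "db_connections": (100, 100),
    "memory_gb": (20, 64),
    "api_rate_per_min": (450, 1000),
}

flagged = limits_near_exhaustion(metrics)
print(flagged)
```

Anything flagged is a candidate for the "box" being too small for the machine, even when nothing inside the machine has changed.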