The net entropy (degree of disorder) of a closed system tends to increase over time. The same applies to IT systems and tech platforms. At some point, this accumulated disorder leads to a failure or breakdown.

A top-quartile tech firm would probably have well-engineered observability components, so the system itself tells you exactly what and where the problem is. If you are not so lucky (and most teams are not), you jump into a temporary space called a War Room or Hotline to diagnose the issue.

Looking at past examples, it is possible to use a heuristic approach to make the search for the root cause more methodical. Visualize the system as a glass box with a machine in it.


The Four Heuristics

1. What was added to the core/machine

When we add a new component (irrespective of its size), there is bound to be increased disorder. This is almost always the first thing that should be checked before moving to the other categories. Deployment of a piece of code is the most common example.
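As a rough illustration of this first check, the sketch below lists everything that was added shortly before the incident started, newest first. The `Change` record, the 24-hour lookback window, and the sample change log are all assumptions made for the example; a real team would pull this from its own deployment or change-management history.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    # Hypothetical change-log record; the field names are illustrative only.
    component: str
    description: str
    applied_at: datetime


def changes_before_incident(changes, incident_start, lookback=timedelta(hours=24)):
    """Return changes applied within `lookback` before the incident, newest first."""
    window_start = incident_start - lookback
    recent = [c for c in changes if window_start <= c.applied_at <= incident_start]
    return sorted(recent, key=lambda c: c.applied_at, reverse=True)


# Made-up change log and incident time, purely for illustration.
incident = datetime(2024, 1, 10, 14, 0)
log = [
    Change("checkout-service", "deploy new build", datetime(2024, 1, 10, 13, 40)),
    Change("database", "add index on orders", datetime(2024, 1, 9, 18, 0)),
    Change("cdn", "rotate certificate", datetime(2024, 1, 2, 9, 0)),
]

suspects = changes_before_incident(log, incident)
for change in suspects:
    print(change.applied_at, change.component, change.description)
```

The most recent addition sits at the top of the list, which matches the order in which you would normally want to rule suspects out.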

2. Compare good vs bad

When you cannot pinpoint what changed, you can compare the 'good' (situations where the system was/is working) against the 'bad' (when the system isn't working as expected). There are two ways of defining good and bad:
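However good and bad are defined, the comparison itself can be mechanical. The sketch below assumes you can capture a snapshot of each state as a flat dictionary (configuration values, versions, flags, and so on) and reports only the keys that differ. The snapshot keys and values are invented for the example.

```python
ABSENT = "<absent>"  # marker for a key that exists on only one side


def diff_snapshots(good, bad):
    """Return {key: (good_value, bad_value)} for every key whose value differs."""
    differences = {}
    for key in sorted(good.keys() | bad.keys()):
        good_value = good.get(key, ABSENT)
        bad_value = bad.get(key, ABSENT)
        if good_value != bad_value:
            differences[key] = (good_value, bad_value)
    return differences


# Hypothetical snapshots of a working and a failing instance.
good = {"app_version": "2.3.0", "pool_size": 20, "region": "eu-west"}
bad = {"app_version": "2.3.0", "pool_size": 5, "region": "eu-west", "feature_flag_x": True}

delta = diff_snapshots(good, bad)
for key, (good_value, bad_value) in delta.items():
    print(f"{key}: good={good_value!r} bad={bad_value!r}")
```

Everything that is identical on both sides drops out, so the remaining keys form a short list of candidates to investigate.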

3. Failure mode analysis / first principles

This is a linear way of searching for an issue. In this method, you check every action and its expected reaction. Say button A is supposed to trigger 2 actions simultaneously; did that happen? If yes, then move to the next step.

You can walk the chain forward from the first action or backward from the observed failure. The choice depends on whichever is likely to get you to the root cause faster. In most scenarios, stepping backward is quicker.
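The action-and-expected-reaction walk can be sketched as an ordered list of checks that stops at the first one that fails. The step names and the `observed` state below are hypothetical; in practice each check would query a log, a database, or a queue.

```python
def first_broken_step(steps):
    """Run the checks in order; return the name of the first one that fails, or None."""
    for name, check in steps:
        if not check():
            return name
    return None


# Hypothetical state observed after pressing "button A".
observed = {"order_saved": True, "email_queued": False, "invoice_created": False}

# Each entry pairs an expected reaction with a check that it actually happened.
steps = [
    ("button A saves the order", lambda: observed["order_saved"]),
    ("button A queues the confirmation email", lambda: observed["email_queued"]),
    ("invoice is created from the saved order", lambda: observed["invoice_created"]),
]

broken = first_broken_step(steps)
print(broken)
```

Passing the same list in a different order changes only where the walk begins; the function simply reports the first failing check it meets.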

4. Change in operating environment

Heuristics 1 to 3 were all about the machine within the box. Now we focus on the relationship between the machine and the box.