It is 03:23, your phone rings. You wake up. You pick up your phone to hear that all too familiar voice: Pagerduty alert…. Turns out, one of the microservices is spitting out 500s, messing up the website. You start debugging and find an overloaded database cluster. You scale up the cluster, wait for the website to become usable again and go back to sleep.
The next day you go to work and conduct a blameless post-mortem. You ask Why? several times to get to the root of the problem. It turns out that a crawler hit your website that night. It created an unusually high amount of requests that finally overloaded the database cluster. Satisfied with your work, you go to the kitchen and pour yourself another coffee. You understand the problem, you found the root cause. Only, you didn’t…
What is a root cause, anyway?
A root cause is the one sufficient action that triggers a process that eventually leads, through linear cause-effect relationships, to an event. That’s quite a mouthful, let’s take it apart.
- For it to be a root cause, the action we are looking for has to be a single one. Somebody pushed a button, this kind of thing.
- For the action to be sufficient, it must be the sole trigger for the process. There must be no other necessary condition to start the process.
- The process connects the trigger with the event. It has to be a linear, causal process to ensure a triggering action always leads to the event, otherwise the action would not be sufficient. B always follows A.
An example of such a system is the self-operating napkin you see below.Lifting the spoon to the mouth is all it takes to trigger a process that invariably leads to the napkin wiping the guy’s mouth. The process may be (overly) complicated but it is still a linear, causal chain.
One, two, many root causes?!
Unfortunately, reality is much more messy than that. Many systems and processes are not merely complicated but complex. They are networks of interacting elements (nodes) instead of linear causal chains. Sometimes we can isolate linear subsystems but they are always embedded in larger, networked systems.
In a networked system, effects are generally not mono-causal. Multiple nodes influence each other and with feedback loops, even themselves. An effect does not occur because of event X but because event X happened in the context of properties Y and Z. Had X happened in the context of Y and W, a different effect may have shown. Thus X, Y and Z are all contributing factors to the event. All of them are necessary but only together are they sufficient to cause the event. Oh and by the way, the system is evolving. It may have been pure chance that X, Y and Z where present in the system at the same time.
Let’s get back to our initial story. A webservice talks to a database and suddenly returns 500s to the requester. What is at fault here? Is it the fact that a crawler sends larger amounts of requests than anticipated, slowing down the database with large amounts of queries? Is it the fact that the webservice’s database queries are more expensive than necessary because they where not optimized? Is it the database itself that was sized to idealized traffic and never adapted for real world traffic? Or is it the alert threshold that somebody guessed would be a good value but actually is way too low for a user to recognize as problematic? All these events and properties conspire to form an effect. All of them are necessary, none is sufficient. There just is no single root cause here.
Choose your words wisely
The words we use influence the way we think. To think clearly, our words have to reflect the nature of the thing we are thinking about. When we speak about a root cause, our thinking is primed to search for a linear, causal process that leads us to this root cause. In a networked system, it is much more helpful to speak about something like contributing factors. This opens up our thinking to consider and evaluate multiple options. It makes us wonder what else could contribute.
As a corollary, 5 Whys is not a helpful technique to understand the causes of an event. It supports our working backwards through a linear process to identify the hidden root cause the metaphorical 5 levels down. But as we usually don’t have such a linear process to begin with, we should be very careful when to use this technique. Again, context matters!
In a networked system, a broad, exploratory mindset is much more helpful in identifying contributing factors. How could it be that X happened? What else contributed here? We should rather focus on describing the system than to explain a singular effect. Searching for an explanation narrows the mind, it tries to find closure. Describing the system, on the other hand, opens the mind to exploring the system as a whole. Explanation then becomes a side effect of understanding.