Here’s Why AIOps isn’t the Magic Bullet for Root Cause Analysis


“What was the cause?” is the most common question we hear after an outage. In every industry we strive to give better service to clients, run our operations as smoothly as possible, avoid malfunctions, and catch them as early as we can when they do happen. In this article, Avishai Ish-Shalom explains how understanding what caused a problem allows us to fix it and prevent future problems.

It is not surprising that organizations small and large invest considerable time and money into researching the cause of events – monitoring systems, postmortems and, of course, root cause analysis (RCA) systems that promise to automatically find the “root cause” of various problems. Unfortunately, these efforts often fail to provide the expected value. Why? As it turns out, there are several fundamental and practical problems with our search for “causes”, and the whole notion of “root cause” has become highly disputed within the field of safety engineering.

The debate boils down to something like this: In a complex system, we often find that no single cause is sufficient in itself to trigger the effect we are researching. A combination of multiple independent conditions is necessary for the effect to manifest but not one of them triggers it on its own. In such a case, what would be the “root” cause?

Instead, researchers are promoting the concept of ‘contributing factors’ to reflect the reality of multiple causes. Additionally, “root” means primary, the cause of everything else – but that is a subjective judgment, as the root cause itself can also be said to have been caused by something else. As such, it reflects our bias about what we assume caused the problem, not an objective cause.

But these problems, as prominent as they are, seem lightweight when we ask a more fundamental question: what is a “cause”? What is causality? If we hope to build tools that automatically analyze a system to find a cause, we should be able to give a concrete definition a computer can work with, or give up on the concept of automated analysis entirely.

Causality demystified

So what is a cause? In essence, it is a hypothesis that whenever A occurs, B will occur as a result. This formulation hides a critical detail, that B is expected to happen after A. In other words, there is a temporal dependence. So when we say something like “a database failure caused site blackout” we are saying that:

  1. Database fails (at time TA)
  2. Site blacks out (at time TB > TA)
  3. Other factors have negligible contribution (noise)

And that this sequence of events will always be observed. Let us convert this into a general claim by denoting “database failure” with A and “site blackout” with B. We are claiming that the sequence of events (A, B) will be observed in this temporal order (TA < TB). So when we observe event A, we know we should expect B to happen shortly after, as A causes B.

But what if A happens, and B doesn’t? How long should we wait for B before we can say that it won’t happen? What if in 1 million occurrences of A, we observed B only in 87% of trials?

We can expand our definition slightly to deal with these problems. First, we claim not only that TA < TB but also that TB – TA < Tlimit; that is, both events must happen within a specified timeframe. Second, we assign a probability that B will happen after A, denoted as P(B|A) (the probability of B happening given that A has happened).
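To make this concrete, here is a minimal sketch (in Python, using hypothetical event names and timestamps) of how one might estimate P(B|A) from an event log by counting how often B follows A within the time limit:

```python
from datetime import datetime, timedelta

# Hypothetical event log: (timestamp, event name) pairs from monitoring.
events = [
    (datetime(2020, 1, 1, 10, 0, 0), "db_failure"),      # A
    (datetime(2020, 1, 1, 10, 0, 40), "site_blackout"),  # B follows within the limit
    (datetime(2020, 1, 2, 14, 0, 0), "db_failure"),      # A with no B afterwards
]

def estimate_p_b_given_a(events, a="db_failure", b="site_blackout",
                         t_limit=timedelta(minutes=5)):
    """Estimate P(B|A): the fraction of A occurrences followed by B within t_limit."""
    a_times = [t for t, name in events if name == a]
    b_times = [t for t, name in events if name == b]
    followed = sum(
        1 for ta in a_times
        if any(ta < tb <= ta + t_limit for tb in b_times)
    )
    return followed / len(a_times) if a_times else 0.0

print(estimate_p_b_given_a(events))  # 0.5: one of two db failures was followed by a blackout
```

Note that this only measures how often the two events co-occur in the expected order; as the next section argues, that is evidence about the effects of causality, not causality itself.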

Now that we have a definition of causality, all we need to do is measure it, right? If only it were so easy. As it turns out, measuring causality directly is practically impossible.

Suppose we were to conduct research on whether cigarettes cause cancer. We would have two groups of people, smokers and non-smokers, and we could count how many in each group develop a form of cancer after 30 years. Let’s suppose group A (the smokers) had 200 cancer cases out of 1,000 people, and that group B (the non-smokers) had 50 cancer cases out of 1,000. Does that prove that cigarettes cause cancer 20 percent of the time? Certainly not. This could just be coincidence; cancer could be random in the general population, and we may have picked an unlucky group by chance. It could also be that group A differs from group B in ways other than smoking. What if their lifestyles were different – could that be the underlying common cause of both smoking and cancer?

All we can really say is that there is (some) correlation between two different measured variables, and as we all know correlation does not mean causation. In other words, we can measure the effects of causality, but not causality itself. Causality is a theory of how things work, which can be corroborated or disproved, but never proved.
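As a sketch of how little such numbers tell us on their own, here is a quick calculation over the article’s hypothetical figures (Python, assuming SciPy is available): a chi-square test can tell us the difference between the groups is very unlikely to be chance alone, but it says nothing about why the groups differ.

```python
from scipy.stats import chi2_contingency

# Contingency table from the hypothetical study above:
# rows = smokers / non-smokers, columns = cancer / no cancer.
observed = [
    [200, 800],  # smokers: 200 cancer cases out of 1,000
    [50, 950],   # non-smokers: 50 cancer cases out of 1,000
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2={chi2:.1f}, p={p_value:.2g}")
# A tiny p-value says the difference is unlikely to be random noise,
# but not *why* the groups differ: smoking, lifestyle, or some other
# confounder could all produce the same table.
```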


Automated Root Cause Analysis

As causation cannot be measured, should we give up on the notion of automated system analysis? Certainly not.

While correlation does not imply causation, causation often (though not always – especially if we didn’t measure the right factors) gives rise to correlation. We can take that to mean that correlations can be used to evaluate a hypothesis about underlying causes in the system. In other words, when we do see a high correlation between A and B, we can postulate that A caused B, that B caused A, or that there is a yet unknown C which is a common cause of A and B. Using our probabilistic definition of causality, we can even build a causal graph with conditional probabilities for the assumed causes.

Using a causal graph, we can employ methods such as Bayesian causal inference to gain statistical evidence from correlations for claims like “B is the cause of D” or “C is the cause of both A and B”. The problem, of course, is that the graph only contains the assumed causes we can think of. Nothing guarantees that the graph isn’t missing a key element, or that we are in fact seeing the combined effect of many different causes – not uncommon in complex systems.
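As a rough illustration (not a full Bayesian causal inference engine), the sketch below builds a small assumed causal graph with hypothetical event names and estimates a conditional probability for each assumed edge from incident co-occurrence data. The estimates remain correlational evidence for a structure we postulated ourselves:

```python
# Hypothetical assumed causal graph: edges point from assumed cause to effect.
causal_graph = {
    "db_failure": ["site_blackout"],
    "network_partition": ["db_failure", "cache_miss_spike"],
}

# Hypothetical incident history: each incident is the set of events observed together.
incidents = [
    {"network_partition", "db_failure", "site_blackout"},
    {"db_failure", "site_blackout"},
    {"cache_miss_spike"},
    {"db_failure"},
]

def conditional_probability(effect, cause, incidents):
    """Estimate P(effect | cause) from how often they co-occur in incidents."""
    with_cause = [i for i in incidents if cause in i]
    if not with_cause:
        return 0.0
    return sum(effect in i for i in with_cause) / len(with_cause)

# Attach an estimated conditional probability to every assumed edge.
for cause, effects in causal_graph.items():
    for effect in effects:
        p = conditional_probability(effect, cause, incidents)
        print(f"P({effect} | {cause}) ~ {p:.2f}")
```

If the real system contains a cause that never made it into the graph (or into the event data), no amount of computation over this structure will surface it.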

In many ways, this is the basic problem of science itself. We can find correlations and evidence automatically, but postulating theories requires a different kind of intelligence – the kind machines aren’t particularly good at.


AIOps for the win!

Although automated root cause analysis can only give us probabilistic indications and can only find known causes, this does not mean it is useless; we don’t need to know the cause of things with 100 percent certainty to resolve issues! If the remediation procedure is cheap and safe enough, we can apply it opportunistically: if it resolves the issue, great! And if it doesn’t, no harm done.

This is the basis for AIOps: while human intervention is often too expensive to be dispensed indiscriminately, computers can run many maintenance operations at low cost. When we design monitoring for humans, we need to worry about false positives much more than with automated procedures. This makes automated root cause analysis a perfect fit for auto-remediation tools, provided our remediation policies are indeed safe to run – even if needlessly.
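A minimal sketch of such an opportunistic remediation loop might look like the following, assuming a hypothetical catalogue of remediations that are explicitly marked safe to run even when the suspected cause turns out to be wrong:

```python
# Hypothetical remediation catalogue: only actions marked safe may run automatically,
# i.e. running them needlessly causes no harm.
REMEDIATIONS = {
    "db_failure": {"action": "restart_db_replica", "safe": True},
    "network_partition": {"action": "page_network_oncall", "safe": False},
}

def auto_remediate(ranked_causes, threshold=0.3):
    """Opportunistically apply safe remediations for likely (not certain) causes."""
    for cause, probability in ranked_causes:
        if probability < threshold:
            continue  # too unlikely to be worth acting on
        remedy = REMEDIATIONS.get(cause)
        if remedy and remedy["safe"]:
            print(f"Applying {remedy['action']} (P(cause) = {probability:.2f})")
        else:
            print(f"Escalating {cause} to a human (no safe automatic remedy)")

# Suppose our RCA tooling ranked these candidate causes for the current incident:
auto_remediate([("db_failure", 0.62), ("network_partition", 0.41)])
```

The key design choice is that certainty is traded for cheapness: a false positive only costs us a harmless action, whereas paging a human for every weak signal does not scale.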

Cyborg Causality Analysis

In 1997, IBM’s Deep Blue defeated the reigning world chess champion, and since 2006 chess engines have been consistently beating the world’s best human players. But for a very long time (and even today), engines were still frequently losing to humans who were aided by computers. This team-up is known as “Cyborg Chess” and turned out to be a powerful combination.

Just as in chess, this powerful combination of man and machine is providing an unparalleled advantage in many domains. Automated RCA in itself cannot find causes efficiently, but a human assisted by advanced correlation and RCA tools can.

This synergy allows engineers to quickly find relevant information in the vast sea of data we are increasingly generating, to postulate hypotheses, and to have machine learning algorithms analyze the evidence for them. While humans are fraught with biases, machines have biases of their own; but machine biases are often quite different, so engineers and machines working together help expose each other’s biases.

You are only as good as your tools, but your tools are worthless without you.
