How to Perform a Simple Root Cause Analysis

[Slide 1]

Hey, everyone. In this video we look at how to perform root cause analysis and why.

[Slide 2]
First, some definitions.

Each event, wanted or unwanted, is the end of a chain of cause and effect. A root cause of an unwanted event is a failure that cascades through the chain until you get something you really don't want.

[Slide 3]
This is a model of a chain of events.

Each event causes either the conditions for the next event, the next event, or both.

Note that a single event can be affected by two or more previous events. When two events contribute, the root cause analysis must be pursued for both.
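One way to picture this model is as a small graph that we walk backward from the incident. The sketch below is only an illustration: the event names and the `find_root_causes` helper are hypothetical, and it assumes the chain has no loops.

```python
# Hypothetical chain of events: each key is an event, and each value lists
# the earlier events that directly contributed to it.
causes = {
    "incident": ["event_c"],
    "event_c": ["event_a", "event_b"],  # two events contribute here
    "event_a": [],
    "event_b": [],
}


def find_root_causes(event, causes):
    """Walk back from an event and return every event with no recorded cause."""
    parents = causes.get(event, [])
    if not parents:
        return {event}
    roots = set()
    for parent in parents:
        # Pursue the analysis for every contributing event, not just one.
        roots |= find_root_causes(parent, causes)
    return roots


print(sorted(find_root_causes("incident", causes)))  # ['event_a', 'event_b']
```

Because `event_c` has two contributing events, the walk follows both branches and reports both as root causes.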

When everything works as expected, we achieve the expected outcomes or prevent unwanted outcomes, like security incidents.

However, if the conditions surrounding an event are wrong, or the event should never have happened, we get something we don't want.

[Slide 4]
Each unwanted outcome has a proximate cause.

A proximate cause is the event that directly causes the incident. However, this is not usually the root cause. Instead, the proximate cause is created by something that happened earlier in the cause-and-effect chain: the root cause.

If we simply put a bandage on the proximate cause, we are not addressing the real problem.

The root cause will still exist with the potential to cause more problems.

[Slide 5]
The analysis approach we take in this video is the five whys.

It's simple and anyone can perform it.

We begin asking why at the incident and work back to one or more root causes.

If we don't have a root cause by the fifth why, we likely missed something. We need to go back to the beginning and check our work.
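The loop itself can be sketched in a few lines. This is a minimal illustration, not a formal procedure: the `five_whys` function, its two callbacks, and the sample findings are all hypothetical stand-ins for the answers a real team would supply.

```python
MAX_WHYS = 5


def five_whys(incident, answer_why, is_root_cause):
    """Ask why up to five times; return the chain and whether a root cause was reached."""
    chain = [incident]
    for _ in range(MAX_WHYS):
        cause = answer_why(chain[-1])
        if cause is None:  # nobody can answer this why yet
            break
        chain.append(cause)
        if is_root_cause(cause):
            return chain, True
    # No root cause by the fifth why: go back and check the work.
    return chain, False


# Hypothetical findings: each entry answers "why did this happen?"
findings = {
    "incident": "proximate cause",
    "proximate cause": "earlier failure",
    "earlier failure": "root cause",
}

chain, found = five_whys(
    "incident",
    answer_why=lambda event: findings.get(event),
    is_root_cause=lambda cause: cause == "root cause",
)
print(chain, found)
```

When the function returns `False`, the chain it hands back is the starting point for the recheck.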

It's important to remember that more than one chain of events can often cause the problems we analyze. When this happens, it might be better to use more detailed analysis techniques if you have someone trained to use them. If you don't, you are still better off using the five whys than simply treating the proximate cause. Personally, I used only the five whys throughout my IT and IT security career. It was all I ever needed.

You can read more about other ways to use the five whys approach in this Six Sigma article.

[Slide 6]

Our final definition addresses an event. An event is something that happens, either expectedly or unexpectedly, within a set of conditions. For example:

An authorized user is given privileged access to the financial system. First step?

Ask the first WHY. Note that each event occurs within a set of conditions. It is often the conditions within which an event occurs that produce the unexpected result. So we must always look at both the event and the conditions present at the time.

In this example, the tech made a change in accordance with policy. She received an email that appeared to be from the data owner authorizing access. However, we discover that the data owner did not actually send the email. Consequently, the next WHY should address the conditions and event that resulted in the tech receiving a forged email.

[Slide 7]
Now we step through a short root cause analysis for a security incident: the theft of intellectual property.

Our first WHY provides the conditions and event that are the proximate cause.

We see that an employee who quit on the previous Friday accessed and stole the intellectual property over the weekend because his access had not been disabled. Further, controls do not exist to detect, prevent, or alert when anomalous behavior occurs. Why would someone copy intellectual property, in this example a large number of engineering files, to an external location over the weekend?

We then proceed further toward the root cause with the second WHY. In this case, how was a terminated employee able to access network resources?

We find that the help desk tech did not see the termination email, sent in accordance with termination policy, until Monday morning. The email was not prioritized and sat unseen among the forty or fifty unread email messages on Friday afternoon.

In this example, two whys get us to the root cause: termination messages sent to the help desk are not handled in a way that ensures the timely disabling of accounts.

Once we believe we have identified the root cause, we step forward from the root cause to the final incident to check that our findings make sense and that we did not miss anything.
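This sanity check can be sketched as a walk along the recorded links, confirming that each effect is the cause of the next link. The wording of each link and the `chain_is_connected` helper are my own hypothetical summary of this example, not a standard tool.

```python
# Hypothetical one-line summaries of the events in this example.
ROOT = "termination messages are not handled in a timely way"
UNSEEN = "tech did not see the termination email until Monday"
ENABLED = "former employee's access was still enabled over the weekend"
INCIDENT = "intellectual property was stolen"

# Each link is (cause, effect), listed from the root cause toward the incident.
links = [
    (ROOT, UNSEEN),
    (UNSEEN, ENABLED),
    (ENABLED, INCIDENT),
]


def chain_is_connected(links, root_cause, incident):
    """Return True if the links lead step by step from root cause to incident."""
    current = root_cause
    for cause, effect in links:
        if cause != current:  # a gap means we missed something
            return False
        current = effect
    return current == incident


print(chain_is_connected(links, ROOT, INCIDENT))  # True
```

If a link is missing, the check fails, which is the signal to go back and ask another why at that point in the chain.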

Once satisfied, we modify specific conditions or events to help ensure we do not lose intellectual property again because of a missed termination.

In this example, we have two ways to do this. Typically, we want to modify the chain of events as close as possible to the root cause. We could require senders to prioritize termination emails. Better yet, we could have termination messages sent to a special mailbox where all messages must be addressed immediately upon receipt. At a minimum, all terminations must be properly addressed on the day of receipt.

This results in the complete removal of the proximate cause.

Another approach here is the additional implementation of monitoring controls to look for anomalous network and user behavior. While not directly addressing the root cause in our example, monitoring would apply across all types of unwanted access resulting from known and unknown causes.

In my opinion, both the root cause and the lack of monitoring are important considerations going forward.

[Slide 8]
We close with a look at root cause analysis participants and how to conduct the analysis. The people who participate and how participation is managed are both critical pieces to an effective analysis.

The team should consist of everyone who was involved in the chain of cause and effect leading up to a security or business continuity incident. This usually includes identified staff from both IT and the business. The people selected depend on what happened and on the most probable links in the cause-and-effect chain. Sometimes, we must pause an analysis to invite others as we reveal conditions or events that need input from people not at the meeting.

The team lead does not have to be a manager. In fact, leaving managers out of these meetings can result in a more open discussion. However, the team lead should be experienced in root cause analysis procedures.

The analysis meeting is not a blame game. The leader of the analysis effort must strictly manage this. All input must be open and honest. Most of the meetings I attended were largely brainstorming sessions. Unless openness is encouraged, silence might be the only input the leader receives.

Finally, consider all input relevant until proven otherwise. Do not simply disregard someone's comments because the majority think they are wacky. I've seen wacky proven relevant.

[Slide 9]

If you have questions or comments about this video, please leave them on my blog. You can also send email, including ideas for future videos, to one of the addresses listed here.

And until next time, be careful what you click.
