In a previous article titled "The Perfect Problem", I discussed finding ways to improve outage recovery time by looking at all of the operational aspects surrounding the problem. Now it's time to take a deeper look at the problem itself and see if there are clues to other, perhaps more profound, issues. I call this looking for the "root root cause", since most root cause analysis efforts don't dig for a deeper meaning or ever find the true source. Root causes like "the disk filled up" or "a table needed to be reorganized" are frequently as far as it goes. The root root cause might be something like "no one was watching while Fred was on vacation".
I'll use three real examples, one each for people, process, and technology, to demonstrate what you should be looking to uncover.
The most common fault of a root cause analysis is stopping too soon, generally before identifying the individual who made the change that resulted in an outage. I understand that people can be very sensitive to being called out, but how can we truly improve until we know where the actual problem started? The goal is not a witch hunt, but to help the person improve. It may take a few brave souls to step up and say "I made the mistake. Here's where I went wrong." The first clue that you have this situation is the verbiage being used. Personally, I like "post-mortem", with its clear meaning: our service died, and we're taking that very seriously. If you're using wording like "post-incident review", you may be in trouble. It sounds more like "we have to do this, we really don't want anyone's feelings hurt, so let's just sweep it under the rug".
One of my favorite examples of a process problem comes from many years ago, when my boss made me responsible for the pocket-sized corporate phone directory that had just been horribly misprinted. The root cause was determined to be a mistake made by the printing company. Digging deeper, the root root cause was a process designed to avoid both work and blame for the prior organization. The directory was a mess, even before the fatal distribution. A new team was formed with a different mission: publish the best directory we can. No more CYA, just the absolute best we could do. By recognizing the real problem and fixing it up front, the team went on to change the format, the paper, the binding, and just about everything else. They met with the printer and found a sure-fire way to avoid printing errors. They met with other local companies and brought back fresh ideas. They met over lunch to look for errors, even going as far as calling Hawaii to find a mistake and have it corrected. They were rewarded with more positive feedback than they ever imagined. All by starting with an approach 180 degrees opposite of the old one and finding their own way.
Technology breaks, and technology has bugs. But technology can also be put together in ways, particularly over time, that end up having negative effects much greater than expected. Such was the case with our building's local area network in the 1990s. We had an outage due to a hosed-up network switch that had user PCs attached to it. What was puzzling was why it affected a half-dozen production servers in the computer room. The answer, the root root cause, was a network architecture that resembled a balloon getting bigger and bigger; one problem caused the balloon to pop. So the answer was a new design that would grow horizontally, isolating different services while still allowing them to communicate. The new design housed servers, PCs, the Internet, and the wide-area network in separate "towers" connected by routers, preventing many of the issues inherent in the "balloon". Almost by magic, the network became more reliable, faster, and able to change more quickly with less disruption. And we spent less money by using far cheaper equipment. All this by identifying and solving the real problem.
As you might guess, finding and fixing root root causes can take a lot of work.