Problems are a fact of life, especially if you live in the perfectly bit-oriented world of computing, where billions of bits must line up in their proper place to be processed at rates exceeding a billion times per second. On the surface it seems an impossible task, but that's the comforting life of IT professionals, who relish keeping the computing universe in alignment twenty-four hours a day, seven days a week. And when the inevitable problem occurs, it gets diagnosed and repaired at dizzying speed. Then it's back to programming more billions of, hopefully but rarely, perfect bits.
But the above is only a fraction of the whole story, and in many cases it is the shortest portion, at least from the viewpoint of the user whose service was interrupted for hours or days. Yet this fraction is typically the focus of the root cause analysis, which concentrates on avoiding the technical problem in the future. Hence a larger opportunity to improve service is wasted. To get at the bigger picture, begin using the concept I call "The Perfect Problem".
What is a Perfect Problem? This will vary somewhat depending on your service agreements, but the general idea covers all the other "stuff" surrounding the technical problem. Simply put, a Perfect Problem is reported, dispatched, escalated, recovered and communicated in the way you designed. These operational aspects can consume 90-99% of the actual time a service was unavailable. It's all about asking questions, a lot of questions, and getting straight answers so everyone gets better. Some examples of the type of questions you need to ask include:
Reported - Did the automated monitoring tools or computer operators see the problem before users started calling the help desk? Could they be improved to shave minutes or possibly hours from the overall duration of the outage?
Dispatched - Did the ticket get assigned to the correct group? Were the right people paged? Was the ticket picked up within the specified time frame? Did the ticket get bounced back and forth trying to find a home? Was all the information gathered to that point included to avoid wasting time asking for it a second or third time?
Escalated - Per procedure, was the problem escalated to a problem coordinator within the proper time period? Was management made aware of high severity problems? Were tickets opened with vendors escalated properly?
Recovered - After the service was fixed, did it come back up as quickly as usual, or were other steps needed? Was all the operational start-up documentation accurate? Could improvements to the documentation make it clearer and less error-prone? Are there ways to speed up restarting the application? How did users know the service was available again?
Communications - Was a message recorded for the automated help line letting users know that you know about the problem? Did that message get updated on schedule? Did it include all the information that should have been there? Were key business groups made aware of the right problems? Was the CIO notified about problems with the critical systems she or he cares the most about?
In other words, a Perfect Problem went exactly like you planned it to go. Nothing more, nothing less.
To capture improvements that make future problems flow properly, each identified deficiency needs to generate an appropriate improvement task with a clear description and owner. Each task remains open, monitored, prioritized and managed until resolved, and open items are regularly reviewed by management.
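One way to picture this review is the small Python sketch below. It compares how long each phase of an outage actually took against the time you planned for it, and opens an improvement task with a description and an owner for every phase that overran. The phase targets, the owner names and the sample timings are all made-up assumptions for illustration; your own numbers would come from your service agreements and procedures, and many of the questions above are qualitative and won't reduce to a timer.

```python
from dataclasses import dataclass

# Hypothetical targets for each phase, in minutes. Real values would come
# from your own service agreements and procedures.
PLANNED_MINUTES = {
    "reported": 5,
    "dispatched": 15,
    "escalated": 30,
    "recovered": 20,
    "communicated": 10,
}


@dataclass
class ImprovementTask:
    phase: str
    description: str
    owner: str
    status: str = "open"  # stays open until someone resolves it


def review_outage(actual_minutes, owners):
    """Return one improvement task for each phase that overran its target."""
    tasks = []
    for phase, planned in PLANNED_MINUTES.items():
        overrun = actual_minutes[phase] - planned
        if overrun > 0:
            description = (
                f"{phase} phase ran {overrun} min over its {planned} min target"
            )
            tasks.append(ImprovementTask(phase, description, owners[phase]))
    return tasks


# Made-up timings for a single outage, in minutes.
actual = {
    "reported": 45,
    "dispatched": 15,
    "escalated": 90,
    "recovered": 20,
    "communicated": 10,
}

# Made-up owners for each phase.
owners = {
    "reported": "monitoring team",
    "dispatched": "help desk",
    "escalated": "problem coordinator",
    "recovered": "application support",
    "communicated": "service desk manager",
}

tasks = review_outage(actual, owners)
for task in tasks:
    print(f"[{task.status}] {task.owner}: {task.description}")
```

With these sample numbers only the reported and escalated phases overran, so only those two produce tasks. A problem that produces an empty list went exactly as planned.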
In my experience, the longest outages are not typically caused by unusually difficult technical problems, but by deficiencies in executing the surrounding processes. And that's totally within our control to improve.