The process can be broken down into the following seven components, which I'll describe in some detail. I used it to run a four-hour, once-a-month outage.
- Yearly planning
- Change requests
- First group meeting
- Detailed planning meeting
- Second group meeting
- The outage
- Review meeting
The second step, change requests, can begin at any time, but it ends with detailed change requests being filed and approved no later than twelve days prior to the outage. This gives the outage coordinator time to prepare materials for the first group meeting. Entering all change requests as early as possible is encouraged, even if the details are incomplete at first. This gives the outage coordinator early warning if an unusual number of changes, or changes with wide impact, are being requested. Each request is pegged to a specific outage, making the collection of requested changes a "release".
The first group meeting takes place eleven days prior to the outage and includes the outage coordinator and one person responsible for discussing the details of each change request, preferably the person who will actually be making the change. This meeting lasts no more than one hour, and several key pieces of information are collected for each change, including:
- The business or technical priority of the change
- The riskiness of making the change
- The amount of time needed to make the change
- The amount of time needed to back off the change
- What resources must be available to make the change
- What resources must be unavailable while making the change
The detailed planning meeting then uses this information to schedule the changes, following a few guidelines:
- Make higher-priority changes as early in the outage as possible to ensure completion
- Make riskier changes as early in the outage as possible to ensure sufficient back-off time
- Run parallel streams of changes where they do not have conflicting resource requirements
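The information collected per change and the scheduling guidelines above can be sketched as a small greedy scheduler. This is only an illustration, not the tool used for the outages described here: the `Change` fields, the numeric priority and risk scales, the treatment of every resource as exclusive, and the sample changes are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    priority: int          # assumed scale: 1 = highest priority
    risk: int              # assumed scale: higher = riskier
    minutes: int           # time needed to make the change
    backoff_minutes: int   # time needed to back off the change
    resources: frozenset = frozenset()  # resources the change needs exclusively


def schedule(changes, window_minutes):
    """Place changes greedily: highest priority first, riskier first on ties.

    A change starts as soon as all of its resources are free, so changes
    with no resource in common run as parallel streams. A change that
    cannot finish and still leave its back-off time inside the window is
    deferred for discussion with management.
    """
    busy_until = {}  # resource -> minute at which it becomes free
    plan, deferred = [], []
    for c in sorted(changes, key=lambda c: (c.priority, -c.risk)):
        start = max((busy_until.get(r, 0) for r in c.resources), default=0)
        end = start + c.minutes
        if end + c.backoff_minutes > window_minutes:
            deferred.append(c)
            continue
        for r in c.resources:
            busy_until[r] = end
        plan.append((start, end, c.name))
    return sorted(plan), deferred


# Hypothetical changes for a four-hour (240-minute) outage.
changes = [
    Change("db-upgrade", 1, 3, 90, 60, frozenset({"db"})),
    Change("switch-firmware", 2, 2, 45, 30, frozenset({"network"})),
    Change("app-patch", 2, 1, 30, 15, frozenset({"db"})),
    Change("storage-migration", 3, 3, 120, 90, frozenset({"db"})),
]
plan, deferred = schedule(changes, window_minutes=240)
for start, end, name in plan:
    print(f"{start:>3}-{end:<3} {name}")
print("deferred:", [c.name for c in deferred])
```

In this sample, the database and network changes run side by side, while the storage migration does not fit with its back-off time and lands on the deferred list, which is the kind of change that would be raised with IT management.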
At the completion of the detailed planning meeting there may be changes that will not fit the allotted time. These are discussed with IT management, and if necessary, business leadership, to determine if additional outage time will be permitted or the changes postponed until the next available opportunity. After gaining agreement, a project plan is crafted and published to the group for review and updating.
The second group planning meeting is held four days prior to the outage. The project plan is reviewed line by line, and any last-minute changes are discussed and approved. If everyone has done a good job up to this point, this meeting goes very smoothly and everyone has a clear understanding of the changes and the scheduling.
The outage typically begins two hours prior to the first production application shutdown. During this time, non-production systems are shut down and the readiness for the production outage is verified. While I wish I could say that every outage goes exactly according to plan, they simply don't. Being prepared, however, allows the outage coordinator to make on-the-fly adjustments to ensure that the most critical changes are successfully completed and that as many of the other planned changes as possible are accomplished.
One key, and sometimes uncomfortable, requirement during the outage is that everything that goes wrong is documented for follow-up. An incident is opened for each item and tracked until all items are resolved and closed. The greatest value was gained in fixing the actions performed, not in the changes themselves. An average outage will have several hundred actions (e.g. take down this, verify that, restart this) in support of only a dozen changes. Resolving all the broken actions leads to a smoother outage, more predictable outcomes and services being restored to their proper working order.
The final step is the review meeting, held three days after the outage. Open incidents are discussed and due dates assigned, and they are carried forward to the next planning cycle for as long as necessary to ensure they are resolved. This closed loop is key to continually improving the process. Ideas for improvement are discussed and incorporated into future outage planning by the outage coordinator. The coordinator for the next outage is assigned, allowing several people the opportunity to participate, learn new skills and lend their creative energies.
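The day offsets given throughout this process can be turned into calendar dates with a few lines of Python. This is a minimal sketch: the milestone labels and the sample outage date are made up for illustration, and the detailed planning meeting is left out because no offset is stated for it.

```python
from datetime import date, timedelta

# Day offsets relative to the outage, as stated in the text.
OFFSETS = {
    "change requests filed and approved (deadline)": -12,
    "first group meeting": -11,
    "second group meeting": -4,
    "outage": 0,
    "review meeting": 3,
}


def milestones(outage_day):
    """Return a mapping of milestone label to calendar date."""
    return {
        label: outage_day + timedelta(days=offset)
        for label, offset in OFFSETS.items()
    }


# Hypothetical outage date.
calendar = milestones(date(2024, 6, 15))
for label, day in calendar.items():
    print(day.isoformat(), label)
```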