Tuesday, November 11, 2008

Planning an Effective Outage

The early days of computing involved running batch job schedules and online systems during normal business hours, leaving plenty of time for application and infrastructure personnel to make changes, generally with each person doing their own thing during their own time slot. Over time the number and duration of open slots diminished, and with the advent of ERP and customer-facing web sites requiring near-continuous availability, these slots have typically been reduced to a few hours once a month, once a quarter or even once a year. Planning and executing an effective outage requires developing a new set of project skills and a broader view of the interrelationships between the various application and infrastructure components. The purpose of this Technology Viewpoint is to offer an overall structure, one that changed a broken process into a highly effective planning tool.

The process can be broken down into the following seven components, each of which I'll describe in some detail. I used this process to run a four-hour, once-a-month outage.

  • Yearly planning
  • Change requests
  • First group meeting
  • Detailed planning meeting
  • Second group meeting
  • The outage
  • Review meeting

The yearly planning process brings together the key people who are aware of major events that will occur on a predetermined date (e.g. an SAP go-live) or must be accomplished in a certain timeframe (e.g. replacing a disk subsystem in Q4). The team discusses these events, compromises on a schedule if necessary, and broadly publishes a calendar to inform all team members which outage will contain these major items. While this is not a large effort, the results are critical to avoiding additional, unscheduled outages that otherwise would not have been required.

The second step, filing change requests, can begin at any time, but it ends with detailed change requests being filed and approved no later than twelve days prior to the outage. This gives the outage coordinator time to prepare materials for the first group meeting. Requesters are encouraged to enter all change requests as early as possible, even though the details might be incomplete at first. This gives the outage coordinator an early warning if an unusual number of changes, or changes with a wide impact, are being requested. Each request is pegged to a specific outage, making the collection of requested changes a "release".

The first group meeting takes place eleven days prior to the outage and includes the outage coordinator and one person responsible for discussing the details of each change request, preferably the person who will actually be making the change. This meeting lasts no more than one hour, and several key pieces of information are collected for each change, including:

  • The business or technical priority of the change
  • The riskiness of making the change
  • The amount of time needed to make the change
  • The amount of time needed to back off the change
  • What resources must be available to make the change
  • What resources must be unavailable while making the change
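
As a rough illustration, the six items collected for each change could be captured in a small record like the one below. This is only a sketch: the class name, the field names, the 1-to-5 risk scale and the sample changes are my own assumptions and are not part of the process described here.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """One change as captured in the first group meeting (names are illustrative)."""
    name: str
    priority: int          # 1 = highest business or technical priority (assumed scale)
    risk: int              # 1 = low risk, 5 = high risk (assumed scale)
    change_minutes: int    # time needed to make the change
    backoff_minutes: int   # time needed to back off the change
    needs_up: set = field(default_factory=set)    # resources that must be available
    needs_down: set = field(default_factory=set)  # resources that must be unavailable

    def worst_case_minutes(self) -> int:
        """Time consumed if the change is made and then has to be backed off."""
        return self.change_minutes + self.backoff_minutes

    def conflicts_with(self, other: "ChangeRequest") -> bool:
        """True if one change needs a resource up while the other needs it down."""
        return bool(
            (self.needs_up & other.needs_down) or (self.needs_down & other.needs_up)
        )


# Hypothetical sample data
firmware = ChangeRequest("SAN firmware", 1, 4, 60, 45, needs_down={"san"})
patch = ChangeRequest("DB patch", 2, 2, 30, 20, needs_up={"san"}, needs_down={"db"})
web = ChangeRequest("Web server upgrade", 3, 1, 20, 10, needs_down={"web"})

print(firmware.worst_case_minutes())   # 105
print(firmware.conflicts_with(patch))  # True: the patch needs the SAN up
print(firmware.conflicts_with(web))    # False: no shared resources
```

Recording the back-off time next to the change time makes it easy to see how much of the outage window a change could consume in the worst case.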

Immediately following the first group meeting, a small group, typically four to six people, meets in the detailed planning meeting to determine the sequence of actions (e.g. shut down system A) and changes within the outage. Since there is typically a large number of changes, and a much larger list of actions to be performed to support the changes, this team focuses its creative energy on these priorities:

  • Make higher-priority changes as early in the outage as possible to ensure completion
  • Make riskier changes as early in the outage as possible to ensure sufficient back-off time
  • Run parallel streams of changes where they do not have conflicting resource requirements

Each change is listed on a large colored "sticky" note and arranged on a white board that has vertical timelines drawn on it. By using each "row" to signify a parallel change thread, aligning the notes in time sequence, and then drawing lines to show the relationships between changes, the team quickly builds a picture of the outage that all members can see.
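
The three priorities above can be approximated in code. The following is a minimal sketch of a greedy placement, not the method the team actually used (that was sticky notes on a white board): it orders changes by priority and then by risk, and starts each one at the earliest time where it does not overlap another change that touches the same resource. The `Change` class, the `plan_outage` function, the simplified "exclusive resources" model and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    priority: int         # 1 = highest priority (assumed scale)
    risk: int             # higher = riskier (assumed scale)
    minutes: int          # time needed to make the change
    resources: frozenset  # resources the change needs exclusively (simplification)


def overlaps(start_a, end_a, start_b, end_b):
    """True if the half-open intervals [start_a, end_a) and [start_b, end_b) intersect."""
    return start_a < end_b and start_b < end_a


def clashes(change, start, placed):
    """True if the change, started at `start`, would overlap a placed change
    that uses one of the same resources."""
    end = start + change.minutes
    return any(
        overlaps(start, end, s, e) and change.resources & other.resources
        for s, e, other in placed
    )


def plan_outage(changes, window_minutes):
    """Place high-priority, then high-risk, changes first; run others in parallel."""
    ordered = sorted(changes, key=lambda c: (c.priority, -c.risk))
    placed, deferred = [], []
    for change in ordered:
        # The earliest feasible start is minute 0 or the end of a placed change.
        candidates = sorted({0} | {end for _, end, _ in placed})
        start = next(t for t in candidates if not clashes(change, t, placed))
        if start + change.minutes > window_minutes:
            deferred.append(change.name)   # does not fit the allotted time
        else:
            placed.append((start, start + change.minutes, change))
    return sorted((s, e, c.name) for s, e, c in placed), deferred


# Hypothetical changes for a four-hour (240-minute) outage
changes = [
    Change("Replace disk subsystem", 1, 5, 120, frozenset({"san"})),
    Change("Patch database", 1, 3, 60, frozenset({"db", "san"})),
    Change("Upgrade web server", 2, 2, 45, frozenset({"web"})),
    Change("Update firmware", 3, 2, 90, frozenset({"san"})),
]
schedule, deferred = plan_outage(changes, 240)
for start, end, name in schedule:
    print(f"{start:>3}-{end:<3} {name}")
print("Deferred:", deferred)
```

With this sample data the web server upgrade runs in parallel with the disk replacement, the database patch follows the disk work because both touch the storage, and the firmware update is reported as deferred because it cannot finish inside the window. Changes that overlap in time correspond to separate rows on the white board.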

At the completion of the detailed planning meeting there may be changes that will not fit the allotted time. These are discussed with IT management, and if necessary, business leadership, to determine if additional outage time will be permitted or the changes postponed until the next available opportunity. After gaining agreement, a project plan is crafted and published to the group for review and updating.

The second group meeting is held four days prior to the outage. The project plan is reviewed line by line, and any last-minute changes are discussed and approved. If everyone has done a good job up to this point, this meeting goes very smoothly and everyone leaves with a clear understanding of the changes and the scheduling.

The outage typically begins two hours prior to the first production application shutdown. During this time, non-production systems are shut down and the readiness for the production outage is verified. While I wish I could say that every outage goes exactly according to plan, they simply don't. Being prepared, however, allows the outage coordinator to make on-the-fly adjustments to ensure that the most critical changes are successfully completed and that as many of the other planned changes as possible are accomplished.

One key, and sometimes uncomfortable, requirement during the outage is that everything that goes wrong is documented for follow-up. An incident is opened for each item and tracked until all items are resolved and closed. The greatest value was gained in fixing the actions performed, not in the changes themselves. An average outage will have several hundred actions (e.g. take down this, verify that, restart this) in support of only a dozen changes. Resolving all the broken actions leads to a smoother outage, more predictable outcomes and services being restored to their proper working order.

The final step is the review meeting, held three days after the outage. Open incidents are discussed and assigned due dates, and they are carried forward to the next planning cycle for as long as necessary to ensure they are resolved. This closed loop is key to continually improving the process. Ideas for improvement are discussed and incorporated into future outage planning by the outage coordinator. The coordinator for the next outage is assigned, allowing several people the opportunity to participate, learn new skills and lend their creative energies.
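
The day offsets mentioned throughout this process (twelve, eleven and four days before the outage, and three days after) make it easy to generate a calendar for any outage date. Here is a small sketch; the function name, the milestone labels and the sample outage date are hypothetical.

```python
from datetime import date, timedelta

# Day offsets relative to the outage date, as described in the process.
MILESTONES = {
    "Change requests filed and approved (latest)": -12,
    "First group meeting and detailed planning meeting": -11,
    "Second group meeting": -4,
    "Outage": 0,
    "Review meeting": 3,
}


def outage_calendar(outage_day: date) -> dict:
    """Return each milestone's calendar date for the given outage date."""
    return {
        name: outage_day + timedelta(days=offset)
        for name, offset in MILESTONES.items()
    }


# Hypothetical outage date
calendar = outage_calendar(date(2008, 12, 13))
for name, day in calendar.items():
    print(day.isoformat(), name)
```

Publishing these dates along with the yearly calendar gives everyone the change request deadline well in advance.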
