Often the lessons of the past shed light on how to solve the problems of the present. As the title of this post hopefully suggests, the early days of computing were spent in splendid isolation. We didn't connect to each other except through a few shared business applications. There were no PCs, no LANs, no email and no Internet. And our world was very secure as a result.
The client/server era was ushered in by the PC, with communications generally limited to low-speed modems and perhaps a coax connection to maintain compatibility with the mainframe. Data was shared on 5 1/4" floppy disks holding less than a megabyte of information. But the beginning of the end of isolation had arrived.
The introduction of the aforementioned LANs, email and the Internet ended our splendid isolation and provided the platform for spam, viruses of all types and rootkits, just to name a few, to flourish. Ever since, we've been piling on solutions to combat these ills, but rarely striking blows in the proper direction: returning to splendid isolation, architected from the ground up.
In 2003 the MS-Blaster worm infected a large number of PCs, including my work computer, resulting in a few days of late nights, manic patching and shutting off connections to the outside world. Quite the opposite happened with my home computer, which was never infected. After restoring service for the company, I reflected on how my professionally managed PC, equipped with all the latest anti-virus software and up-to-date patches and protected by world-class Internet firewalls, got infected, while my home computer, equipped with free anti-virus software and a $50 firewall, breezed right through. Was it the luck of the draw?
The answer was the splendid isolation my home computer existed in. MS-Blaster spread from PC to PC across open LAN connections, and my home computer was simply all alone. In my work environment there is no need for my PC to communicate directly with any other PC; all sharing is performed via servers and applications in the data center. Yet the LAN is open for no particularly good reason; it's just always been done that way. We can learn from the Internet, where trust is not granted, and extend that same model to our internal networks: make it impossible for PCs to talk to each other. Return them to a greater degree of isolation and greatly reduce the chances that an MS-Blaster can spread.
Since the MS-Blaster incident, my home network has grown and now includes a few laptops and a couple of desktops. Learning the lesson from the past, I implemented AP (Access Point) Isolation on my wireless router/firewall, which lets each device reach the Internet but not the other devices on the LAN, once again returning each of my PCs to its own island of isolation.
Another symptom of lost isolation presents itself when software vendors require dedicated servers for their products, even when the product needs only a fraction of the server's computing power. Dating back to the early days of Windows, applications could easily interfere with each other, and the operating system lacked the controls to prioritize workload, isolate memory and ensure system stability. So it's not surprising that virtualization solutions, for example VMware and Xen, have become very popular in the last couple of years. While vendors initially resisted supporting their applications in a virtual machine, market realities changed their minds. Emerging cloud computing infrastructures have virtualization at their core as they strive to serve multiple customers on common hardware. While this form of isolation, from both other applications and the variety of server hardware, does not solve the problem at its source, it does provide service and risk benefits beyond a modest lowering of platform cost.
VMware appliances are an innovative way to deliver software demonstrations or a complete Linux distribution that you can run on a Windows desktop. A secure environment could be developed to let a corporate desktop run as a virtual appliance on an employee's home computer, eliminating the need to lug a laptop back and forth to work every day. The benefits of isolation are just beginning to be explored.
Making isolation, in all its various forms, a key technology strategy will lead to lower costs, improved service levels and some welcome relief for your security department.
Wednesday, November 26, 2008
Tuesday, November 11, 2008
Planning an Effective Outage
The early days of computing involved running batch job schedules and online systems during normal business hours, leaving plenty of time for application and infrastructure personnel to make changes, generally with each person doing their own thing during their own time slot. Over time the number and duration of open slots diminished, and with the advent of ERP and customer-facing web sites requiring near-continuous availability, these slots have typically been reduced to a few hours once a month, once a quarter or even once a year. Planning and executing an effective outage requires developing a new set of project skills and a broader view of the interrelationships between the various application and infrastructure components. The purpose of this Technology Viewpoint is to offer an overall structure, one that changed a broken process into a highly effective planning tool.
The process can be broken down into the following seven components, which I'll describe in some detail and which I used to run a four-hour, once-a-month outage:
- Yearly planning
- Change requests
- First group meeting
- Detailed planning meeting
- Second group meeting
- The outage
- Review meeting
The second step, change requests, can begin at any time but must end with detailed change requests filed and approved no later than twelve days prior to the outage. This gives the outage coordinator time to prepare materials for the first group meeting. Everyone is encouraged to enter change requests as early as possible, even if the details are incomplete at first; this gives the coordinator an early warning if an unusual number of changes, or widely impacting changes, are being requested. Each request is pegged to a specific outage, making the collection of requested changes a "release".
The first group meeting takes place eleven days prior to the outage and includes the outage coordinator and one person responsible for discussing the details of each change request, preferably the person who will actually be making the change. The meeting lasts no more than one hour, and several key pieces of information are collected for each change, including:
- The business or technical priority of the change
- The riskiness of making the change
- The amount of time needed to make the change
- The amount of time needed to back out the change
- What resources must be available to make the change
- What resources must be unavailable while making the change
In the detailed planning meeting, the outage coordinator uses this information to build the outage schedule, guided by a few simple rules (a rough scheduling sketch follows this list):
- Make higher priority changes as early in the outage as possible to ensure completion
- Make riskier changes as early in the outage as possible to ensure sufficient back-out time
- Run parallel streams of changes where they do not have conflicting resource requirements
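To make these rules concrete, here is a minimal sketch in Python of how a coordinator might rough out parallel streams for a four-hour window. The `Change` fields, the `rough_schedule` function and the greedy placement logic are purely illustrative assumptions of mine, not part of the process above or of any change-management product; a real plan would also weigh dependencies and exact timings.

```python
from dataclasses import dataclass

# Hypothetical change-request record; field names are illustrative
# and not taken from any particular change-management tool.
@dataclass
class Change:
    name: str
    priority: int         # 1 = highest business/technical priority
    risk: int             # 1 = riskiest
    duration_min: int     # minutes needed to make the change
    backout_min: int      # minutes needed to back out the change
    resources: frozenset  # resources that must be unavailable during the change

def rough_schedule(changes, streams=3, window_min=240):
    """Greedy sketch: place high-priority, high-risk changes earliest,
    spreading them across parallel streams whose resource needs do not
    conflict, and leaving room to back each change out within the window."""
    plan = [[] for _ in range(streams)]   # changes assigned to each stream
    used = [0] * streams                  # minutes consumed in each stream
    deferred = []                         # candidates for the next outage
    for c in sorted(changes, key=lambda c: (c.priority, c.risk)):
        placed = False
        for i in range(streams):
            # Conservative check: never put resource-conflicting changes in
            # different (concurrent) streams; same-stream changes run one
            # after another, so sharing resources within a stream is fine.
            conflict = any(c.resources & other.resources
                           for j, s in enumerate(plan) if j != i
                           for other in s)
            if not conflict and used[i] + c.duration_min + c.backout_min <= window_min:
                plan[i].append(c)
                used[i] += c.duration_min
                placed = True
                break
        if not placed:
            deferred.append(c)
    return plan, deferred
```

Feeding it the dozen or so approved changes in a release yields a first cut at the streams, plus a list of changes that need either more outage time or a later opportunity, which is exactly the decision described next.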
At the completion of the detailed planning meeting there may be changes that will not fit in the allotted time. These are discussed with IT management and, if necessary, business leadership to determine whether additional outage time will be permitted or the changes postponed until the next available opportunity. After gaining agreement, a project plan is crafted and published to the group for review and updating.
The second group planning meeting is held four days prior to the outage; the project plan is reviewed line by line and any last-minute changes are discussed and approved. If everyone has done a good job up to this point, this meeting goes very smoothly and everyone leaves with a clear understanding of the changes and the schedule.
The outage itself typically begins two hours prior to the first production application shutdown, during which non-production systems are shut down and readiness for the production outage is verified. I wish I could say that every outage goes exactly according to plan; they simply don't. But being prepared allows the outage coordinator to make on-the-fly adjustments to ensure the most critical changes are completed successfully and as many of the other planned changes as possible are accomplished.
One very key, and sometimes uncomfortable, requirement during the outage is that everything that goes wrong is documented for follow-up. An incident is opened for each item and tracked until all items are resolved and closed. The greatest value comes from fixing the broken actions, not the changes themselves: an average outage has several hundred actions (e.g., take this down, verify that, restart this) in support of only a dozen changes. Resolving all the broken actions leads to a smoother outage, more predictable outcomes and services being restored to their proper working order.
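To give a feel for the scale, here is another small Python sketch, with made-up change and action names, showing how a dozen changes fan out into a few hundred trackable actions, any one of which can spawn a follow-up incident when it fails.

```python
from dataclasses import dataclass, field

# Illustrative only: real outage runbooks and incident-tracking tools differ.
@dataclass
class Action:
    description: str
    succeeded: bool = True

@dataclass
class ChangeRecord:
    name: str
    actions: list = field(default_factory=list)

def open_incidents(changes):
    """One follow-up item per failed action, each tracked to closure
    before the next outage."""
    return [f"INCIDENT: {c.name}: {a.description}"
            for c in changes
            for a in c.actions
            if not a.succeeded]

# A dozen changes, each expanding into many small actions (take this down,
# verify that, restart this), quickly adds up to several hundred plan lines.
changes = [ChangeRecord(name=f"change-{n}",
                        actions=[Action(f"step {s} of change-{n}")
                                 for s in range(1, 21)])
           for n in range(1, 13)]
changes[2].actions[5].succeeded = False   # pretend one step broke

print(sum(len(c.actions) for c in changes), "actions in support of", len(changes), "changes")
print(open_incidents(changes))
```

The point of the exercise is not the data structure but the discipline: every broken action becomes a visible item that someone owns.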
The final step is the review meeting, held three days after the outage. Open incidents are discussed and due dates assigned, and they are carried forward to subsequent planning cycles as long as necessary to ensure they are resolved. This closed loop is key to continually improving the process. Ideas for improvement are discussed and incorporated into future outage planning by the outage coordinator. The coordinator for the next outage is also assigned, allowing several people the opportunity to participate, learn new skills and lend their creative energies.