Saturday, February 28, 2009

A Four-Tiered Approach to Standards

Answering the simple question of what your standard is for a particular product, naming convention or password strength is often more involved than just a simple answer. The approach I use to set and communicate standards is a four-tier approach using Preferred, Standard, Non-Standard and Exception as categories.

Preferred is simply the product that I would like to use above all others. This could be for strategic reasons, advantageous licensing, skill base or any number of factors that make it rise above the rest. There is typically only one product assigned the Preferred tag.

Standard is the category for any remaining products for which we offer internal support. Anything in the Preferred or Standard categories can be expected to be fully supported with multiple, skilled resources available and with defined Service Level Agreements. Anything rated below these two categories is a warning that issues will need to be overcome before using that product.

Non-standard is usually the catch-all category, naming common products that the internal staff does not have the skill base to support. The importance of this category is to inform decision-makers that additional costs need to be budgeted and that their support team will need to contract with others for support. Making this a dollars discussion usually drives the decision towards a Standard.

Exception means that I do not have the authority to allow this inside my company and that the decision-maker will need to have a discussion with someone higher up the organization chart. I’ve found Exception is better than No. No rarely works until that higher-up says No anyway. Exception says let me explain the situation, agree to disagree and let you know how you can press your case and the obstacles you may find. It turns an adversarial conversation into a useful, and professional, conversation.

Let’s make this more real with a totally fictional database example.

  • Preferred – Oracle – Company owns a site license, maintains a highly reliable database farm and all the staff is certified.
  • Standard – Microsoft SQLServer on Windows and IBM DB2 on AIX – Company has purchased a number of applications that did not offer Oracle as an option and at least three DBA’s are skilled in both.
  • Non-Standard – All others not specifically listed, including but not limited to Informix, MySQL and IBM DB2 on Windows. Use as an embedded database requires 100% vendor support.
  • Exception – Any mainframe database, since that platform is being decommissioned.
These standards can become quite involved and should explain as much of the “it depends” as possible. But there can be inter-dependencies between standards or special circumstances that make absolute statements impossible to craft. For example, maybe the vendor just started offering Oracle support but has a long track record on SQLServer. Judgments on these types of cases will be necessary.
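
As a rough illustration, the fictional database standard above could be captured as a small lookup. This is only a sketch: the `Tier`, `CATALOG`, `MAINFRAME_PLATFORMS` and `classify` names are hypothetical, the platform strings are assumptions, and a real standard would carry the "it depends" notes alongside each entry.

```python
from enum import Enum


class Tier(Enum):
    PREFERRED = "Preferred"
    STANDARD = "Standard"
    NON_STANDARD = "Non-Standard"
    EXCEPTION = "Exception"


# Explicit entries from the fictional example; keys are (product, platform).
# A platform of None means "any platform".
CATALOG = {
    ("oracle", None): Tier.PREFERRED,
    ("sqlserver", "windows"): Tier.STANDARD,
    ("db2", "aix"): Tier.STANDARD,
}

# Assumption: how a mainframe platform is named is purely illustrative.
MAINFRAME_PLATFORMS = {"mainframe"}


def classify(product, platform):
    """Return the tier for a database product on a given platform."""
    product, platform = product.lower(), platform.lower()
    # Any mainframe database is an Exception, whatever the product.
    if platform in MAINFRAME_PLATFORMS:
        return Tier.EXCEPTION
    # Try an exact product/platform match first, then a product-wide entry.
    for key in ((product, platform), (product, None)):
        if key in CATALOG:
            return CATALOG[key]
    # Everything not specifically listed falls into the catch-all tier.
    return Tier.NON_STANDARD
```

With these entries, `classify("DB2", "AIX")` returns `Tier.STANDARD` while `classify("DB2", "Windows")` returns `Tier.NON_STANDARD`, matching the list above. Judgment calls, such as a vendor that only recently added Oracle support, still sit outside a lookup like this.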

Also recognize that your standards will change over time and will need to be communicated as they are modified. A good time to do that is a month or so before you get involved in the budget cycle. Maybe you’ll need some additional training dollars for a product you want to make Standard. Perhaps some outside contractor resources will be needed to cover a product demoted to Non-Standard. Or maybe a capital outlay for a new site license for that Preferred standard that is really taking off.

Saturday, February 21, 2009

Eliminate, Automate and Delegate

Sometimes just having a simple methodology for approaching your work helps provide focus and achieve better results. One of the approaches I commonly use is Eliminate, Automate, Delegate, applied in that order. This is by no means rocket science, but I commonly run across efforts that failed to deliver the best results and could have benefited from this approach.

Eliminate is by far the best result that can be obtained. Start by brainstorming ideas to completely eliminate the need for whatever you’re looking to improve. If it can’t be completely done away with, can at least some portion of it go away? Twenty years ago I was involved in a project to access email via the telephone. The original approach was proving daunting and threatening to kill the project. A group got together and looked for a way to resolve the problem. The solution involved moving from a full-screen to a line-mode interface, which eliminated seventy-five percent of the coding effort and made the service much more reliable. Considering an option that looked like going backwards (line-mode was so 1970’s) proved to be the key. The prototype was available a few days later.

Automate is replacing human effort with a non-human effort. Job scheduling is a common data center example and robots welding cars applies in the manufacturing world. How many web sites do you check out each day for information? Perhaps moving to an RSS reader, a form of automation that pulls in articles of interest, is a more efficient way to gather that information. Alerting is a common output of automation, only interrupting you when necessary. In this case, you’ve both automated the task and eliminated the need to check it out as often.

Delegate is taking the work done by a higher-paid person and shifting it to a lower-paid, but still qualified, person. Too often professionals spend a large amount of their time doing work at a grade level far below what they are paid. In some cases a person’s desire to perform a lower-valued task comes from their pride in building a solution from the start and it’s “their baby”. Whatever the reason, good or bad, spending too much time performing lower-valued work will limit your time for new projects, ending with no more new “babies” to take pride in. Delegate takes a commitment to training, letting people make those mistakes (after all, you made yours along the way) and encouraging them. Think hard before you accept an “I’ll just do it myself” attitude.

Things you eliminate can no longer go wrong or waste money. Tasks that you automate usually cost a fraction of what a human would, and the automation never gets tired or bored. Delegation creates valuable time for the highly skilled and develops new skills in others. But the key is to approach them in the proper order and get enough ideas generated to fully explore an opportunity.

Wednesday, November 26, 2008

Splendid Isolation

Often the lessons of the past shed light on how to solve the problems of the present. As the title of this post hopefully suggests, the early days of computing were spent in splendid isolation. We didn't connect to each other except through a few shared business applications. There were no PCs, no LANs, no emails and no Internet. And our world was very secure as a result.

The beginning of the client/server era was ushered in with the PC, with communications generally limited to low-speed modems and perhaps a coax connection to maintain compatibility with the mainframe. Data was shared using 5 1/4" floppy disks capable of less than a megabyte of information. But the beginning of the end of isolation had arrived.

The introduction of the aforementioned LANs, emails and Internet ended our splendid isolation and provided the platform for spam, viruses of all types and rootkits, just to name a few, to flourish. Ever since, we've been piling on solutions to combat these ills, but rarely striking blows in the proper direction: a return to splendid isolation, architected from the ground up.

In 2003 the MS-Blaster virus infected a large number of PCs, including my work computer, resulting in a few days of late nights, manic patching and shutting off connections to the outside world. Quite the opposite happened with my home computer, which was never infected. After restoring service for the company, I reflected on how my professionally managed PC, equipped with all the latest anti-virus software, up-to-date patches and protected by world-class Internet firewalls, got infected but my home computer, equipped with free anti-virus software and a $50 firewall, breezed right through. Was it the luck of the draw?

The answer was the splendid isolation my home computer existed in. MS-Blaster spread from PC to PC across open LAN connections. My home computer simply was all alone. In my work environment there is no need for my PC to communicate directly with any other PC; all sharing is performed via servers and applications in the data center. Yet the LAN is open for no particularly good reason; it just was always done that way. So we can learn from the Internet, where trust is not granted, and extend that same model to our internal networks. Just make it impossible for PCs to talk to each other. Return them to a greater degree of isolation and greatly reduce the chances that an MS-Blaster can spread.

Since the MS-Blaster incident, my home network has grown and now includes a few laptops and a couple desktops. Learning the lesson from the past, I implemented AP (Access Point) Isolation on my wireless router/firewall, which allows Internet access only, once again returning each of my PCs to their own islands of isolation.

Another symptom of lost isolation presents itself with software vendors requiring their own servers to support their products even when their product requires only a portion of the server's computing power. Dating back to the early days of Windows, applications could easily interfere with each other and the operating system lacked the controls to prioritize workload, isolate memory and ensure system stability. So it's not surprising that virtualization solutions, for example VMWare and Xen, have become very popular in the last couple of years. While vendors initially resisted supporting their applications in a virtual machine, market realities changed their minds. Emerging cloud computing infrastructures have virtualization at their core as they strive to serve multiple customers on common hardware. While this form of isolation, both from multiple applications and a variety of server hardware, does not solve the problem at its source, it does provide service and risk benefits beyond a modest lowering of the platform cost.

VMWare appliances are an innovative method to deliver software demonstrations or a complete Linux distribution that you can run on a Windows desktop. A secure environment could be developed to enable a corporate desktop to run as a virtual appliance on an employee's home computer, eliminating the need to lug a laptop back and forth to work every day. The benefits of isolation are just beginning to be explored.

Making isolation, in all its various forms, a key technology strategy will lead to lower costs, improved service levels and some welcome relief for your security department.

Tuesday, November 11, 2008

Planning an Effective Outage

The early days of computing involved running batch job schedules and online systems during normal business hours, leaving plenty of time for application and infrastructure personnel to make changes, generally with each person doing their own thing during their own time slot. Over time the number and duration of open slots diminished, and with the advent of ERP and customer-facing web sites requiring near-continuous availability, these slots have typically been reduced to a few hours once a month, once a quarter or even once a year. Planning and executing an effective outage requires developing a new set of project skills and a broader view of the interrelationships between the various application and infrastructure components. The purpose of this Technology Viewpoint is to offer an overall structure, one that changed a broken process into a highly effective planning tool.

The process can be broken down into the following seven components, which I'll describe in some detail and which I used to run a four-hour, once-a-month outage.

  • Yearly planning
  • Change requests
  • First group meeting
  • Detailed planning meeting
  • Second group meeting
  • The outage
  • Review meeting
The yearly planning process brings together the key people that are aware of major events that will occur on a predetermined date (e.g. an SAP go-live) or must be accomplished in a certain timeframe (e.g. replace a disk subsystem in Q4). The team discusses, compromises on a schedule if necessary, and broadly publishes a calendar to inform all team members which outage will contain these major items. While this is not a large effort, the results are critical to avoiding additional, unscheduled outages that otherwise would not have been required.

The second step can begin at any time, but clearly ends with detailed change requests being filed and approved no later than twelve days prior to the outage. This gives the outage coordinator time to prepare materials for the first group meeting. It's encouraged that all change requests be entered as early as possible, even though the details might be incomplete at first. This gives the outage coordinator an early warning sign if an unusual number of changes or widely impacting changes are being requested. Each request is pegged to a specific outage, making the collection of requested changes a "release".

The first group meeting takes place eleven days prior to the outage and includes the outage coordinator and one person responsible for discussing the details of each change request, preferably the person that will actually be doing the change. This meeting lasts no more than one hour and several key pieces of information are collected for each change, including:

  • The business or technical priority of the change
  • The riskiness of making the change
  • The amount of time needed to make the change
  • The amount of time needed to back-off the change
  • What resources must be available to make the change
  • What resources must be unavailable while making the change
Immediately following the first group meeting a small group, typically four to six people, meets in the detailed planning meeting to determine the sequence of actions (e.g. shut down system A) and changes within the outage. Since there is typically a large number of changes, and a much larger list of actions to be performed to support the changes, this team focuses its creative energy on these priorities:

  • Make higher priority changes as early in the outage as possible to ensure completion
  • Make riskier changes as early in the outage as possible to ensure sufficient back-off time
  • Run parallel streams of changes where they do not have conflicting resource requirements
Each change is listed on a large colored "sticky" note and arranged on a white board that has vertical timelines drawn on it. By arranging each "row" to signify parallel change threads, aligning the notes in time sequence, and then drawing lines to show the relationships between changes, a picture of the outage is quickly envisioned by all members of the team.
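
The sticky-note exercise can be approximated in a few lines of code. The following is only a sketch under simplifying assumptions, not the process the team actually used: the `Change` fields and the `plan` function are hypothetical names, priority and risk are reduced to integers, back-off time is ignored, and the "must be available" and "must be unavailable" resource lists are collapsed into a single set of resources a change ties up. It places the highest priority and riskiest changes first and lets changes run in parallel when they share no resources.

```python
from dataclasses import dataclass


@dataclass
class Change:
    name: str
    priority: int          # higher = more important (assumed scale)
    risk: int              # higher = riskier (assumed scale)
    minutes: int           # time needed to make the change
    resources: frozenset   # resources the change ties up while it runs


def plan(changes, window_minutes):
    """Greedy placement: most important, then riskiest, changes go first."""
    scheduled = []   # list of (start, end, change)
    deferred = []    # changes that do not fit the allotted time
    ordered = sorted(changes, key=lambda c: (-c.priority, -c.risk))
    for change in ordered:
        # Candidate start times: the outage start or the end of a placed change.
        candidates = sorted({0} | {end for _, end, _ in scheduled})
        for start in candidates:
            end = start + change.minutes
            if end > window_minutes:
                continue
            # A clash is an overlapping change that needs a shared resource.
            clash = any(
                s < end and start < e and change.resources & other.resources
                for s, e, other in scheduled
            )
            if not clash:
                scheduled.append((start, end, change))
                break
        else:
            deferred.append(change)
    return scheduled, deferred
```

Each distinct start time in `scheduled` corresponds roughly to a column on the white board, and changes that overlap in time are the parallel rows. Anything left in `deferred` is what would be taken to IT management for a decision on extra outage time or postponement.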

At the completion of the detailed planning meeting there may be changes that will not fit the allotted time. These are discussed with IT management, and if necessary, business leadership, to determine if additional outage time will be permitted or the changes postponed until the next available opportunity. After gaining agreement, a project plan is crafted and published to the group for review and updating.

The second group planning meeting is held four days prior to the outage, where the project plan is reviewed line by line and any last-minute changes are discussed and approved. If everyone has done a good job up to this point, this meeting goes very smoothly and everyone has a clear understanding of the changes and the scheduling.

The outage typically begins two hours prior to the first production application shutdown, during which non-production systems are shut down and the readiness for the production outage is verified. While I wish I could say that every outage goes exactly according to plan, they simply don't, but being prepared allows the outage coordinator to make on-the-fly adjustments to ensure the most critical changes are successfully completed and as many of the other planned changes as possible are accomplished.

One very key, and sometimes uncomfortable, requirement during the outage is that everything that goes wrong is documented for follow-up. An incident is opened for each item and tracked until all items are resolved and closed. The greatest value was gained in fixing the actions performed, not in the changes themselves. An average outage will have several hundred actions (e.g. take down this, verify that, restart this), in support of only a dozen changes. Resolving all the broken actions leads to a smoother outage, more predictable outcomes and services being restored to their proper working order.

The final step is the review meeting, held three days after the outage. Open incidents are discussed and due dates assigned, which are carried forward to the next planning cycle as long as necessary to ensure they are resolved. This closed-loop process is key to continually improving the process. Ideas for improvement are discussed and incorporated into future outage planning by the outage coordinator. The coordinator for the next outage is assigned, allowing several people the opportunity to participate, learn new skills and lend their creative energies.
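
For anyone wanting to publish the calendar for a given outage, the day offsets described above can be turned into dates with a short script. This is a minimal sketch: the `OFFSETS` table and the `milestones` function are hypothetical names, the offsets are the ones stated in this post, and yearly planning is left out because it has no fixed offset.

```python
from datetime import date, timedelta

# Day offsets relative to the outage date, as described in the post.
OFFSETS = {
    "change requests approved by": -12,
    "first group meeting": -11,
    "detailed planning meeting": -11,   # immediately follows the first group meeting
    "second group meeting": -4,
    "outage": 0,
    "review meeting": 3,
}


def milestones(outage_day):
    """Map each milestone name to its calendar date for one outage."""
    return {
        name: outage_day + timedelta(days=offset)
        for name, offset in OFFSETS.items()
    }
```

Calling `milestones(date(2009, 3, 14))`, for example, puts the change request deadline on March 2, the first group meeting on March 3, the second group meeting on March 10 and the review meeting on March 17.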