Sunday, April 5, 2009
Applying Data Center Recovery Principles to PCs
Large companies installed mainframes in the 1960s to automate back office processes, gain efficiencies and enable new and larger business models. It soon became apparent that the data center needed to be well protected. Guards, keypass entry cards, UPS power and fire suppression systems became the norm. But investments to keep the data center safe were not enough and the disaster recovery business was created to allow a company, at a fraction of the cost of running a duplicate data center, to recover their critical applications if the primary data center was unusable. You could never spend enough money to reduce the risk of losing the primary data center to zero.
Perhaps applying the same principles to recovering access to these same applications in the event of a massive virus outbreak, power blackout or communications failure would provide a cost-effective solution. These applications might also be the same ones that employees need while working remotely.
There are many ways to architect and design a solution, but the least common denominator in today's world is the web browser. If you're fortunate enough to have all your applications web-enabled, then you have a huge head start. Perhaps a dual-boot option on your corporate PCs with a Linux/Firefox option is enough to get your users productive again. Another strategy would be to have employees use their home computers, almost ubiquitous now, as their backup device. A final option would be to re-stage each PC, although this may take more time to accomplish than the business might be able to tolerate.
For those not fortunate enough to be fully web-enabled, which includes most of us, a solution to access those applications needs to be available, but not require a huge investment in hardware and software. The advent of pay-as-you-go Cloud Computing and more robust Open Source software comes to the rescue. The idea is to build a ready-to-go desktop image in the Cloud (e.g. Amazon Web Services) using Linux, Firefox, native Linux applications and Windows applications under the Wine environment. This image would have the necessary VPN connectivity to your data center to access the back-end services. Each user would spin up a copy of the image, with proper authentication of course, and be back in business in minutes. Or perhaps leveraging open source virtualization software can allow multiple people to use one Cloud server concurrently.
This image might also be used for home or hotel access, and potentially avoid the extra costs of providing laptops by leveraging personal and hotel business-center PCs. A copy of this image that provides isolated access during your disaster recovery testing can significantly reduce that network effort. These are just a few of the possible uses for a solution architected in this manner.
Monday, March 23, 2009
Who Has A Bigger Problem
Odds are someone has a bigger problem than you do, and looking outside your normal field of vision is often needed. Looking to reduce the cost of your email system? Try an organization such as a university, another large not-for-profit corporation or a particularly financially distressed company for ideas. Want to improve the integrity of your data center? Can you find a company where revenue stops dead when they're down? How about the stock market? If they make the Wall Street Journal headlines when problems occur, that's a good place to start.
Money, or lack thereof, is a good place to start hunting. Following the money trail is also useful. Are you looking for management support for a new security idea and the CIO isn't very receptive? Who else in your organization is rewarded when incidents are reduced or eliminated? With the advent of SOX compliance putting more people on the chopping block for deficiencies, perhaps Audit, Compliance or the CFO has that bigger problem.
Sales people are rarely a good source, simply because they sell a solution to your problem, not to their problem. Their problem is making their quota, and rightfully so. Their interests may align with yours at times, but in most cases they sell stuff, not solutions, and their reward is not directly tied to your problem being solved, but to getting the contract signed. Sales people can still be a good conduit for making connections into the companies you want to investigate. A better way is to join and be active in one or two large user groups or leverage a company subscription to The Hackett Group, Gartner Group or other advisory and benchmarking firms. Using LinkedIn, Plaxo or other web-based social networks can also lead to making the right contact.
The idea is not just to implement everything another company does, but to generate new insights into your specific problem. Perhaps one or two components of their solution are enough to satisfy your current needs. Stretching your mind in the direction of the problem, not an immediate solution, may be the key to your next "breakthrough".
Tuesday, March 10, 2009
Integration Architecture
- Batch - A set of requests will be processed together at regular intervals. This is typically done to process requests more efficiently or during a window when resources are more available. This is a time-based mechanism.
- Asynchronous - A response to a request is needed as soon as possible. The data will be delivered when the other side is ready and will never be lost. This is an event-based trigger mechanism.
- Real-Time - A response to a request is required in real-time. In case the request is not fulfilled within a reasonable amount of time, the data will be discarded.
Batch interfaces have been around since the advent of mainframes and punch cards. A single method to handle batch data exchange should be used throughout the data center to simplify all operational aspects from security to recovery. One folder structure could be developed for all Production data and a second for Test data. These folders should be mountable to all systems, so multiple protocols (e.g. NFS, SMB) may be required. File system renaming is a useful practice to keep in-flight work from causing issues. As an example, perhaps we have a folder named Payroll. Within the Payroll folder, we create Inbound, Ready, InProcess, Processed and Error folders. A file being created is written to the Inbound folder and, when complete, is renamed to move it into the Ready folder. A batch job runs, looks in the Ready folder, renames the file to move it into the InProcess folder, and renames it again to the Processed folder upon successful completion or to the Error folder if unsuccessful. This simple example may not be enough for your requirements, so expand the concept with as many folders and subfolders as needed.
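The Payroll folder flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the file names and the toy `int` processing step are all hypothetical, and a rename is only atomic when the source and target folders live on the same file system.

```python
import tempfile
from pathlib import Path

STAGES = ("Inbound", "Ready", "InProcess", "Processed", "Error")

def make_stage_folders(root: Path) -> None:
    """Create the five stage folders under one application folder."""
    for stage in STAGES:
        (root / stage).mkdir(parents=True, exist_ok=True)

def publish(root: Path, name: str, data: str) -> None:
    """Write a file in Inbound, then rename it into Ready once complete."""
    inbound = root / "Inbound" / name
    inbound.write_text(data)
    inbound.rename(root / "Ready" / name)

def run_batch(root: Path, process) -> dict:
    """Claim each Ready file by renaming it, then file it by outcome."""
    outcome = {}
    for ready in sorted((root / "Ready").iterdir()):
        in_process = root / "InProcess" / ready.name
        ready.rename(in_process)  # in-flight work is no longer in Ready
        try:
            process(in_process.read_text())
            dest = "Processed"
        except Exception:
            dest = "Error"
        in_process.rename(root / dest / ready.name)
        outcome[ready.name] = dest
    return outcome

# Usage: one file parses cleanly, the other lands in Error.
root = Path(tempfile.mkdtemp()) / "Payroll"
make_stage_folders(root)
publish(root, "good.csv", "100")
publish(root, "bad.csv", "oops")
result = run_batch(root, int)  # int("oops") raises, so bad.csv fails
```

Because a file only appears in Ready after it is completely written, the batch job never sees a half-finished file, and anything left in InProcess after a crash is easy to spot.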
Asynchronous interfaces are typically built upon a message queuing infrastructure using products such as Microsoft's MSMQ or IBM's MQSeries. Like Batch, you should establish only one method for this type of exchange. Asynchronous interfaces are typically used to process data that needs very quick turnaround, but the data can't be lost if the target system is unavailable.
Individual transactions can fail for a number of reasons, so special attention needs to be paid to building a notification system that makes the needed data available for quick action to resolve the error. One approach is to make a copy of any transaction that fails and put it into a message queue where a program will read that data, send the appropriate notifications and make the data and error code available to the person performing the troubleshooting. A method of "replaying" the transaction by putting it back on its original queue is useful to avoid manually re-entering each failed transaction.
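A minimal sketch of the failed-transaction queue and replay idea follows, using Python's in-memory `queue.Queue` as a stand-in for a real product such as MSMQ or MQSeries. The `handle` rule, the transaction fields and the inline `notify` callback are assumptions made for illustration; in a real deployment a separate program would read the failure queue and send the notifications.

```python
import queue

def handle(txn):
    """Hypothetical business rule: reject non-positive amounts."""
    if txn["amount"] <= 0:
        raise ValueError("amount must be positive")
    return txn["amount"]

def drain(work_q, failed_q, notify):
    """Process every queued transaction; copy each failure to failed_q."""
    done = []
    while not work_q.empty():
        txn = work_q.get()
        try:
            done.append(handle(txn))
        except Exception as exc:
            # Keep a copy of the data plus the error for troubleshooting.
            failure = {"txn": dict(txn), "error": str(exc)}
            failed_q.put(failure)
            notify(failure)
    return done

def replay(failed_q, work_q):
    """Put each failed transaction back on its original queue."""
    count = 0
    while not failed_q.empty():
        work_q.put(failed_q.get()["txn"])
        count += 1
    return count

# Usage: one good transaction, one that fails and is later replayed.
work_q, failed_q, alerts = queue.Queue(), queue.Queue(), []
work_q.put({"id": 1, "amount": 5})
work_q.put({"id": 2, "amount": -1})
done = drain(work_q, failed_q, alerts.append)
replayed = replay(failed_q, work_q)
```

Replaying only makes sense once the underlying cause has been fixed; otherwise the transaction simply fails again and returns to the failure queue.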
Real-time interfaces come in lots of shapes and sizes and will vary from one vendor to the next. Standardization is coming slowly with the adoption of web standards, so you're likely stuck with supporting a variety of proprietary and open standards, fortunately in most cases with the aid of the vendor who knows their own choices well. Still, it will cause a great deal of operational support issues and these interfaces will tend to be your most critical. Put together a team to attack these issues before they become a business issue.
Your standards can be drawn using a generic Source System and Target System on either side of a diagram, with your Real-Time, Asynchronous and Batch solutions connecting the two Systems. Each of the three paths describes the specific hardware and software that is standard in your environment. The simple diagram can then be expanded with the specific details for a particular set of Systems.
A small number of highly reusable components will speed the delivery of your interfaces at a greatly reduced cost and, most importantly, with the fewest operational issues and the least business impact.
Saturday, February 28, 2009
A Four-Tiered Approach to Standards
Preferred is simply the product that I would like to use above all others. This could be for strategic reasons, advantageous licensing, skill base or any number of factors that make it rise above the rest. There is typically only one product assigned the Preferred tag.
Standard is the category for any remaining products for which we offer internal support. Anything in the Preferred or Standard categories can be expected to be fully supported with multiple, skilled resources available and with defined Service Level Agreements. Anything rated below these two categories is a warning that issues will need to be overcome before using those products.
Non-standard is usually the catch-all category, naming common products that the internal staff does not have the skill base to support. The importance of this category is to inform decision-makers that additional costs need to be budgeted and that their support team will need to contract with others for support. Making this a dollars discussion usually drives the decision towards a Standard.
Exception means that I do not have the authority to allow this inside my company and that the decision-maker will need to have a discussion with someone higher up the organization chart. I’ve found Exception is better than No. No rarely works until that higher-up says No anyway. Exception says let me explain the situation, agree to disagree and let you know how you can press your case and the obstacles you may find. It turns an adversarial conversation into a useful, and professional, conversation.
Let’s make this more real with a totally fictional database example.
- Preferred – Oracle – Company owns a site license, maintains a highly reliable database farm and all the staff is certified.
- Standard – Microsoft SQLServer on Windows and IBM DB2 on AIX – Company has purchased a number of applications that did not offer Oracle as an option and at least three DBAs are skilled in both.
- Non-Standard – All others not specifically listed including but not limited to Informix, MySQL and IBM DB2 on Windows. Use as an embedded database requires 100% vendor support.
- Exception – Any mainframe database, since that platform is being decommissioned.
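The fictional database example can also be captured as a small lookup, which is one way to keep a standards list queryable. This is a sketch only: the `classify` function, the `on_mainframe` flag and the guidance wording are hypothetical, and the catalog entries simply mirror the example above.

```python
from enum import Enum

class Tier(Enum):
    PREFERRED = "Preferred"
    STANDARD = "Standard"
    NON_STANDARD = "Non-Standard"
    EXCEPTION = "Exception"

# Explicitly listed products; everything else falls through to Non-Standard.
CATALOG = {
    "Oracle": Tier.PREFERRED,
    "Microsoft SQLServer on Windows": Tier.STANDARD,
    "IBM DB2 on AIX": Tier.STANDARD,
}

GUIDANCE = {
    Tier.PREFERRED: "Fully supported; first choice.",
    Tier.STANDARD: "Fully supported with defined Service Level Agreements.",
    Tier.NON_STANDARD: "Budget for outside support; no internal skill base.",
    Tier.EXCEPTION: "Needs a decision from higher up the organization chart.",
}

def classify(product: str, on_mainframe: bool = False) -> Tier:
    """Return the tier for a database product in the fictional example."""
    if on_mainframe:
        return Tier.EXCEPTION  # the mainframe platform is being decommissioned
    return CATALOG.get(product, Tier.NON_STANDARD)

# Usage
tier = classify("MySQL")
advice = GUIDANCE[tier]
```

Defaulting unknown products to Non-Standard matches the catch-all role of that category: nothing is silently treated as supported.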
Also recognize that your standards will change over time and will need to be communicated as they are modified. A good time to do that is a month or so before you get involved in the budget cycle. Maybe you’ll need some additional training dollars for a product you want to make Standard. Perhaps some outside contractor resources will be needed to cover a product demoted to Non-Standard. Or maybe a capital outlay for a new site license for that Preferred standard that is really taking off.
Saturday, February 21, 2009
Eliminate, Automate and Delegate
Eliminate is by far the best result that can be obtained. Start by brainstorming ideas to completely eliminate the need for whatever you’re looking to improve. If it can’t be completely done away with, can at least some portion of it go away? Twenty years ago I was involved in a project to access email via the telephone. The original approach was proving daunting and threatening to kill the project. A group got together and looked for a way to resolve the problem. The solution involved moving from a full-screen to a line-mode interface, which eliminated seventy-five percent of the coding effort and made the service much more reliable. Considering an option that looked like going backwards (line-mode was so 1970’s) proved to be the key. The prototype was available a few days later.
Automate is replacing human effort with a non-human effort. Job scheduling is a common data center example and robots welding cars applies in the manufacturing world. How many web sites do you check out each day for information? Perhaps moving to an RSS reader, a form of automation that pulls in articles of interest, is a more efficient way to gather that information. Alerting is a common output of automation, only interrupting you when necessary. In this case, you’ve both automated the task and eliminated the need to check it out as often.
Delegate is taking the work done by a higher-paid person and shifting it to a lower-paid, but still qualified, person. Too often professionals spend a large amount of their time doing work at a grade level far below what they are paid. In some cases a person’s desire to perform a lower-valued task comes from their pride in building a solution from the start and it’s “their baby”. Whatever the reason, good or bad, spending too much time performing lower-valued work will limit your time for new projects, ending with no more new “babies” to take pride in. Delegate takes a commitment to training, letting people make those mistakes (after all, you made yours along the way) and encouraging them. Think hard before you accept an “I’ll just do it myself” attitude.
Things you eliminate can no longer go wrong or waste money. Tasks that you automate usually cost a fraction of what a human would, and the automation never gets tired or bored. Delegation creates valuable time for the highly skilled and develops new skills in others. But the key is to approach them in the proper order and get enough ideas generated to fully explore an opportunity.
Wednesday, November 26, 2008
Splendid Isolation
The beginning of the client/server era was ushered in with the PC, with communications generally limited to low-speed modems and perhaps a coax connection to maintain compatibility with the mainframe. Data was shared using 5 1/4" floppy disks capable of less than a megabyte of information. But the beginning of the end of isolation had arrived.
The introduction of the aforementioned LANs, email and the Internet ended our splendid isolation and provided the platform for spam, viruses of all types and rootkits, just to name a few, to flourish. Ever since, we've been piling on solutions to combat these ills, but rarely striking blows in the proper direction: returning to splendid isolation, architected from the ground up.
In 2003 the MS-Blaster worm infected a large number of PCs, including my work computer, resulting in a few days of late nights, manic patching and shutting off connections to the outside world. Quite the opposite happened with my home computer, which was never infected. After restoring service for the company, I reflected on how my professionally managed PC, equipped with all the latest anti-virus software, up-to-date patches and protected by world-class Internet firewalls, got infected, but my home computer, equipped with free anti-virus software and a $50 firewall, breezed right through. Was it the luck of the draw?
The answer was the splendid isolation my home computer existed in. MS-Blaster spread from PC to PC across open LAN connections. My home computer was simply all alone. In my work environment there is no need for my PC to communicate directly with any other PC; all sharing is performed via servers and applications in the data center. Yet the LAN is open for no particularly good reason; it was just always done that way. So we can learn from the Internet, where trust is not granted, and extend that same model to our internal networks. Just make it impossible for PCs to talk to each other. Return them to a greater degree of isolation and greatly reduce the chances that an MS-Blaster can spread.
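The rule "PCs talk to data center servers, never to each other" can be expressed as a small default-deny check. The sketch below covers only the internal LAN, and the two subnets are made-up assumptions for illustration; a real network would enforce this in switches or firewalls rather than in application code.

```python
import ipaddress

# Hypothetical address plan: both subnets are assumptions for illustration.
PC_NET = ipaddress.ip_network("10.1.0.0/16")
SERVER_NET = ipaddress.ip_network("10.2.0.0/16")

def allowed(src: str, dst: str) -> bool:
    """Permit only PC-to-server and server-to-PC traffic; deny the rest."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    pc_to_server = s in PC_NET and d in SERVER_NET
    server_to_pc = s in SERVER_NET and d in PC_NET
    return pc_to_server or server_to_pc

# Usage
pc_to_pc = allowed("10.1.0.5", "10.1.0.6")      # the path a LAN worm needs
pc_to_server = allowed("10.1.0.5", "10.2.0.10")
```

Starting from "deny" and listing the few paths that are actually needed mirrors how trust is handled on the Internet side of the firewall.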
Since the MS-Blaster incident, my home network has grown and now includes a few laptops and a couple desktops. Learning the lesson from the past, I implemented AP (Access Point) Isolation on my wireless router/firewall, which allows Internet access only, once again returning each of my PCs to their own islands of isolation.
Another symptom of lost isolation presents itself with software vendors requiring their own servers to support their products even when their product requires only a portion of the server's computing power. Dating back to the early days of Windows, applications could easily interfere with each other and the operating system lacked the controls to prioritize workload, isolate memory and ensure system stability. So it's not surprising that virtualization solutions, for example VMware and Xen, have become very popular in the last couple of years. While vendors initially resisted supporting their applications in a virtual machine, market realities changed their minds. Emerging cloud computing infrastructures have virtualization at their core as they strive to serve multiple customers on common hardware. While this form of isolation, both from multiple applications and a variety of server hardware, does not solve the problem at its source, it does provide service and risk benefits beyond a modest lowering of the platform cost.
VMware appliances are an innovative method to deliver software demonstrations or a complete Linux distribution that you can run on a Windows desktop. A secure environment could be developed to enable a corporate desktop to run as a virtual appliance on an employee's home computer, eliminating the need to lug a laptop back and forth to work every day. The benefits of isolation are just beginning to be explored.
Making isolation, in all its various forms, a key technology strategy will lead to lower costs, improved service levels and some welcome relief for your security department.
Tuesday, November 11, 2008
Planning an Effective Outage
The process can be broken down into the following seven components, which I'll describe in some detail and which I have used to run a four-hour, once-a-month outage.
- Yearly planning
- Change requests
- First group meeting
- Detailed planning meeting
- Second group meeting
- The outage
- Review meeting
The second step can begin at any time, but clearly ends with detailed change requests being filed and approved no later than twelve days prior to the outage. This gives the outage coordinator time to prepare materials for the first group meeting. It's encouraged that all change requests be entered as early as possible, even though the details might be incomplete at first. This gives the outage coordinator an early warning sign if an unusual number of changes or widely impacting changes are being requested. Each request is pegged to a specific outage, making the collection of requested changes a "release".
The first group meeting takes place eleven days prior to the outage and includes the outage coordinator and one person responsible for discussing the details of each change request, preferably the person that will actually be doing the change. This meeting lasts no more than one hour and several key pieces of information are collected for each change, including:
- The business or technical priority of the change
- The riskiness of making the change
- The amount of time needed to make the change
- The amount of time needed to back-off the change
- What resources must be available to make the change
- What resources must be unavailable while making the change
The detailed planning meeting then uses this information to sequence the changes within the outage window, following a few rules:
- Make higher priority changes as early in the outage as possible to ensure completion
- Make riskier changes as early in the outage as possible to ensure sufficient back-off time
- Run parallel streams of changes where they do not have conflicting resource requirements
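As a rough illustration, the sequencing rules can be turned into a simple greedy planner. Everything named here is an assumption: the `Change` fields, the numeric priority and risk scales, and the simplification that any shared resource counts as a conflict. Real planning would also need to distinguish resources that must be up from those that must be down.

```python
from dataclasses import dataclass

@dataclass
class Change:
    name: str
    priority: int   # 1 = highest business or technical priority
    risk: int       # larger number = riskier change
    minutes: int    # time needed to make the change
    backoff: int    # time needed to back off the change
    resources: frozenset = frozenset()

def plan(changes, window_minutes):
    """Order by priority, then risk; run non-conflicting changes in parallel."""
    ordered = sorted(changes, key=lambda c: (c.priority, -c.risk))
    scheduled, deferred = [], []  # scheduled holds (start, end, change)
    for c in ordered:
        # Start after every already-scheduled change that shares a resource.
        start = max((end for _, end, other in scheduled
                     if other.resources & c.resources), default=0)
        # Defer the change if it could not be made and backed off in time.
        if start + c.minutes + c.backoff > window_minutes:
            deferred.append(c.name)
            continue
        scheduled.append((start, start + c.minutes, c))
    return [(start, c.name) for start, _, c in scheduled], deferred

# Usage: a four-hour (240 minute) outage with hypothetical changes.
changes = [
    Change("A", 1, 3, 60, 30, frozenset({"db"})),
    Change("B", 1, 5, 90, 60, frozenset({"db"})),
    Change("C", 2, 1, 30, 15, frozenset({"network"})),
    Change("D", 3, 2, 120, 60, frozenset({"db"})),
]
timeline, deferred = plan(changes, 240)
# timeline == [(0, "B"), (90, "A"), (0, "C")] and deferred == ["D"]
```

The deferred list corresponds to the changes that do not fit the allotted time and need a management decision.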
At the completion of the detailed planning meeting there may be changes that will not fit the allotted time. These are discussed with IT management, and if necessary, business leadership, to determine if additional outage time will be permitted or the changes postponed until the next available opportunity. After gaining agreement, a project plan is crafted and published to the group for review and updating.
The second group planning meeting is held four days prior to the outage, where the project plan is reviewed line by line and any last-minute changes are discussed and approved. If everyone has done a good job up to this point, this meeting goes very smoothly and everyone has a clear understanding of the changes and the scheduling.
The outage typically begins two hours prior to the first production application shutdown, during which non-production systems are shut down and the readiness for the production outage is verified. While I wish I could say that every outage goes exactly according to plan, they simply don't, but being prepared allows the outage coordinator to make on-the-fly adjustments to ensure the most critical changes are successfully completed and as many of the other planned changes as possible are accomplished.
One key, and sometimes uncomfortable, requirement during the outage is that everything that goes wrong is documented for followup. An incident is opened for each item and tracked until all items are resolved and closed. The greatest value was gained in fixing the actions performed, not in the changes themselves. An average outage will have several hundred actions (e.g. take down this, verify that, restart this), in support of only a dozen changes. Resolving all the broken actions leads to a smoother outage, more predictable outcomes and services being restored to their proper working order.
The final step is the review meeting, held three days after the outage. Open incidents are discussed and due dates assigned, which are carried forward to the next planning cycle as long as necessary to ensure they are resolved. This closed-loop process is key to continually improving the process. Ideas for improvement are discussed and incorporated into future outage planning by the outage coordinator. The coordinator for the next outage is assigned, allowing several people the opportunity to participate, learn new skills and lend their creative energies.
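The day offsets mentioned in this process are easy to turn into a calendar. The sketch below assumes a hypothetical outage date and includes only the offsets stated in the text; the detailed planning meeting is left out because its timing is not given.

```python
from datetime import date, timedelta

# Day offsets relative to the outage, as stated in the text above.
OFFSETS = {
    "Change requests approved by": -12,
    "First group meeting": -11,
    "Second group meeting": -4,
    "Outage": 0,
    "Review meeting": 3,
}

def milestones(outage_day: date) -> dict:
    """Return the calendar date of each milestone for a given outage."""
    return {name: outage_day + timedelta(days=offset)
            for name, offset in OFFSETS.items()}

# Usage with a made-up outage date.
schedule = milestones(date(2009, 1, 17))
```

Publishing these dates at the start of each cycle gives everyone filing change requests a clear cutoff.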