Monday, December 27, 2010

The Perfect Problem

Problems are a fact of life, certainly if you live in the perfectly bit-oriented world of computing, where billions of bits must line up in their proper place to be processed at rates exceeding a billion times per second.  On the surface it seems an impossible task, but that's the comforting life of IT professionals, who relish keeping the computing universe in alignment twenty-four hours a day, seven days a week.  And when the inevitable problem occurs, it gets diagnosed and repaired at dizzying speed.  Then it's back to programming more billions of hopefully, but rarely, perfect bits.

But the above is only a fraction of the whole story, and in many cases it is the shortest portion, at least from the viewpoint of the user whose service was interrupted for hours or days.  Yet this fraction is typically the focus of the root cause analysis, which aims to avoid the technical problem in the future.  Hence a larger opportunity to improve service is wasted.  To get at the bigger picture, begin using the concept I call "The Perfect Problem".

What is a Perfect Problem?  This will vary somewhat depending on your service agreements, but the general idea covers all the other "stuff" surrounding the technical problem.  Simply put, a Perfect Problem is reported, dispatched, escalated, recovered and communicated in the way you designed.  It's all the operational aspects that can consume 90-99% of the actual time a service was unavailable.  It's all about asking questions, lots of questions, and getting straight answers so everyone gets better.  Some examples of the type of questions you need to ask include:

  • Reported - Did the automated monitoring tools or computer operators see the problem before users started calling the help desk?  Could they be improved to shave minutes or possibly hours from the overall duration of the outage?
  • Dispatched - Did the ticket get assigned to the correct group?  Were the right people paged?  Was the ticket picked up within the specified time frame?  Did the ticket get bounced back and forth trying to find a home?  Was all the information gathered to that point included to avoid wasting time asking for it a second or third time?
  • Escalated - Per procedure, was the problem escalated to a problem coordinator within the proper time period?  Was management made aware of high severity problems?  Were tickets opened with vendors escalated properly?
  • Recovered - After the service was fixed, did it come back up as quickly as usual, or were other steps needed?  Was all the operational start-up documentation accurate?  Could improvements to the documentation make it more clear and less error prone?  Are there ways to speed up restarting the application?  How did users know the service was available again?
  • Communications - Was a message recorded for the automated help line letting users know that you know about the problem?  Did that message get updated on schedule?  Did it include all the information that should have been there?  Were key business groups made aware of the right problems?  Was the CIO notified about problems with the critical systems she or he cares the most about?

In other words, a Perfect Problem went exactly like you planned it to go.  Nothing more, nothing less.

To capture improvements that make future problems flow properly, each identified deficiency needs to generate an appropriate improvement task with a clear description and owner.  Each task remains open, monitored, prioritized and managed until resolved, with open items regularly reviewed by management.

In my experience the longest outages are not typically caused by unusually difficult technical problems, but by deficiencies in executing the surrounding processes.  And that's totally within our control to improve upon.
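
One way to check this against your own outages is to break the ticket timeline into phases and see how much of the total was operational rather than technical.  The following is only a minimal sketch in Python: the phase names, the timestamps and the helper functions phase_minutes and operational_share are all made up for illustration and would need to be replaced with data from your own ticketing system.

```python
from datetime import datetime

def phase_minutes(events):
    """Return the minutes spent in each phase.

    `events` is a time-ordered list of (phase_name, start_time) pairs.
    Each phase lasts until the next event begins, so the final entry
    only marks the end of the outage and gets no duration of its own.
    """
    durations = {}
    for (name, start), (_, end) in zip(events, events[1:]):
        durations[name] = (end - start).total_seconds() / 60
    return durations

def operational_share(durations, technical_phases):
    """Return the fraction of the outage spent outside the technical phases."""
    total = sum(durations.values())
    technical = sum(m for name, m in durations.items() if name in technical_phases)
    return (total - technical) / total

# Hypothetical outage: every name and timestamp below is illustrative only.
events = [
    ("waiting to be reported",   datetime(2010, 12, 20, 8, 0)),
    ("waiting to be dispatched", datetime(2010, 12, 20, 8, 45)),
    ("waiting to be escalated",  datetime(2010, 12, 20, 9, 30)),
    ("diagnosis and repair",     datetime(2010, 12, 20, 11, 0)),
    ("recovery and restart",     datetime(2010, 12, 20, 11, 20)),
    ("service restored",         datetime(2010, 12, 20, 12, 0)),
]

durations = phase_minutes(events)
share = operational_share(durations, technical_phases={"diagnosis and repair"})

for name, minutes in durations.items():
    print(f"{name:26s} {minutes:5.0f} min")
print(f"Operational share of outage: {share:.0%}")
```

In this invented example the actual repair takes 20 minutes of a four-hour outage, so the phases worth reviewing first are the ones around it.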

Thursday, December 23, 2010

N-1-1


We've all heard of 9-1-1, the single number to reach emergency services which began way back in 1968.  But did you know that other countries use 1-1-2, 9-9-9 and a host of other numbers to reach various services?  For example, Brazil uses 1-9-0 to reach the police, 1-9-2 to contact medical services and 1-9-4 to find the fire department, and a handful of other sequences for more specialized services.  That would be too much for me to remember, particularly in an emergency situation.

You are also likely familiar with 4-1-1, the short-cut to reach directory services, which can add a significant amount to your monthly phone bill if you're particularly lazy or forgetful.  But did you know that free alternatives exist?  Jingle Networks' 1-800-FREE-411, Microsoft's 1-800-BING-411 and Verizon's 1-800-THE-INFO are provided free of charge, although some are advertising supported.  On the plus side, driving directions, sports, weather and other features may be available.  You won't talk to a human being, but with the recent advances in speech recognition, the computer will get your request correct more often than not.

Now to the services you may not be familiar with, mostly depending on what part of the country you live in.

2-1-1 is reserved for community services such as affordable housing, homelessness, drug and alcohol programs and suicide prevention.  Many of these 2-1-1 services are run by a local United Way agency, as they are in the Dayton, Ohio area.   They also have a web site and a toll-free number to reach agencies when outside the local 2-1-1 dialling area.  More information on their service, HelpLink, is at www.dayton-unitedway.org/help.php.

3-1-1 is the non-emergency version of 9-1-1, but is available in only a couple dozen large metropolitan areas.  Its purpose is to easily connect residents to city services and information.  Columbus is currently the only city in Ohio with 3-1-1 service, which includes a web site at 311.columbus.gov.  Examples of the many services offered include requesting a bulk trash pickup, reporting issues with street lights or pot holes, reporting an abandoned car or complaining about a barking dog.

5-1-1 gets you connected to traffic information and covers a large percentage of the United States; a coverage map can be found at www.fhwa.dot.gov/trafficinfo/511.htm.  The only service available in Ohio serves the Cincinnati/Northern Kentucky region with an effort called ARTIMIS, which stands for "The Advanced Regional Traffic Interactive Management & Information System".  More information on the services offered is located at www.artimis.org.  You can also try out the service for yourself by dialling 1-513-333-3333.

7-1-1 is the Telephone Relay Service and is provided nationwide for the Deaf and Hard of Hearing with more details provided at 711service.com.  

6-1-1 is used to contact your phone provider.  The informational web site, www.dial611.com, provides more information.  Most locations should have this service, and it should be free.

8-1-1 is the nationwide "call before you dig" number, which seeks to protect our underground infrastructure.  Its web site is www.call811.com.

A summary chart of all N-1-1 numbers is available at www.nanpa.com/number_resource_info/n11_codes.html.