Sunday, December 22, 2013

Redundancy Options

Computer systems can be architected to provide redundancy and recovery using a variety of technologies, such as Microsoft Server clustering, IBM DB2 data sharing and Cisco’s Hot Standby Router Protocol (HSRP).  The discussion can get very confusing very fast, so a while back I made up a few terms that describe what the final result ends up delivering, which is all that really matters.

The first term is Failover, which provides a rapid switch from a failing primary service to a ready-to-go secondary service.  Failover solutions result in the user experiencing an unusually long response time and possibly the failure of their current transaction, but the user is still connected and is not required to log back on.  Failover solutions, in my experience, work only about 50 percent of the time, for two reasons.  First, most Failover solutions are architected using an active-passive rather than an active-active design.  This typically means the passive side goes unused for months before it’s called to active duty, and then, for a variety of reasons, it doesn’t cleanly accept the Failover.  The second reason is the lack of a clear, hard failure.  Failover tends to work well when the primary fails hard, such as a total hardware failure.  Failover tends to work poorly when only a portion of the primary experiences problems.  Either the Failover doesn’t get initiated at all or only a portion starts to move.  In either case you don’t get the result you need.
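The hard-versus-partial failure problem can be sketched in a few lines. This is only an illustration, not how any particular clustering product works: the Node class, its component names and the heartbeat rule (the node answers as long as anything on it is still running) are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical server made up of several components (name -> healthy?)."""
    name: str
    components: dict = field(default_factory=dict)

    def responds_to_heartbeat(self) -> bool:
        # Assumed crude heartbeat: the node answers while ANY component is up.
        return any(self.components.values())

    def fully_healthy(self) -> bool:
        # True only when every component is working.
        return all(self.components.values())


def active_node(primary: Node, secondary: Node) -> Node:
    """Choose who serves traffic when the monitor watches only the heartbeat."""
    return primary if primary.responds_to_heartbeat() else secondary


primary = Node("primary", {"database": True, "network": True, "storage": True})
secondary = Node("secondary", {"database": True, "network": True, "storage": True})

# Partial failure: the database is down but the node still answers the
# heartbeat, so the Failover is never initiated.
primary.components["database"] = False
partial_choice = active_node(primary, secondary)

# Hard failure: everything is down, the heartbeat stops, the Failover happens.
primary.components = {name: False for name in primary.components}
hard_choice = active_node(primary, secondary)
```

After the partial failure the monitor still picks the broken primary; only the hard failure moves traffic to the secondary.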

The second term is Fallover, as in “you fall over and get back up”, and results in the user being disconnected from the service and having to log back in again.  For example, an SAP ERP implementation typically has several application servers, and a web application has several web servers, any of which can provide service to the user.  Which one the user gets connected to is decided at login time, but in the case of that server’s failure, the user simply logs in again and a different, working server is selected.  Fallover tends to work very well because it’s a much simpler solution than Failover and less costly.  Failover usually involves twice the expense to build a fully capable secondary.  Fallover typically involves buying just one extra server, adding perhaps 10% to the total cost.
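Part of why Fallover is so simple is that the only logic needed runs at login time. A minimal sketch, assuming a hypothetical list of application server names and an is_up check supplied by the caller (neither comes from SAP or any real product):

```python
import random


def login(servers, is_up, rng=random):
    """Pick any working server at login time; raise if none is available."""
    candidates = [server for server in servers if is_up(server)]
    if not candidates:
        raise RuntimeError("no application server available")
    return rng.choice(candidates)


# Hypothetical pool of application servers and the set currently running.
servers = ["app1", "app2", "app3"]
running = {"app1", "app2", "app3"}

session_server = login(servers, running.__contains__)

# The user's server fails. Nothing is moved or recovered; the user simply
# logs in again and lands on one of the servers that is still running.
running.discard(session_server)
new_server = login(servers, running.__contains__)
```

There is no state to hand over between servers, which is exactly why there is so little to go wrong.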

The third term is Findover, which, like Fallover, is a made-up word chosen so that the series of terms is easy to remember.  Findover solutions involve finding a secondary service that provides exactly the same thing as the primary.  A list of Domain Name System (DNS) servers provides a type of Findover.  If a PC or server can’t contact the first DNS server in the list, it tries the next one, and repeats the process until it either contacts an active server or runs out of options.  IBM Lotus Notes servers can be configured to continuously replicate data between each other, and if one goes down, the Lotus Notes PC client software will automatically find one of the other replicas.
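The DNS-style walk down a list can be sketched as follows. This is an illustration of the idea rather than how any real resolver or the Lotus Notes client is implemented; the try_contact callback and the addresses (taken from the range reserved for documentation) are assumptions.

```python
def find_service(candidates, try_contact):
    """Return the first candidate that answers, or None if the list runs out."""
    for candidate in candidates:
        try:
            if try_contact(candidate):
                return candidate
        except OSError:
            # Unreachable or timed out: move on to the next entry in the list.
            continue
    return None


# Hypothetical list of equivalent servers; the first one is down.
dns_servers = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]
down = {"192.0.2.1"}


def fake_contact(address):
    # Stand-in for a real network call, so the example runs anywhere.
    if address in down:
        raise OSError(f"{address} unreachable")
    return True


chosen = find_service(dns_servers, fake_contact)   # skips the dead first entry
nothing = find_service(["192.0.2.1"], fake_contact)  # list exhausted, no server
```

The client does all the work here; the servers only have to offer the same thing.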

Failover, Fallover and Findover.  Hopefully an easy-to-remember list of options.

And a colleague of mine made up a fourth, self-explanatory term to describe the lack of a recovery option.


Say no more.
