Monday, May 7, 2012

Linear and Binary Search



I find performance problems with applications a fascination, particularly the ones that involve really unreasonable response times.  A common cause is the lack of understanding and visibility into how searches are performed.  

In simple terms searching can be performed by starting at the beginning of whatever your searching, looking at the first item to see if that’s the one you’re looking for and proceeding until you find it.  This method is referred to as a linear search, after its namesake, the line.  For small amounts of things, a linear search doesn’t take too long, for example, if you’re looking for a particular jar in your kitchen’s spice rack.  But if you have really large number of jars, like a grocery store, you put them on the shelf in order by their name, with allspice coming before rosemary and rosemary before tarragon.  This sorted order allows for a much quicker search since you can eliminate large portions of the spice rack at one time.  You look at the middle of the shelf and find poppy seeds.  Since you’re looking for salt, you know which side of poppy seeds that must be on.  A few quick repetitions of method and bam!, you have salt.  This searching method is called a binary search, and while it takes some effort to put, and keep, the spices in sorted order, it’s well worth it for larger numbers of jars.

Computers love binary searches when dealing with millions or billions of pieces of data.  The math is pretty simple.  If I have a table of 1,000,000 numbers and perform a linear search, I have to, on average, look at 500,000 pieces of data.  A binary search needs to look at more than 20 pieces of data as it divides the data in two, figures out which side what it’s looking for is on and repeats that process.  219 is 524,288, not quite enough, and 220 is 1,048,576, which is a little bit more.  So you just find the power of 2 that’s equal to or larger than the number of data items you have and bam!, that’s your maximum number of tries.  Comparing 500,000 to 20 iterations is a no-brainer.  Now try 1,000,000,000 (a billion).  Linear takes 500,000,000.  Binary takes 30.  ‘Nuf said.

A couple of examples to illustrate how this works using a couple of typical pain points: Excel and Databases.

Excel has a nice function called VLOOKUP that allows searching a range of cells (aka a table).  Excel will do a linear search if the FALSE parameter is used or the range of cells that’s not in sorted order.  Having both sorted data and using the TRUE parameter is needed for a binary search.  This has little impact if you’re dealing with small sets of data and a few VLOOKUPs.  Excel searches 100,000,000 cells per second on my laptop.  That’s a lot, unless you’re searching 10,000,000,000 (ten billion) cells, in which case it takes 100 seconds.  But using a binary searches would take less than ¼ of a second, a 400 times improvement.  An excellent writeup on how to code VLOOKUPs using the TRUE parameter can be found at:

http://fastexcel.wordpress.com/2012/03/29/vlookup-tricks-why-2-vlookups-are-better-than-1-vlookup/

Databases also make extensive use of binary search technology to deliver good performance.  The primary tool used is the index, and there can be many indexes for one database table in order to provide a number of different ways to efficiently (i.e. binary search) find the row or rows desired.  When no index can be used, the database system must do a brute force search (i.e. linear search), and the larger the table is, the slower that search will be.  Even in cases where an index can be used, that index might not reduce the number of rows that must be inspected to provide good performance.  

You may encounter poor performance on smaller tables more often than larger tables.  Large tables tend to have more attention paid to them earlier than small tables.  Queries against tables when they’re small may perform just fine, since linear and binary searching are not all that different at smaller scales.  But when the small table grows over time, that lack of early attention slowly degrades performance.  Having a discussion with your database administrator to discuss tuning your query, adding an index or applying some other optimization technique can result in the same type of magical improvement in performance that the Excel example above delivered.  

Most importantly, just don’t accept bad performance as a fact of computing life.  In most cases there are alternatives and improvements available to solve your problem.

No comments: