Thursday, December 27, 2012
During the recent Google I/O 2012 conference I watched one of the keynote sessions from the comfort of my favorite web browser and was introduced to their BigQuery service, which is the public version of Google Dremel, their internal tool for analyzing large datasets. I was intrigued by the demonstrations on a dataset of 137 million records with query response times in the 3-5 second range. But was this like the tomato-slicing machines hawked on television that work great for their well-practiced spokesperson, but do a better job of making tomato juice in my kitchen? But if this few order of magnitude difference in performance was real, it could be a great benefit, and since the cost to try it out amounted to pocket change, I decided to see for myself.
First a little background on the three key differences between using BigQuery and the familiar relational database technology. BigQuery uses a table scan for everything. No indexes or other mechanisms to write data to disk in a manner that may help later retrieval. Its those key differences that make this happen with great speed.
The first difference is using a column-oriented database approach, which simply is writing a table to disk column by column instead of row by row. Row by row is great for finding one or a few rows, like is typically needed for executing transactions, but would require reading the entire table to read a single column. By storing the data column by column, an analytic query can just read the columns requested, greatly reducing the amount of data that needs to be processed.
The second difference is a high degree of compression. Since the data in a column is the same type and frequently contains large amounts of duplicates, it’s much more likely to compress well, quite often in the 10-to-1 range. So for example, say we have a 100GB table with 100 equally-sized columns and 10-to-1 compression and we run a query retrieving 5 columns. Instead of reading 100GB we read just 500MB, a considerable improvement.
The third difference is the number of servers that participate in the query. While Google doesn’t comment on how many servers a query will be spread across, and it likely will vary on the size of the table queried and other factors, they use enough that the resulting response time stays so fast that people are motivated to use it alot. It’s a simple equation. The more you use, the more money they make, and the faster it performs the more you’re likely to use.
For my test case I had 87,232,116 records consisting of 139 columns, for a total of about 45GB of data. I’m not saying this is “big data”, but it’s large enough to be interesting and this had never before been attempted before due to performance concerns. I compressed the data into gzip (.gz) files no larger than 1GB each, uploaded them to Google Cloud Storage and imported them into BigQuery using their Python-based BQ command line tool. There are a few other setup steps that preceded this and the data was already in a form, pipe-delimited, that was compatible. Then using the BigQuery web browser interface (bigquery.cloud.google.com) I ran several dozen queries, none that took more than 5 seconds to complete. I also downloaded their Excel add-in which allows queries to be executed from inside a spreadsheet, with equally impressive results.
The cost to use BigQuery is straightforward. Twelve cents ($.12) per month per GB stored and three and a half ($0.035) per GB scanned. The first 100GB scanned per month is free. So my testing cost $5.40, all in storage costs. No really a bank breaker.