Thursday, December 27, 2012
Google BigQuery
During the recent Google I/O 2012 conference I watched one of the keynote sessions from the comfort of my favorite web browser and was introduced to their BigQuery service, which is the public version of Google Dremel, their internal tool for analyzing large datasets. I was intrigued by the demonstrations on a dataset of 137 million records with query response times in the 3-5 second range. But was this like the tomato-slicing machines hawked on television that work great for their well-practiced spokesperson, but do a better job of making tomato juice in my kitchen? If this difference of a few orders of magnitude in performance was real, though, it could be a great benefit, and since the cost to try it out amounted to pocket change, I decided to see for myself.
First, a little background on the three key differences between BigQuery and familiar relational database technology. BigQuery uses a table scan for everything; there are no indexes or other mechanisms that write data to disk in a manner that may help later retrieval. It's these key differences that let a full scan happen with great speed.
The first difference is the column-oriented database approach, which simply means writing a table to disk column by column instead of row by row. Row by row is great for finding one or a few rows, as is typically needed for executing transactions, but it requires reading the entire table to read a single column. By storing the data column by column, an analytic query can read just the columns it requests, greatly reducing the amount of data that needs to be processed.
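To make the difference concrete, here's a minimal Python sketch, with made-up data and column names, of the same table stored row by row versus column by column. Summing one column in the columnar layout touches only that column's list.

```python
# Made-up three-row table, stored two ways.
rows = [
    {"region": "East", "product": "Widget", "units": 12, "revenue": 240.0},
    {"region": "West", "product": "Widget", "units": 7,  "revenue": 140.0},
    {"region": "East", "product": "Gadget", "units": 3,  "revenue": 90.0},
]

# Row-oriented: totaling one column still walks every field of every row.
total_units_rows = sum(r["units"] for r in rows)

# Column-oriented: each column lives in its own contiguous list (or file on disk),
# so a query that only needs "units" reads only that list.
columns = {name: [r[name] for r in rows] for name in rows[0]}
total_units_columns = sum(columns["units"])

print(total_units_rows, total_units_columns)  # 22 22
```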
The second difference is a high degree of compression. Since the data in a column is all of the same type and frequently contains many duplicate values, it compresses much better, quite often in the 10-to-1 range. So for example, say we have a 100GB table with 100 equally-sized columns and 10-to-1 compression, and we run a query retrieving 5 columns. Instead of reading 100GB we read just 500MB, a considerable improvement.
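That arithmetic, written out (all figures taken from the example above):

```python
table_size_gb   = 100   # total table size
column_count    = 100   # equally-sized columns
columns_queried = 5     # columns the query actually reads
compression     = 10    # 10-to-1

gb_scanned = table_size_gb / column_count * columns_queried / compression
print(gb_scanned)  # 0.5 GB, i.e. 500MB instead of 100GB
```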
The third difference is the number of servers that participate in the query. Google doesn't say how many servers a query will be spread across, and it likely varies with the size of the table queried and other factors, but they use enough that the response time stays fast enough to keep people coming back. It's a simple equation: the more you use, the more money they make, and the faster it performs, the more you're likely to use it.
For my test case I had 87,232,116 records consisting of 139 columns, for a total of about 45GB of data. I’m not saying this is “big data”, but it’s large enough to be interesting, and this had never been attempted before due to performance concerns. I compressed the data into gzip (.gz) files no larger than 1GB each, uploaded them to Google Cloud Storage and imported them into BigQuery using their Python-based BQ command line tool. A few other setup steps preceded this, and the data was already in a compatible, pipe-delimited form. Then, using the BigQuery web interface (bigquery.cloud.google.com), I ran several dozen queries, none of which took more than 5 seconds to complete. I also downloaded their Excel add-in, which allows queries to be executed from inside a spreadsheet, with equally impressive results.
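The preparation step is easy to sketch. The snippet below is a rough illustration, not the exact process I used; the file names and the size cap are made up for the example. It splits a pipe-delimited extract into gzipped chunks, capping each chunk's uncompressed size so the resulting .gz files land comfortably under the 1GB-per-file ceiling I was targeting.

```python
import gzip

CHUNK_BYTES = 1 * 1024**3  # cap on *uncompressed* bytes per chunk (illustrative)

def split_and_compress(source_path, prefix):
    """Split a large pipe-delimited file into numbered .gz chunks."""
    part, written = 0, 0
    out = gzip.open(f"{prefix}_{part:03d}.gz", "wt")
    with open(source_path, "rt") as src:
        for line in src:
            if written >= CHUNK_BYTES:          # start a new chunk
                out.close()
                part, written = part + 1, 0
                out = gzip.open(f"{prefix}_{part:03d}.gz", "wt")
            out.write(line)
            written += len(line)                # rough byte count; fine for a sketch
    out.close()

# split_and_compress("extract.psv", "extract_part")  # hypothetical file names
```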
The cost to use BigQuery is straightforward: twelve cents ($0.12) per month per GB stored and three and a half cents ($0.035) per GB scanned, with the first 100GB scanned per month free. So my testing cost $5.40, all of it storage. Not really a bank breaker.
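For anyone pricing out a similar test, the math fits in a few lines (rates as quoted above; the scanned amounts are just examples):

```python
def monthly_cost(stored_gb, scanned_gb):
    storage = stored_gb * 0.12                   # $0.12 per GB stored per month
    scanning = max(0, scanned_gb - 100) * 0.035  # first 100GB scanned is free
    return storage + scanning

print(monthly_cost(45, 80))    # 5.40 -- my case: scans stayed inside the free tier
print(monthly_cost(45, 300))   # 12.40 -- what heavier scanning would have cost
```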
Tuesday, October 9, 2012
The Outcome-Value Statement Revisited
A while back I wrote a blog post on a method I use to gain clarity on requirements by describing an Outcome, which is simply what you are trying to accomplish without describing how you plan to get there. These Outcomes are what drive Values, which are changes in cost or revenue, service provided or risk present. When that's clearly understood, the best Projects, Requirements or Activities that lead to the Outcome can be determined. As I describe this framework to people, I find I need to provide examples that are more easily understood. To that end, I turn to the game of golf.
I find golf more of a mental game than a physical game, which is good for me since I'm not particularly physically gifted (or maybe not at all) nor a natural-born golfer. But I can hit a variety of shots: high, low, fade, draw, etc. And this is simply a matter of physics. A golf ball only responds to the limited number of forces you can impart on it, and after it leaves your club head, gravity, air and the ground take over. The ball reacts to where on the club face you strike it, and to the direction, velocity and acceleration of the club head in each of the x, y and z axes. So you only have ten Outcomes to consider when making a shot. When you achieve those Outcomes, your ball will very predictably go exactly where, and how, you wanted. It doesn’t matter if you’re Jack Nicklaus, Happy Gilmore or a pudgy fifty-something. The golf ball isn’t looking at you. It just responds to the club.
Now let's turn to the shot you're attempting to hit. This is the Value. Let's say you're on the tee of a long, dogleg-left par 5. A long tee shot, shaped right to left, would be of Value. So you swing hard and hope your normal slice magically disappears this one time. As is typical, when you don't want a slice, you get a bigger one. Now you're in the woods, and if you're lucky you find your ball and chip it back into the fairway. And swear that next time you'll tee off with an iron instead. Or maybe you just stop at the swearing.
What was missing from this errant tee shot was a description of the Outcome you needed in order to achieve your Value. Hitting a long draw requires some specific Outcomes. The most important Outcomes in this case are (1) hitting the ball on the sweet spot, (2) having the club head square at impact, (3) the club head having a high velocity, (4) the club head moving from left to right relative to the ball and (5) the club head having a moderate rate of acceleration. These five Outcomes will cause the ball to fly a long way, thanks to the proper trajectory, a large amount of energy and some counter-clockwise spin (as viewed from above the ball). How you achieve these Outcomes is irrelevant. You can hit the ball with a tin can, a baseball bat or a golf club and the ball will fly exactly the same way, as long as the same Outcomes are achieved.
Armed with this viewpoint, you can then begin to figure out how you can accomplish this Outcome, or maybe just come to the conclusion that you can't hit that shot no matter how hard you try. If you can't hit the ball far enough, a successful draw simply puts you behind some trees with no direct second shot. If you just can't hit a draw to save your life, a long tee shot puts you through the fairway and no better off. For example, I really struggle to hit a fade, which for me, a right-handed golfer, means the ball goes left to right. I know that's because I stand farther from the ball than most people, and that makes it next to impossible to come across the ball from right to left and cause the Outcome, a clockwise spin, that I know I need. So I don't try to hit that shot.
Another example occurred a number of years back. Towards the end of the round, one of my playing partners was about forty yards behind a fairly tall tree. He elected to chip back out to the fairway. I told him that the tree wasn’t really in the way. To demonstrate, I dropped an extra ball and launched an 8-iron over the top of the tree. He was amazed, mainly because he noticed that I normally have a lower trajectory on my iron shots. He asked me how I did it. I responded, “I accelerated through the ball.” That causes the ball to stay in contact with the club head a little longer, which in turn causes the ball to roll up the club face a bit more and get more benefit from the loft of the 8-iron. That’s the Outcome I wanted. How did I do it? A shorter backswing and a longer follow-through. For me, that combination results in the needed acceleration. But again, the Outcome matters, not how you get there.
So in golf, I suggest studying the physics of the game first and clearly understand what makes the ball do what it will do. Only then begin figuring out the hows, like the grip, elbow and stance. Getting a firm grip on Outcomes will make you more effective, at work or on the golf course.
Tuesday, October 2, 2012
Do More With Less
If I hear another executive spout that we need to “do more with less”, I might just scream. Apparently they don’t read their audience’s reactions, which range from rolling eyes to demoralization to demonization. It says to them that management can’t figure it out, so we’ll be cutting spending and working longer hours. Or perhaps, like Pharaoh in The Ten Commandments, we’ll force the slaves to make bricks without straw. I’ve never seen it inspire the troops or become the rallying call to action.
But what are they trying to say? That we have to become more productive. It’s that simple. Not any more enlightening, perhaps, but at least it’s a better starting point for a conversation. And it’s a lot less threatening, so perhaps some folks will start to engage in figuring out how to measure productivity and improve upon it. I’ve been around long enough to have seen a few approaches that might help you figure out your plan.
Back in the early 1980s, our company had a small round of layoffs, and our department had to reduce its staff by seven people, roughly a 5% decline. That certainly fits the “with less” side, but it would have meant doing “a little less”. So we used the opportunity, with the company’s approval, to reduce staff by fifteen and then hire eight new people. The bottom 10% were let go and a more talented 5% added. At the end of the process we were more productive.
Information Technology has a built-in advantage in that hardware and telecommunication costs decrease year over year at roughly the pace described by Moore’s Law, an observation made by Gordon Moore, Intel’s co-founder. He predicted that the number of transistors on an integrated circuit would double every two years for the next ten years. That prediction has held for the last forty years without any obvious end in sight. The result is an exponential curve of productivity and the built-in IT advantage, at least if you’re prepared to take advantage of it. I ran a network group for several years at a company roughly the size of my current company. Over the twenty-year span from then to now, the network budget decreased about 80%, and where a T-1 line (1.5 Mbps) was once considered top-of-the-line, we now routinely deploy lines 10 to 30 times faster. Phone calls used to be $0.25 per minute; now they are under $0.02 per minute. Showing improved productivity as the network group manager was a pretty easy task, and it funded other parts of the IT group to tackle new projects.
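The power of that exponential is easy to see with a couple of lines of arithmetic: doubling every two years over the forty-year span mentioned above.

```python
years = 40
doublings = years // 2          # one doubling every two years
print(2 ** doublings)           # 1048576 -- roughly a million-fold improvement
```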
My advice for improving productivity: start by throwing out costly, time-consuming, lower-value work and use some of that savings to fund higher-value projects, particularly those that need early funding to create lower costs in the future. You’ll need a very good understanding of your costs, a willingness to change anything, to break off from traditional vendor relationships and to invest in choices that prepare you for lower costs later. But that can be a lot of fun. A lot more than making bricks with no straw.
Friday, September 14, 2012
True Collaboration
The finest example of true collaboration I had the pleasure to witness took place during an SAP project years back. It started out simply enough when I needed to produce some statistics from SAP but lacked the location of each user. We needed to get that data into each SAP user’s profile from some external source. A simple problem statement, yet lacking a simple answer.
The solution presented itself through a most unlikely collaboration source: a simple email stream. It started with an idea that would work, but at a fairly steep cost, in the six-figure range. That first email was sent to about a dozen people. A short time later, someone else improved on the first idea. Then another, and then another. Somewhere in the middle of the ten or so emails that eventually made up the stream of ideas, I improved on the idea. And then my idea was further improved. At the end of the stream, the final idea would take a couple hours of time and no further outlay of dollars. I sat amazed at this string of creativity and the fantastic solution.
Then I took a step back and realized how fantastic and unique this was from a people standpoint, and how the credit belonged not just to the final idea’s creator, but to everyone involved.
The person submitting the original idea took what most people would consider a bold, and probably unwise, risk. It takes uncommon bravery to write down a well-thought-out idea for a group of bright people to critique. That bravery cannot be overstated, and even though we all knew each other pretty well, it can still be a risky thing to do, particularly to one’s own ego. But without that start, it’s likely the problem would not have been solved. He deserved a special thank you and a nice chunk of the credit.
Then there was the group of people, myself included, that incrementally improved on the first idea. And although we didn’t find the final solution either, we kept the energy alive and the ideas flowing. Each of us deserves some of the credit for getting to the final solution.
The person with the final solution certainly deserves their share of credit. They designed a very elegant solution that was quick to implement and at only the cost of a couple hours of time.
Bravery, energy and ideas are the lifeblood of collaboration, not cool social media tools. Start and end with people. Give credit to everyone that participates. The rest (tools) will take care of itself.
Monday, July 9, 2012
5-50-500-5000
I mostly hear complaints about Change Management. Too much paperwork, too many meetings and the process slows everything down. All true, and if the Sarbanes-Oxley legislation didn't exist, I believe most IT departments would have abandoned the process years ago. And that would have been an absolute shame.
A decent Change Management process is trying to tell you how risky the change you're making is, in terms of the impact to the business if things go wrong. It's also trying to match the level of testing, backout planning and so on to mitigate that risk. And the underlying cause in many instances is that we've designed our technology solutions as "big bang" implementations. Have you noticed how big Internet and mobile device companies implement their changes? They typically have a beta program which engages risk-tolerant people first. When a few cycles of this pass and the known bugs are worked out, they begin a slow trickle of upgrades, ready to halt the process at a moment's notice. When all is good for a decent chunk of their users, they upgrade the remainder in short order. They've avoided the "big bang" approach, shortened their cycle times and not upset their customers or their business.
Sure, you say, they have advantages internal IT shops don't have, and in some cases that's true. But in many other cases we have the same advantages and just haven't used them. Which brings me to the title of this blog, 5-50-500-5000.
Years ago our email system, Lotus Notes, became an increasingly important service. New software releases came out frequently and offered compelling new features and performance gains, and we wanted to deploy them as quickly as possible. But with 5000 email users on a single system and everyday business counting on it, any major change was a very high risk. A test system was of little help, since a few technical people could not adequately test everything, primarily because 5000 users do a lot of different things. Our solution was to break up the email system into four partitions, while still making it appear as a single email system. The first partition held about 5 users, just members of the core technical staff. The second held about 50 users, a mixture of IT and risk-tolerant users. The 500 system was a broad, representative set from across the entire organization. The remainder fit into the 5000 set.
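In code, the same idea looks something like the sketch below. In our Notes setup the ring members were hand-picked (core technical staff, risk-tolerant users, a representative cross-section), so the hashing here is just an illustrative way to get stable buckets of roughly those sizes; the names are made up.

```python
import hashlib

RING_BOUNDS = [5, 55, 555]   # cumulative sizes: first 5, next 50, next 500

def rollout_ring(user_id, population=5000):
    """Map a user to a stable upgrade ring: 0, 1, 2, or 3 (everyone else)."""
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % population
    for ring, bound in enumerate(RING_BOUNDS):
        if bucket < bound:
            return ring
    return len(RING_BOUNDS)     # the "5000" ring upgrades last

print(rollout_ring("jsmith"))   # hypothetical user id
```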
The risk to the business of upgrading the 5, 50 or 500 groups was very tolerable. When it came time to upgrade the 5000 group, we had reasonable assurance that things would go smoothly, again with respect to the entire business. Those Change Management meetings were typically smooth and short, which they should be when the risk is largely mitigated by the facts at hand. The real victory was that users experienced few problems and enjoyed new email features.
Listen, learn and adapt your IT services to what your processes are telling you. Your customers, and yourself, will benefit from the results.
Monday, May 7, 2012
Linear and Binary Search
I find performance problems with applications fascinating, particularly the ones that involve really unreasonable response times. A common cause is a lack of understanding of, and visibility into, how searches are performed.
In simple terms, searching can be performed by starting at the beginning of whatever you’re searching, looking at the first item to see if it’s the one you’re looking for, and proceeding until you find it. This method is referred to as a linear search, after its namesake, the line. For small numbers of things a linear search doesn’t take too long, for example if you’re looking for a particular jar in your kitchen’s spice rack. But if you have a really large number of jars, like a grocery store, you put them on the shelf in order by name, with allspice coming before rosemary and rosemary before tarragon. This sorted order allows for a much quicker search since you can eliminate large portions of the spice rack at one time. You look at the middle of the shelf and find poppy seeds. Since you’re looking for salt, you know which side of the poppy seeds it must be on. A few quick repetitions of this method and bam!, you have salt. This searching method is called a binary search, and while it takes some effort to put, and keep, the spices in sorted order, it’s well worth it for larger numbers of jars.
Computers love binary searches when dealing with millions or billions of pieces of data. The math is pretty simple. If I have a table of 1,000,000 numbers and perform a linear search, I have to look at 500,000 pieces of data on average. A binary search needs to look at no more than 20 pieces of data, as it divides the data in two, figures out which half contains what it’s looking for and repeats the process. 2^19 is 524,288, not quite enough, and 2^20 is 1,048,576, which is a little bit more. So you just find the power of 2 that’s equal to or larger than the number of data items you have and bam!, the exponent is your maximum number of tries. Comparing 500,000 to 20 iterations is a no-brainer. Now try 1,000,000,000 (a billion). Linear takes 500,000,000. Binary takes 30. ‘Nuf said.
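You can watch those numbers fall out of a few lines of Python by counting the comparisons each approach makes against a sorted list of one million numbers (the target value is arbitrary):

```python
from bisect import bisect_left

data = list(range(1_000_000))   # already sorted
target = 777_777

# Linear: walk from the start until we hit the target.
linear_steps = next(i for i, v in enumerate(data) if v == target) + 1

# Binary: halve the search range each time (this is what bisect_left does internally).
binary_steps, lo, hi = 0, 0, len(data)
while lo < hi:
    binary_steps += 1
    mid = (lo + hi) // 2
    if data[mid] < target:
        lo = mid + 1
    else:
        hi = mid
assert lo == bisect_left(data, target) == target

print(linear_steps, binary_steps)   # 777778 vs. 20
```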
Two examples illustrate how this works, using a couple of typical pain points: Excel and databases.
Excel has a nice function called VLOOKUP that allows searching a range of cells (aka a table). Excel will do a linear search if the FALSE parameter is used or if the range of cells is not in sorted order. A binary search requires both sorted data and the TRUE parameter. This has little impact if you’re dealing with small sets of data and a few VLOOKUPs. Excel searches about 100,000,000 cells per second on my laptop. That’s a lot, unless you’re searching 10,000,000,000 (ten billion) cells, in which case it takes 100 seconds. Using binary searches instead would take less than a quarter of a second, a 400-fold improvement. An excellent writeup on how to code VLOOKUPs using the TRUE parameter can be found at:
http://fastexcel.wordpress.com/2012/03/29/vlookup-tricks-why-2-vlookups-are-better-than-1-vlookup/
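For those who think in code rather than spreadsheets, here is a rough Python analogue of the sorted, approximate-match lookup: like VLOOKUP with TRUE, it returns the entry for the largest key less than or equal to the lookup value, using a binary search. The keys and values are made up.

```python
from bisect import bisect_right

keys   = [0, 10, 20, 30, 40]        # must already be sorted, just as in Excel
values = ["F", "D", "C", "B", "A"]

def lookup(x):
    i = bisect_right(keys, x) - 1   # binary search: ~log2(n) steps
    return values[i] if i >= 0 else None

print(lookup(27))   # "C" -- falls into the 20-29 bucket
```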
Databases also make extensive use of binary search techniques to deliver good performance. The primary tool is the index, and there can be many indexes on one database table in order to provide a number of different ways to efficiently (i.e. binary search) find the row or rows desired. When no index can be used, the database system must do a brute-force search (i.e. linear search), and the larger the table, the slower that search will be. Even in cases where an index can be used, that index might not reduce the number of rows that must be inspected enough to provide good performance.
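SQLite, which ships with Python, makes the effect easy to see: the same query's plan changes from a full-table scan to an index search once an index exists on the filtered column. The table, column and index names below are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, f"cust{i % 1000}", i * 1.5) for i in range(10_000)])

query = "SELECT total FROM orders WHERE customer = 'cust42'"

# Without an index the plan reports a SCAN (a linear, row-by-row search).
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

# With an index the plan reports a SEARCH ... USING INDEX (binary-search style).
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer)")
print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
```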
You may encounter poor performance on smaller tables more often than on larger ones. Large tables tend to get attention paid to them earlier than small tables. Queries against a table while it’s small may perform just fine, since linear and binary searching are not all that different at small scales. But when that small table grows over time, the lack of early attention slowly degrades performance. A conversation with your database administrator about tuning your query, adding an index or applying some other optimization technique can result in the same kind of magical improvement in performance that the Excel example above delivered.
Most importantly, just don’t accept bad performance as a fact of computing life. In most cases there are alternatives and improvements available to solve your problem.
Tuesday, February 7, 2012
Your Own Device
BYOD (Bring Your Own Device) is all the rage these days with the onslaught of truly portable and useful devices combined with relatively free-spending personal technology budgets. But what an employee wants when trying to bring their productivity into their workspace depends on how their company’s infrastructure has adapted, or not, over the last decade as the non-portable versions of their technology, for example a home PC, have been permitted, or not, to access company email and applications.
If UYOD (Use Your Own Device) has gained traction, then BYOD is a simple matter of providing Internet-facing Wi-Fi access while at work. Using, not Bringing, is the key point here, and UYOD should be the focus of this effort; after all, these new devices are meant to be used anywhere, and the office is just one place an employee would find a smartphone or tablet useful. UYOD also speaks to the much larger effort required to securely deliver and support IT services on these devices. BYOD typically means that a personal device is connected to the company’s internal network; UYOD only requires the same Internet access the device has when it’s not in the office.
UYOD is easily enabled by creating a separate Wi-Fi network that only provides Internet access, just like I have defined on my home wireless network to allow my friends Internet access without accessing my home’s internal network. If there were internal services on my home network that I wanted a friend to use, I would find a secure way of providing that, while still keeping them on my guest wireless network.
If the Internet and its hundreds of millions of web sites are built on a UYOD basis, maybe it’s time to see the light.