I’m an engineer, for better or worse. I live by what I can weigh and measure. When someone tells me they have “big data” and are going to “put it in the cloud,” I put my hand on my wallet. I can’t help it—I get a fairyland feeling. To me, “too big to measure” sounds suspiciously like “too big to fail,” and both terms make me nervous. What is this stuff? My engineer brain rebels.
But there really must be such a thing as big data, because suddenly everyone is talking about it. And people are sure they need it, even if they’re not exactly sure what it is. Join me as I try to shed a little light on the latest buzzword.
How Big Is Big?
Let’s start with a practical example: Take a piece of paper and write out your name, address, phone number, and e-mail, then count the characters. (Engineers like me love this stuff; give yourself a gold star if you used graph paper.) It will probably be a little under 200 characters. Using that as a baseline, we can extrapolate: five address book records per 1000 characters (a kilobyte, or “K”); five thousand entries per million characters (a megabyte, or “Meg”); and five million entries per billion characters (a gigabyte, or “Gig”). Multiply that by another factor of 1000, and you’ve got five billion address book entries and a trillion characters of data—a terabyte. Almost one entry for each person in the world. That’s a lot of data, but it’s not “big” data—a terabyte or two is well within the capabilities of current SQL databases.
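If you'd rather let the computer do the extrapolation, the same back-of-envelope arithmetic can be sketched in a few lines of Python (assuming the round figure of 200 characters per record from the paper exercise above):

```python
# Rough scaling estimate: how many ~200-character address book
# records fit in each storage unit. The 200-byte figure is the
# illustrative baseline from the exercise above, not a real schema.
BYTES_PER_RECORD = 200

UNITS = [
    ("kilobyte", 10**3),
    ("megabyte", 10**6),
    ("gigabyte", 10**9),
    ("terabyte", 10**12),
    ("petabyte", 10**15),
]

for name, size in UNITS:
    records = size // BYTES_PER_RECORD
    print(f"One {name} holds roughly {records:,} records")
```

Running it confirms the figures in the text: five records per kilobyte, five billion per terabyte, and a dizzying five trillion per petabyte.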
But the next step up is a petabyte—a thousand terabytes. That’s big data, by anyone’s definition. Where and how do you store that much data? What tools do you use to analyze it? And how in the world do you make a backup copy? Now we begin to grasp the problem.
More Than Just Big
Like so many convenient labels, the term “big data” can be somewhat misleading. When we give big data a closer look, we find that more than just its size makes it special. Big data is often discussed in terms of Volume, Variety, and Velocity. We just looked at the issue of Volume above.
Variety refers to the fact that the data is not always regularly formatted, or may link together in very complex ways. Importing millions of expense entries, where each entry has a date, an amount, a category, and a description is a fairly simple task. But how do we import, store, and extract meaning from Twitter or Facebook threads? The job gets messy quickly.
The third challenge is Velocity. Not only is there more data than ever before, it is moving faster. One of the challenges of big data processing is loading the data as quickly as it is generated. A system that is never fully caught up because data never stops flowing in is acceptable, but a system that falls farther and farther behind is not.
It might be more accurate to describe big data as “hard-to-handle data.” But that doesn’t roll off the tongue so well. The reason we pursue big data processing even though it’s difficult is that there is a huge amount of business value in these large, fast-moving, complex data stores. There are enough fish in this ocean of data to feed us all. But we need a very big net, and a boat to haul it.
The New World
In my career as a database developer, I have gone from small-business systems running on a single machine, to client-server transactional systems driving web-based e-commerce platforms, to terabyte-scale data warehouses. But the new tools and new challenges of big data make me feel like a beginner again. And perhaps that is what I resent. It’s time for me to stop being the teacher and start being a student. I have a lot to learn, and hope to share that new knowledge in future posts.
Where are you in your big data journey? Skeptic, seeker, explorer, guide, or tycoon? Hopefully not a casualty. Let me know about your successes and struggles in the comments. Your input will help guide the discussion in future posts. Read part two of this series here.