Welcome to the second part of my journey to explore the world of big data (read part one here). Though I am an experienced database developer familiar with both transactional and business intelligence systems, the world of high-volume, complex, fast-moving data blithely labeled “big” is sending me back to school. Join me in my efforts to beat back ignorance and ship cool solutions.
At this point, you may be convinced that your company has big data that is underused or undervalued, and you want to find a way to import it, organize it, and start mining it for its incalculable business value. Allow me to step in and give you a single word of advice:
Don’t.
What? Consultants are supposed to write about why you should buy what they’re selling. But let’s look at the implementation details. Jumping into the deep end of your first big data project involves quoting and procuring huge amounts of storage; choosing, buying, and acquiring skills on new data analysis tools; and almost certainly staff augmentation, in terms of new head count or outside vendors. With that many variables, there is a lot of potential downside. One or two costly mistakes could cripple the project.
For your first adventure, I would advise using the tools and storage you already have, and learning to sip from the fire hose. Just because something is possible doesn’t mean it’s feasible, especially for a team charting new territory. Here are four ways to cut big data down to size, so you can acquire skills and demonstrate business value using only a few hundred gigabytes of data, instead of petabytes.
Subset. Let’s say you have access to a support forum or social media feed, with a large and active base of users and messages. Big data nirvana would be to import the entire data store, and use natural language processing to determine preference, intent, and intensity. Who likes us and why? What do they wish we did differently? Who are the opinion leaders? This project would be amazingly cool, but daunting in its complexity.
What if you didn’t import the text? By using only a subset of the available fields, you could import data more quickly, use less storage, and perform simplified but valuable analysis. If you import the date, and the IDs of the forum category, thread, message, and user, each record would be less than 50 characters wide. You could import billions of them, and analyze message volume over time, messages per thread, who posts, and how often. You could even identify clusters of users who post together or form communities. By tying in summary data regarding marketing activities and sales, you could see how forum activity drives or reacts to external events. All using a database easily handled by your existing tools, and a few gigs of unused storage. And all without importing or parsing a word of forum text.
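To make the subset idea concrete, here is a minimal sketch in Python. It assumes the forum can be exported as CSV; the column names (`posted_on`, `category_id`, `thread_id`, `message_id`, `user_id`, `body`) are hypothetical, and the three rows are made up for illustration. The point is only that the bulky text column is dropped on the way in, and simple counts still work on what remains.

```python
import csv
import io
from collections import Counter

# Hypothetical column names; a real forum export will differ.
KEEP_FIELDS = ["posted_on", "category_id", "thread_id", "message_id", "user_id"]


def subset_rows(lines, keep_fields=KEEP_FIELDS):
    """Yield each record reduced to the narrow set of fields we want to import."""
    for row in csv.DictReader(lines):
        yield {field: row[field] for field in keep_fields}


# Tiny made-up export, including the message text we choose to skip.
SAMPLE_EXPORT = io.StringIO(
    "posted_on,category_id,thread_id,message_id,user_id,body\n"
    "2023-03-01,7,100,1,42,Long message text...\n"
    "2023-03-01,7,100,2,43,Another long reply...\n"
    "2023-03-02,9,101,3,42,Yet more text...\n"
)

narrow = list(subset_rows(SAMPLE_EXPORT))

# Volume analysis needs only the IDs, never the text.
messages_per_user = Counter(row["user_id"] for row in narrow)
messages_per_thread = Counter(row["thread_id"] for row in narrow)
```

In a real import you would most likely stream the narrow rows straight into a database table rather than hold them in a list; the list is here only to keep the sketch short.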
Segment. A data store may be too big to process in its entirety. But you could learn a lot and develop skills for future projects by dividing the data by date or geography, and processing a smaller segment. You may have data for five continents and five years. But your pilot project might be to look at Canada last year, or the U.S. last quarter. Start small with big.
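Segmenting can be sketched as nothing more than a filter on geography and date. The record layout below (`country`, `day`, `orders`) and the values are invented for illustration, not taken from any real system.

```python
from datetime import date


def in_segment(record, country, start, end):
    """True when the record falls inside the chosen geography and date range."""
    return record["country"] == country and start <= record["day"] <= end


# Made-up records standing in for a much larger data store.
ALL_RECORDS = [
    {"country": "CA", "day": date(2023, 2, 10), "orders": 5},
    {"country": "CA", "day": date(2022, 11, 3), "orders": 8},
    {"country": "US", "day": date(2023, 2, 10), "orders": 12},
    {"country": "CA", "day": date(2023, 9, 30), "orders": 3},
]

# Pilot segment: one country, one calendar year.
pilot = [
    r for r in ALL_RECORDS
    if in_segment(r, "CA", date(2023, 1, 1), date(2023, 12, 31))
]
pilot_orders = sum(r["orders"] for r in pilot)
```

If the source system can filter for you (for example, with a WHERE clause in the extract query), it is usually better to apply the segment there, so the unwanted rows never leave the source.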
Sample. Instead of importing all data for a small segment, you could import a representative sample of the whole. Locate or create a numeric field whose trailing digits vary evenly from record to record, and keep only the records whose last two digits match a single chosen value; that selects roughly 1% of the data to process. Sampling or segmentation can give your executive stakeholders a taste of both the value contained in big data and the difficulty in obtaining that value, without making them commit to a full-scale project. Lessons learned can be used to guide planning for future phases.
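The last-two-digits trick might look like the sketch below. The choice of `42` as the digits to keep is arbitrary, and the sequential IDs are a stand-in for whatever numeric field you actually have.

```python
def in_sample(record_id, keep_digits=42):
    """Keep a record when the last two digits of its numeric ID match keep_digits."""
    return record_id % 100 == keep_digits


# Stand-in for a large table: 100,000 sequential record IDs.
all_ids = range(1, 100_001)
sample_ids = [i for i in all_ids if in_sample(i)]

# Share of the data that was kept.
sample_fraction = len(sample_ids) / len(all_ids)
```

One caution: this only approximates a random sample when the trailing digits are unrelated to what you are measuring. With sequential IDs it takes every hundredth record, and a field such as a price that often ends in 99 would skew the result badly.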
Summarize. In this approach, you import all the raw data (or as much as you can, using segmentation or sampling), summarize key characteristics, and store only the summary for later analysis. This approach drastically reduces the total storage space required. This might be a good phase two approach, because your team will need to deal with full-scale velocity (can you process all the data as quickly as it is being generated?), but can make compromises regarding data variety/complexity and volume.
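As a rough sketch of summarizing, the code below reads raw events one at a time, keeps a running count, total, and maximum per day, and discards each raw event once it has been counted. The event shape, a `(day, response_seconds)` pair, is an assumption made for the example.

```python
from collections import defaultdict


def summarize(events):
    """Reduce a stream of (day, value) events to one small summary row per day."""
    summary = defaultdict(
        lambda: {"count": 0, "total": 0.0, "max": float("-inf")}
    )
    for day, value in events:
        row = summary[day]
        row["count"] += 1
        row["total"] += value
        row["max"] = max(row["max"], value)
    return dict(summary)


def raw_events():
    """Made-up event stream; a generator, so raw rows are never all in memory."""
    yield ("2023-05-01", 2.0)
    yield ("2023-05-01", 4.0)
    yield ("2023-05-02", 10.0)


daily = summarize(raw_events())
average_first_day = daily["2023-05-01"]["total"] / daily["2023-05-01"]["count"]
```

Storing the count and total rather than a precomputed average is a deliberate choice: daily rows can later be rolled up into weeks or months without going back to the raw data.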
Your first big data implementations will drive lots of learning—new concepts, vocabulary, and expectations. By starting with segments, subsets, samples, or summary data, you can use the database tools you’re already proficient in, and not have to start over on a new platform as well. Plus the reduced size of the initial data sets will give you a scale model you can use to refine your techniques.
Medium—it’s like big, only smaller. Next time we’ll look at some of those new tools. I look forward to reading your comments, which will help guide the direction of future posts.
Great advice, Norm. One thing I wonder about is how to “sell” this strategy up the chain of command. I can easily see C-suite people saying they want all of the data processed. “We don’t want to leave anything on the floor.” I think that if you bring up some of these alternatives along with the many variables you pointed out, it could lead to success.
I think you can sell it by analogy. One good one would be a scatter plot. You don’t need every data point in a scatter plot to see the trend. In the same way, you can gain valuable insight when only working from a sample or segment of the data.
But I think what C-suite stakeholders would appreciate most is that this approach minimizes risk and cost. And it’s intended as a transition to a big data implementation, not a replacement for it. A limited pilot project allows the team to build their skills and vocabulary and start tackling the data volume and velocity challenges, without having to eat the whole elephant in one bite.
nb