Don't get Hadooped

Nick Elprin | 2014-10-16 | 7 min read

I have seen many companies that are interested in improving their analytical capabilities, and I have noticed a disconcerting pattern: in many cases, companies are drawn to Hadoop (or Spark, or similar solutions) not because it solves a real problem they have, but because it is a shiny, trendy piece of technology.

Don't get me wrong: Hadoop (and its ilk) is a fantastic piece of technology, and there are plenty of problems that call for it. But that doesn't mean it's right, or even helpful, for everyone. I want to describe three common mistakes I see when people are irrationally attracted to Hadoop.

1: Medium Data

It's been said that "big data is like sex in high school: everyone talks about it, everyone claims they're doing it, but hardly anyone is." Typically when someone tells me they have big data, and I ask them how much data they actually have, they sheepishly reply with something between 10 and 100GB.

If your data can fit in memory on a standard EC2 machine, you do not have big data

Hadoop is a great solution for truly large data sets: Internet-scale data, bioinformatic data, some financial transaction data, etc. But if you have less than 250GB, then your entire data set can fit in memory on a single machine. And that machine even has 32 cores, which makes it a pretty good tool for running some map-reduce work.

2: Hadoop != Map-reduce

I also hear people conflate Hadoop and map-reduce. Hadoop is a powerful implementation of map-reduce, but it's quite easy to run map-reduce analysis on a single machine, using great libraries in R, Python, and other statistical languages. All these languages have great tools for mapping your tasks over multiple cores. You can even use clusters in IPython Notebook to map your tasks across cores.

Again, unless you have a scale of data that requires more than, say, 100 cores or a couple hundred gigs of memory, you may not need the complexity and maintenance costs of Hadoop. Map-reduce on a single machine may be great for you.
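To make the single-machine option concrete, here is a minimal sketch in Python of a map-reduce word count spread over local cores with the standard library's multiprocessing.Pool. The function names (count_words, merge_counts, split_into_chunks, word_count) and the sample data are made up for illustration, and the mapper parameter is an assumption added so the same code can also run serially.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def count_words(chunk):
    """Map step: count the words in one chunk of lines."""
    return Counter(word for line in chunk for word in line.split())


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


def split_into_chunks(lines, n_chunks):
    """Split the input into at most n_chunks roughly equal pieces."""
    size = max(1, -(-len(lines) // n_chunks))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def word_count(lines, n_workers=4, mapper=None):
    """Map over chunks (in parallel unless a serial mapper is given), then reduce."""
    chunks = split_into_chunks(lines, n_workers)
    if mapper is not None:
        # Hypothetical hook: pass the built-in map to run without extra processes.
        partials = list(mapper(count_words, chunks))
    else:
        with Pool(processes=n_workers) as pool:
            partials = pool.map(count_words, chunks)
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    # Illustrative sample data, not a real data set.
    sample = ["big data is big", "medium data is not big data"] * 1000
    print(word_count(sample).most_common(3))
```

Whether the parallel version actually beats the serial one depends on how expensive each chunk is compared with the cost of shipping it to a worker process, so it is worth timing both on your own workload.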

3: Hadoop is a technology, not a solution

Probably the most common, and most subtle, pitfall I see companies fall into is expecting Hadoop (or some other hot technology) to be a solution when they don't really understand their actual problem. Usually this happens when a company wants to improve its analytical capabilities, but doesn't know how to start. They hear about trendy technology products, and they believe that bringing these products onboard will get them started in the right direction.

Before pursuing a "big data solution"...

When we dig in with companies, we often unearth a set of problems that call for a very different solution. They're more complicated, yes, but they are foundational issues that will make a real difference when addressed.

Most often, the biggest problem for the companies I talk to is the lack of a standardized workflow that facilitates best practices. When I give advice about how to stand up a modern analytics capability, it usually starts with good hygiene:

Keep work centralized, to enable sharing and knowledge management.
Use version control so you can reproduce past work.
Organize your work so it's portable, i.e., it's not tied to any specific person's machine. (This involves removing hidden dependencies on absolute file paths, system-wide libraries, etc.)
Write your code so that it is modularized, so the right parts can be reused, and the right parameters are exposed (rather than hard-coded).
Keep work in a single language throughout the lifecycle of a model (interactive exploratory work, through refinement, to deployment). Don't create a world where you need to be translating your code between different languages at different steps of your process.
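As a rough illustration of the portability and parameterization points above, here is a minimal Python sketch of a job entry point. Everything in it is hypothetical: the argument names (--project-root, --input, --learning-rate), the default file data/train.csv, and the helper names are assumptions chosen for the example, not part of any particular tool.

```python
import argparse
from pathlib import Path


def build_parser():
    """Expose inputs and tuning knobs as arguments instead of hard-coding them."""
    parser = argparse.ArgumentParser(description="Hypothetical training job")
    parser.add_argument("--project-root", type=Path, default=Path.cwd(),
                        help="directory all other paths are resolved against")
    parser.add_argument("--input", type=Path, default=Path("data/train.csv"),
                        help="input file, relative to the project root")
    parser.add_argument("--learning-rate", type=float, default=0.1)
    return parser


def resolve_input(project_root, relative_path):
    """Reject absolute paths so the job is not tied to one person's machine."""
    if relative_path.is_absolute():
        raise ValueError(f"expected a path relative to the project root: {relative_path}")
    return project_root / relative_path


def main(argv=None):
    """Parse arguments and return the resolved settings (the training step is omitted)."""
    args = build_parser().parse_args(argv)
    return {
        "input": resolve_input(args.project_root, args.input),
        "learning_rate": args.learning_rate,
    }


if __name__ == "__main__":
    # Demo with explicit arguments; call main() with no arguments to read sys.argv.
    print(main(["--learning-rate", "0.05"]))
```

Because the same script runs unchanged from any checkout of the project, it can be put under version control and rerun later with the parameters recorded alongside the results.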

Hadoop is not a solution to these problems. In fact, you can easily create an unsustainable mess by building an analytical workflow on top of Hadoop without thinking through what your real problems are.

Diagnose your real problems

Here are some questions you should ask before pursuing a "big data" solution:

What are your biggest problems and pain points around your current analytical capabilities? If you could "wave a magic wand" and change anything about your analytical process, what would it be?
Will a big data solution actually address your pain points? And if so, what else will be required to turn a piece of technology (e.g., Hadoop) into actual business value for you? For example, will any workflows or business processes need to change? What will these changes cost or require (for example, changing current code to work in a Hadoop paradigm)?
Figure out, analytically, how many compute resources you actually need, in terms of cores and memory. You can do this with some back-of-the-envelope math:
- How much data do you have?
- How many independent "parts" can it be partitioned into?
- How long does your computation take on each part?
- In how much time do you want your entire calculation to finish?

One quick example: I spoke to a sophisticated analytical consulting firm that insisted they needed to build an in-house solution on top of Apache Spark. When I asked about their specific use cases, they cited a need to process a dataset with 10,000 records, where each record would take 10 seconds to process, and they needed a solution that would finish the calculation within a few hours. That is about 100,000 seconds (roughly 28 hours) of serial compute, but on a single 32-core machine it would take under an hour.
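The back-of-the-envelope math can be sketched in a few lines of Python. This is a rough estimate only: it assumes the parts are fully independent, take about the same time each, and that scheduling and I/O overhead are negligible. The function names are made up for illustration, and the 3-hour deadline is an assumed stand-in for "a few hours".

```python
import math


def wall_clock_seconds(n_parts, seconds_per_part, n_cores):
    """Estimated run time when independent parts are spread evenly over cores."""
    rounds = math.ceil(n_parts / n_cores)  # parts each core handles, rounded up
    return rounds * seconds_per_part


def cores_needed(n_parts, seconds_per_part, deadline_seconds):
    """Smallest core count that finishes all parts within the deadline."""
    parts_per_core = math.floor(deadline_seconds / seconds_per_part)
    if parts_per_core < 1:
        raise ValueError("a single part takes longer than the deadline")
    return math.ceil(n_parts / parts_per_core)


if __name__ == "__main__":
    # Numbers from the example above: 10,000 records at 10 seconds each.
    seconds = wall_clock_seconds(10_000, 10, n_cores=32)
    print(f"32 cores: about {seconds / 60:.0f} minutes")
    # Assumed deadline of 3 hours for "a few hours".
    print(f"cores needed for a 3-hour deadline: {cores_needed(10_000, 10, 3 * 3600)}")
```

Under these assumptions the job fits comfortably on one multi-core machine, which is the point of running the numbers before choosing a cluster technology.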

An Alternative Solution

I would be remiss if I didn't mention, briefly, that we have built Domino to be a data science platform that facilitates best practices throughout the analytical lifecycle. Domino integrates nicely with Hadoop, of course -- Domino can run and track Hadoop jobs, just like it can run and track tasks in R, Python, and other languages -- but it may be the case that you don't really need Hadoop. Or at the very least, you might need it and a well-designed platform to place it on.

Spread the word

If you, or others you know, have been a victim of Shiny Technology Syndrome and fallen for Hadoop when you didn't need it, please share this with others.

Nick Elprin

Nick Elprin is the CEO and co-founder of Domino Data Lab, provider of the open data science platform that powers model-driven enterprises such as Allstate, Bristol Myers Squibb, Dell and Lockheed Martin. Before starting Domino, Nick built tools for quantitative researchers at Bridgewater, one of the world's largest hedge funds. He has over a decade of experience working with data scientists at advanced enterprises. He holds a BA and MS in computer science from Harvard.
