The Real Value of Containers for Data Science

data sciencedocker

Every year, $50 of your taxes is invested in research that can't be reproduced.
Erik Andrejko, VP Science, The Climate Corporation, speaking at Strata+Hadoop World San Jose 2016

Walking around the Strata+Hadoop World expo floor in San Jose this week, it was clear that vendors had caught the container bug. Last year was after all the year of the container. It struck me that container technologies such as Docker were primarily being used for compute infrastructure. And yet the real value for data scientists lies elsewhere.

Vendors were delivering scalable compute and storage, and speakers were discussing deploying Hadoop on a container, or running containers on top of Hadoop. But compute infrastructure only indirectly affects the data scientists' ability to conduct research. While there are efforts to bring containerized scientific R and Python stacks to the desktop, for many end-users this is a non-trivial exercise.

At the conference yesterday, The Climate Corporation's VP of Science, Erik Andrejko spoke about "Putting the science in data science". He emphasized the principle of reproducibility as core to the scientific method. And yet science is facing a crisis: the inability to reproduce research results. A rather startling example: Recent research found that nearly 90% of studies in drug discovery programs were irreproducible (for more on this, see our recent interview with Erik).

This is the real value of containers in data science: the ability to capture an experiment's state (data, code, results, package versions, parameters, etc) at a point in time, making it possible to reproduce an experiment at any stage in the research process.

Reproducibility is critical for quantitative research in regulated environments — for example, documenting the provenance of a lending model to prove it avoids racial bias. Even in non-regulated contexts, however, reproducibility is critical to the success of a data science program, because it helps capture assumptions and biases that may be surfaced later on.

The scientific process is built on cumulative insight. This accumulation may be driven by a group of data scientists collaborating over a period of time, or a lone data scientist building on past experience. This process only works if past results are reproducible.

Containers are an ideal technology for supporting a scientific approach to data science. We’ve made heavy use of Docker within Domino from the beginning, and we’re excited to share more about our experiences and lessons from working with containers in upcoming blog posts.