Achieving Reproducibility with Conda and Domino Environments

Managing “environments” (that is, the set of installed packages, configuration, and so on) is a critical capability of any data science platform. Environment setup wastes time when onboarding people, and configuration differences across environments can undermine reproducibility and collaboration and introduce delays when moving models from development to production.

This post describes how Domino uses Docker to address these environment issues, and more specifically how this approach can integrate with common package management solutions like Anaconda.

Quick Introduction to Compute Environments

Domino Compute Environments let data scientists manage Docker images with arbitrary software and configuration. These environment definitions are shared, centralized, and revisioned — and when Domino runs your code (during model training or for deployment) across its compute grid, your code runs in your environment. This means you get the same environment no matter when you work on your project (even months later), or who’s working on it (e.g., a new colleague who joins your project), or whether it’s being run for development or production purposes.

As an example of the power of Domino environments, an environment definition can install Theano and the OpenCV Python libraries.

Any Python script run in a project with this environment can then use Theano and OpenCV reproducibly.

Using Continuum’s Anaconda inside Domino Compute Environments

Continuum’s conda is one of the package and dependency management tools available for the Python ecosystem. Some of our customers have asked us, “How do Domino and conda work together?” Because Domino environments offer a superset of conda’s functionality, it is straightforward to use conda, or any similar dependency management system, inside Domino while improving its reproducibility and reliability.

Some customers use conda “channels” to manage custom packages. Conda channels allow users to create internal “repositories” of packages, and our customers want to use these channels (instead of, say, pip) to install packages in Domino Compute Environments.

This is an easy use case to address. Because Domino environments are built on Docker images, we can simply install conda in a base Environment. We’ll show this by creating an environment with the following Docker instructions, which we sourced online:

# Install Anaconda3 4.2.0 for Python 3.5 (current as of 1/27/2017).
# The -b flag runs the installer in batch mode, which accepts the license
# without prompting; -p sets the install prefix.
RUN wget -q https://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh -O /tmp/anaconda.sh \
  && bash /tmp/anaconda.sh -b -p /usr/local/anaconda \
  && export PATH=/usr/local/anaconda/bin:$PATH \
  && conda update -q -y --all \
  && ipython kernel install

# Put conda on the PATH for Domino runs, then give the ubuntu user ownership
# of the install directory so that conda install can write to it.
RUN echo 'export PATH=/usr/local/anaconda/bin:${PATH:-}' >> /home/ubuntu/.domino-defaults \
  && chown ubuntu:ubuntu -R /usr/local/anaconda \
  && chown ubuntu:ubuntu -R /home/ubuntu/*

We have built such an environment and made it available to data scientists using our hosted platform: just select it from the environment menu on your project's "settings" page. If you're running a private deployment of Domino behind your firewall, let us know and we can share it with you. With this environment, all of Anaconda’s Python packages are available, including from a terminal: typing conda install opencv uses the conda package manager to handle installation and dependencies.
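For packages that live on an internal channel, the same kind of command can name the channel explicitly. The following is a minimal sketch rather than Domino's own tooling: the channel URL and the install_from_channel helper are hypothetical, while -c and -y are standard conda install flags.

```shell
#!/usr/bin/env bash
# Hypothetical channel URL: replace it with your organization's channel.
CHANNEL_URL="${CHANNEL_URL:-https://conda.example.com/internal}"

# Install one or more packages, searching the given channel first.
#   -c adds the channel for this command only
#   -y answers "yes" to the confirmation prompt
install_from_channel() {
  conda install -y -c "$CHANNEL_URL" "$@"
}

# Usage (requires conda on the PATH):
#   install_from_channel opencv
```

Passing the channel on the command line keeps the choice visible in the environment definition instead of hiding it in a user-level configuration file.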

Adding Dynamic Behaviors to Domino Environments

Domino environments do not have to be fully static: they can include dynamic behaviors while remaining reproducible. This section describes how we enabled dynamic configuration using an advanced conda feature.

One of the more advanced features of conda is the “conda environment,” which gives developers a way to declaratively list all of the packages a project needs in a YAML file. An example can be found in Chris Fonnesbeck’s tutorial for the PyMC3 probabilistic programming environment. That repository contains an environments.yml file, which describes all of the Python requirements needed to run PyMC3 and the example notebooks provided.
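To give a sense of the format, here is a made-up environment file written from the shell. It is not the file from the PyMC3 tutorial: the environment name, channel, and package list are assumptions chosen only to illustrate the layout.

```shell
#!/usr/bin/env bash
# Write an illustrative conda environment file into a scratch directory.
# Everything in the file (name, channel, packages) is hypothetical.
demo_dir="$(mktemp -d)"
cat > "$demo_dir/environments.yml" <<'EOF'
name: example-env
channels:
  - defaults
dependencies:
  - python=3.5
  - numpy
  - scipy
  - matplotlib
EOF

# With conda installed, the environment could then be built with:
#   conda env create --file "$demo_dir/environments.yml"
echo "Wrote $demo_dir/environments.yml"
```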

Defining environments this way is an appealing vision, but unfortunately falls short in practice since packages often depend on OS-level dependencies and system libraries. Domino Environments provide a superset of the functionality of conda environments, because Domino’s Environments are based on Docker images which allow definition of requirements down to the OS level.

To demonstrate the flexibility of Domino environments, we extended the previously built conda Domino Environment to support conda environments. If the Environment is used with a Domino project that contains an environments.yml file, it will:

  1. dynamically discover it
  2. build the conda environment
  3. make it available to the execution engine.

Domino environments have a number of powerful hooks available during the execution lifecycle of a reproducible run. Notably, we can inject code before and after the script executes in order to configure the environment sensibly. Combined with Domino’s injected environment variables, these hooks make it possible to script advanced behaviors such as this one: a short shell script is all it takes to dynamically bootstrap conda environments in a custom Domino environment.
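The three steps listed earlier (discover the file, build the conda environment, make it available) might be scripted roughly as follows. This is a sketch under stated assumptions, not the script Domino actually ships: the DOMINO_WORKING_DIR variable, the CONDA_ENV_NAME and CONDA_BIN overrides, and the project-env name are illustrative, and the conda path matches the install prefix used in the Docker instructions.

```shell
#!/usr/bin/env bash
# Sketch of a pre-run hook that bootstraps a conda environment for a project.
# Assumptions, not taken from the original post: the DOMINO_WORKING_DIR
# variable, the "project-env" name, and the CONDA_BIN override.
PROJECT_DIR="${DOMINO_WORKING_DIR:-$PWD}"
ENV_NAME="${CONDA_ENV_NAME:-project-env}"
CONDA_BIN="${CONDA_BIN:-/usr/local/anaconda/bin/conda}"

# Step 1: discover the environment file. Prints its path, or returns 1.
# conda's own default name is environment.yml, so both spellings are checked.
find_env_file() {
  local dir="$1" name
  for name in environments.yml environment.yml; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Steps 2 and 3: build the environment, then activate it in this shell.
bootstrap_conda_env() {
  local env_file
  if ! env_file="$(find_env_file "$PROJECT_DIR")"; then
    echo "No conda environment file in $PROJECT_DIR; nothing to build."
    return 0
  fi
  "$CONDA_BIN" env create --name "$ENV_NAME" --file "$env_file" || return 1
  # Sourcing bin/activate puts the new environment first on the PATH.
  source "$(dirname "$CONDA_BIN")/activate" "$ENV_NAME"
}

bootstrap_conda_env
```

Projects without an environment file fall through to the base environment, so the same Domino Environment can serve both kinds of project. Whether the activation carries over into the run itself depends on how the platform invokes the hook, so treat that last step as something to verify.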

By leveraging Domino’s recently released enhanced git integration and custom environments, it is possible to use the conda distribution, as well as advanced features such as conda environments, easily and reproducibly.

There is no risk that this Environment won’t work when shared with a colleague, or picked up in the future. Domino environments guarantee the configuration will behave predictably. The old problems of “it worked on my computer” will not surface again.

Leveraging existing building blocks, we created a custom Environment which allowed our customer to leverage the conda distribution and their existing channels infrastructure, but still get all of the advantages of Domino Environments, such as tight control over operating system configuration, and full reproducibility of experiments.

Conclusion

Domino is built on abstractions which we have curated over years of being a visionary in the data science platform space. Domino Environments provide flexibility and control over your packages, while mitigating the risk of package drift or the inability to recreate a stack or environment.

This approach is sufficiently flexible to subsume more specialized package management tools like conda. To that end, we are happy to now provide a managed Environment with conda pre-installed, and we are happy to make this available to customers with Domino deployed behind their firewall.

Domino Environments, along with our scalable compute and reproducibility infrastructure, empower data science teams to experiment and innovate with unparalleled agility. If you’ve been interested in Domino, but were concerned about some specialized requirement or dependency, request a demo to see how flexible and powerful Domino Environments can be.

Banner by Bouwe Brouwer, CC BY-SA 3.0, via Wikimedia Commons.