
    Reproducible Machine Learning with Jupyter and Quilt

    on December 19, 2017

    In this guest blog post, Aneesh Karve, Co-founder and CTO of Quilt, demonstrates how Quilt works in conjunction with Domino's Reproducibility Engine to make Jupyter notebooks portable and reproducible for machine learning.


    Jupyter notebooks document the interaction of code and data. Code dependencies are simple to express:

    import numpy as np
    import pandas as pd

    Data dependencies, on the other hand, are messier: custom scripts acquire files from the network, parse files in a variety of formats, populate data structures, and wrangle data. As a result, reproducing data dependencies across machines, across collaborators, and over time can be a challenge. Domino's Reproducibility Engine meets this challenge by assembling code, data, and models into a unified hub.

    We can think of reproducible machine learning as an equation in three variables:

    code  +  data  +  model  =  reproducible machine learning

    The open source community has produced strong support for reproducing the first variable, code. Tools like git, pip, and Docker ensure that code is versioned and uniformly executable. Data, however, poses entirely different challenges: it is larger than code, comes in a variety of formats, and must be written to disk and read into memory efficiently. In this article, we'll explore an open source data router, Quilt, that versions and marshals data. Quilt does for data what pip does for code: it packages data into reusable, versioned building blocks that are accessible in Python.

    In the next section, we'll set up Quilt to work with Jupyter. Then we'll work through an example that reproduces a random forest classifier.

    Launch a Jupyter notebook with Quilt

    In order to access Quilt, Domino cloud users can select the "Default 2017-02 + Quilt" Compute environment in Project settings. Alternatively, add the following lines to requirements.txt under Files:

    quilt==2.8.0
    scikit-learn==0.19.1

    Next, launch a Jupyter Workspace and open a Jupyter notebook with Python.

    Quilt packages for machine learning

    Let's build a machine learning model with data from Wes McKinney's Python for Data Analysis, 2nd Edition. The old way of accessing this data was to clone Wes' git repository, navigate folders, inspect files, determine formats, parse files, and then load the parsed data into Python.

    With Quilt the process is simpler:

    import quilt
    quilt.install("akarve/pydata_book/titanic", tag="features", force=True)
    # Python versions prior to 2.7.9 will display an SNIMissingWarning

    The above code materializes the data from the "titanic" folder of the akarve/pydata_book package. We use the "features" tag to fetch a specific version of the package where a collaborator has done some feature engineering. Each Quilt package has a catalog entry for documentation, a unique hash, and a historical log ($ quilt log akarve/pydata_book).

    We can import data from Wes' book as follows:

    from quilt.data.akarve import pydata_book as pb

    If we evaluate pb.titanic in Jupyter, we'll see that it's a GroupNode that contains DataNodes:

    <GroupNode>
    features
    genderclassmodel
    gendermodel
    model_pkl
    test
    train

    We can access the data in pb.titanic as follows:

    features = pb.titanic.features()
    train = pb.titanic.train()
    trainsub = train[features.values[0]]

    Note the parentheses in the code sample above: they instruct Quilt to load data from disk into memory. Quilt loads tabular data, such as features, as a pandas DataFrame.
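    To illustrate the call semantics, here is a minimal sketch of the lazy-loading idea. The class below is a hypothetical stand-in, not Quilt's actual implementation: the node holds a loader and only materializes data when called.

```python
import pandas as pd

# Hypothetical stand-in for a Quilt data node (not Quilt's real class):
# the node defers I/O until it is called with parentheses.
class LazyDataNode:
    def __init__(self, loader):
        self._loader = loader

    def __call__(self):
        # Calling the node triggers the read from disk into memory
        return self._loader()

# Simulate a tabular node that loads into a pandas DataFrame
features = LazyDataNode(
    lambda: pd.DataFrame({"feature": ["Pclass", "Sex", "Fare"]})
)
df = features()  # the parentheses perform the load
print(type(df))  # <class 'pandas.core.frame.DataFrame'>
```

    Until the node is called, no data is read, which keeps package imports cheap.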

    Let's convert our training data into numpy arrays that are usable in scikit-learn:

    trainvecs = trainsub.values
    trainlabels = train['Survived'].values
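    The same conversion can be seen on a toy frame (hypothetical values, standing in for the Titanic data): .values returns the underlying numpy arrays that scikit-learn estimators expect.

```python
import pandas as pd

# Toy stand-in for the Titanic training frame (hypothetical values)
train = pd.DataFrame({
    "Pclass":   [3, 1, 2, 3],
    "Fare":     [7.25, 71.28, 13.0, 8.05],
    "Survived": [0, 1, 1, 0],
})
feature_cols = ["Pclass", "Fare"]

trainvecs = train[feature_cols].values    # 2-D feature matrix
trainlabels = train["Survived"].values    # 1-D label vector
print(trainvecs.shape, trainlabels.shape)  # (4, 2) (4,)
```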

    Now let's train a random forest classifier on our data, followed by a five-fold cross-validation to measure our accuracy:

    from sklearn.model_selection import cross_val_score as cvs
    from sklearn.ensemble import RandomForestClassifier

    rfc = RandomForestClassifier(max_depth=4, random_state=0)
    rfc.fit(trainvecs, trainlabels)
    scores = cvs(rfc, trainvecs, trainlabels, cv=5)
    scores.mean()

    The model scores 81% mean accuracy. Let's serialize the model.

    from sklearn.externals import joblib
    joblib.dump(rfc, 'model.pkl')
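    A self-contained roundtrip on toy data shows what serialization buys us. (Note: newer scikit-learn releases ship joblib as a standalone package rather than under sklearn.externals, so this sketch imports it directly.)

```python
import joblib
from sklearn.ensemble import RandomForestClassifier

# Train a tiny model on toy data
X = [[0, 0], [1, 1], [0, 1], [1, 0], [1, 2], [2, 1]]
y = [0, 1, 0, 1, 1, 1]
rfc = RandomForestClassifier(n_estimators=10, random_state=0)
rfc.fit(X, y)

# Serialize to disk, then restore
joblib.dump(rfc, "model.pkl")
restored = joblib.load("model.pkl")

# The restored model reproduces the original's predictions exactly
print(list(restored.predict(X)) == list(rfc.predict(X)))  # True
```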

    We can now add the serialized model to a Quilt package so that collaborators can replicate our experiment with both the training data and the trained model. For simplicity, the titanic sub-package already contains our trained random forest model. You can load it as follows:

    from sklearn.externals import joblib
    model = joblib.load(pb.titanic.model_pkl())
    # requires scikit-learn version 0.19.1

    To verify that it's the same model we trained above, repeat the cross-validation:

    scores = cvs(model, trainvecs, trainlabels, cv=5)
    scores.mean()
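    This check works because both the data and the model configuration are fixed, so the scores are deterministic. A self-contained sketch with a bundled dataset (iris standing in here for the Titanic features) shows that repeating the cross-validation yields identical scores:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
rfc = RandomForestClassifier(max_depth=4, random_state=0, n_estimators=10)

# Same data + same model configuration => same scores on every run
scores1 = cross_val_score(rfc, X, y, cv=5)
scores2 = cross_val_score(rfc, X, y, cv=5)
print((scores1 == scores2).all())  # True
```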

    Expressing data dependencies

    Oftentimes a single Jupyter notebook depends on multiple data packages. We can express data dependencies in a quilt.yml as follows:

    packages:
    - uciml/iris
    - asah/mnist
    - akarve/pydata_book/titanic:tag:features

    In spirit, quilt.yml is like requirements.txt, but for data. Because the data live in versioned packages rather than in your repository, your code repository stays small and fast. quilt.yml accompanies your Jupyter notebook files so that anyone who wants to reproduce your notebooks can type quilt install in the terminal and get to work.

    Conclusion

    We demonstrated how Quilt works in conjunction with Domino's Reproducibility Engine to make Jupyter notebooks portable and reproducible for machine learning. Quilt's Community Edition is powered by an open source core. Code contributors are welcome.
