Docker, but for Data

March 8, 2018

Aneesh Karve, Co-founder and CTO of Quilt, visited the Domino MeetUp to discuss the evolution of data infrastructure. This blog post provides a session summary, video, and transcript of the presentation. Karve is also the author of “Reproducible Machine Learning with Jupyter and Quilt”.

Session Summary

Aneesh Karve, co-founder and CTO of Quilt Data, discussed the inevitability of data being managed like source code. Versions, packages, and compilation are essential elements of software engineering. Why, he asks, would we do data engineering without those elements? Karve delved into Quilt’s open source efforts to create a cross-platform data registry that stores, tracks, and marshals data to and from memory. He further elaborated on ecosystem technologies like Apache Arrow, Presto DB, Jupyter, and Hive.

A few highlights from the session include:

  • Data packages represent “a deployable, versioned, reusable, element of data”
  • “We looked at the trajectory of compute in the last three years and just saw this vertical rise of containers as a format that not only brought standardization, but costs came way down, and productivity went way up. And the next thing you think about, from a virtualization perspective, is data. Can we create something like Docker for data?”
  • If you have code dependencies, you `pip install -r requirements.txt`, and then you `import foo`, `import bar` and get to work. Data dependency injection is much, much messier. You generally have a lot of data prep code where you’re acquiring files from the network, you’re writing those files to disk, then you have custom parsing routines, and then you run your data cleaning scripts.
  • Having learned from the Facebook flux pattern, looking at React, immutability provides a lot of predictability.
  • The separation of compute and storage, as seen in technologies like Presto DB.
  • Instead of measuring how much data you’re collecting, measure how quickly you can get data into code. That’s called return on data.
  • Given the ubiquity of linters and syntax checkers for code, Karve advocates linters for data to ensure that data conforms to a specific profile.
  • DDL, data description language, is a bridge between the Quilt compiler and the Hive ecosystem.
  • If you have Drill, pandas, Cassandra, Kudu, HBase, Parquet, and all these nodes are really well optimized, they’re high performance. Yet, where do you actually spend your time? Karve indicates that you spend 80% of your time shuffling the data around, you lose all those performance optimizations, and you also have a lot of denormalized, repeated logic.

Video

Video Transcript of Presentation

Nick Elprin:

Alright, thanks a lot for coming to this and we’ll do more happy hour, socializing, and drinking, after this.

I finally met Aneesh Karve a couple weeks ago at Strata. We had a great conversation. Before that actually, I had at least 3-4 VCs I know reach out to me, and say, “Hey, have you heard of this company named Quilt? We’ve met them, I’m super impressed, you really should try to meet these guys.” Everybody was busy so, it’s sort of funny that the first time that we finally end up meeting was actually in New York, but I got a chance to talk to him and see more about what they are doing and I’m super impressed with the product they’ve built, but also just the way they think about the space, about the problems they’re trying to solve, and also thinking about the way they go about the core of the business.

So, Aneesh … I don’t remember if it was his idea or whose idea it was, but somehow we came upon this idea: it’d be great to expose more people here to what they’re doing and to begin to have more of a dialogue, as this company has been doing complementary things and we share a lot of the same principles. Really glad that he was able to come and talk to you guys, given that he’s building a company just like we are and so we all know how busy everyone is doing that.

Please welcome Aneesh Karve, co-founder and CTO at Quilt. Before this he has done a bunch of stuff both at NVIDIA and more recently at Matterport. I think it will be really interesting. Thanks.

Aneesh Karve:

Thanks a lot for the introduction. Really happy to be here. I think you can all be really proud of what you’re building. We were talking to Nick a little bit about the growth stage that you’re at as a company. It really proves–we come out of Y Combinator–that you’ve built something that people want. Congratulations on all of that.

My goal in today’s talk is to give you a sense of how we’re thinking about the data science and the data engineering landscape. To talk to you about some of the open source projects that we’re putting out there, and just really to get people excited about what we’re doing with data packages and what we’re doing with deploying data and just start a conversation, at multiple different levels. It can be at the open source level, it could be at the co-marketing level … I don’t know if Ann’s in the room? Hi, Ann! Nice to meet you.

We’ve spent some time talking about Jupyter notebooks and how to make Jupyter notebooks reproducible, and how to do dependency injection for that.

I’m happy to take questions during the talk, whenever something is confusing or if you’d like me to drill into something, please raise your hand and let’s have a talk about it.

At a very high level, what we are creating we think of as “Docker for data.” And we looked at the trajectory of compute in the last three years and just saw this vertical rise of containers as a format that not only brought standardization, but costs came way down and productivity went way up. And the inevitability that we’re really betting on–the same way that compute has been standardized–the next thing you think about, from a virtualization perspective, is data.

The question is now, how do we get down to a deployable, versioned, reproducible, reusable, element of data? Our proposal and what we’re developing in our open source projects is something called the “data package.” At a very high level, a data package is actual data. We have a data compiler, which will ingest files, ingest database tables, and will then virtualize that data. The package is essentially all of the dependencies that you need to completely give an analysis.

You can put images in it, you can put structured data in it, you can put semi-structured data in it and that package will then wrap and version all of those dependencies. And will give that to you in a consistent way across a variety of platforms.

There are a couple of important characteristics that we’ve found from making packages work well in the auditability, compliance, and reproducibility cases. The way that we think about it, a package is immutable. One of the things that we see really often in working with data scientists is that you can run the same SQL query 5 seconds apart and get totally different results. And one of the key things that we learned from the Facebook flux pattern, looking at React, is that immutability provides you a lot of predictability. So we actually take the data and snapshot it. And not only is that archival data valuable when you have something like an audit, but you guarantee that when you run against the same tophash (or the same hash) for a given package, you know that data scientists are looking at the same results.

If you’re working with any kind of data warehouse or any kind of database, you don’t get that guarantee because that SQL query isn’t necessarily idempotent because there are transactions happening against the database and the same query may give different results.

The package at its core is virtualized data, which I mentioned earlier, that we serialize the data to Parquet. And we use the Apache Arrow libraries, which I will talk about in just a moment, to get that out of the platform-specific formats, like data frames. I’ll show you that in a moment. The metadata is essentially a hash tree, which tracks all the fragments; and the package is essentially a tree of data. You can think of the data package as a miniaturized file system, which I’ll show you how it works in just a second.
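To make the hash tree idea concrete, here is a minimal sketch in Python. It is not Quilt's actual metadata format; the function names `fragment_hash` and `top_hash` and the sample package contents are assumptions made for illustration. Leaves are hashed by content, and each interior node is hashed over its children's names and hashes, so the same contents always yield the same top hash.

```python
import hashlib
import json


def fragment_hash(data: bytes) -> str:
    """Content hash of one serialized fragment (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


def top_hash(tree: dict) -> str:
    """Hash a nested dict of {name: bytes | subtree} into a single digest.

    Leaves are hashed by content; interior nodes are hashed over the sorted
    (name, child hash) pairs, so the result does not depend on insertion order.
    """
    entries = []
    for name in sorted(tree):
        child = tree[name]
        if isinstance(child, dict):
            child_hash = top_hash(child)
        else:
            child_hash = fragment_hash(child)
        entries.append([name, child_hash])
    return hashlib.sha256(json.dumps(entries).encode("utf-8")).hexdigest()


# Illustrative package contents, not real data.
package = {
    "README": b"Sales data for the demo",
    "transactions": b"row1,row2,row3",
}
same_package = {
    "transactions": b"row1,row2,row3",
    "README": b"Sales data for the demo",
}
changed_package = dict(package, transactions=b"row1,row2,row3,row4")

print(top_hash(package) == top_hash(same_package))     # same contents
print(top_hash(package) == top_hash(changed_package))  # one fragment changed
```

Because the top hash depends only on content, two people who install the same hash are guaranteed to be looking at the same bytes, which is the predictability that immutability buys.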

And packaged data lives without burdening any compute resources. One of the things we’re really betting on is the separation of compute and storage, as you see in things like Presto DB. Unlike something like Redshift, where you have to scale compute and storage simultaneously, we believe that data should just live at rest and then we should dynamically spin up compute resources as needed. I’ll show you how that works in just a second.

At a very high level, I want to talk about the lifecycle of a package, and some of the fundamental abstractions, and then I’ll go right into a demo.

There are only three primitives that matter. And we borrowed heavily from existing package managers of code, and from the Docker paradigm. So the first thing is you actually build, the Quilt compiler is what actually builds the package. That package, when you push it, is then trafficked to a registry, which manages permissions, talks to Blob storage, and knows where all the data lives. Then any user who wants to materialize and use that data goes through an install. Then the actual dependency injection takes place in the very last step, which is in the import.

So build, push, and install is kind of like the triumvirate or the interesting set of primitives that you use to build and that packages live through, and then import is when you actually consume the data.
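The lifecycle can be sketched as a toy model in Python. This is not the Quilt client or registry; the `Registry` class, the `build` function, and the dictionary standing in for the local package store are all hypothetical, and only the four steps (build, push, install, import) come from the talk.

```python
import hashlib


class Registry:
    """Toy stand-in for a package registry: maps 'owner/name' to versions."""

    def __init__(self):
        self._packages = {}

    def push(self, handle, package):
        # Every push appends a new version; older versions are kept.
        self._packages.setdefault(handle, []).append(package)

    def install(self, handle):
        # Return the most recently pushed version.
        return self._packages[handle][-1]


def build(files):
    """'Compile' raw files into a package record with a content hash."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(files[name])
    return {"hash": digest.hexdigest(), "nodes": dict(files)}


registry = Registry()
pkg = build({"transactions": b"date,sales\n2018-01-01,100\n"})    # build
registry.push("akarve/sales", pkg)                                 # push
local_store = {"akarve/sales": registry.install("akarve/sales")}   # install
sales = local_store["akarve/sales"]["nodes"]                       # import
print(sales["transactions"].decode("utf-8"))
```

The point of the split is that build, push, and install move the package around, while the consumer only ever touches the final step.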

I’ll show you the impact on the ground. So on the left-hand side–these are two different notebooks, it’s not important if you can read the code–what is important is that we think about the difference between how clean dependency injection is on the code side of things, right. So let’s just think about Python and Jupyter for a second. If you have code dependencies, you pip install -r requirements.txt, and then you import foo, import bar and you start using it. Data dependency injection is much, much messier. You have this, generally, a lot of data prep code where you’re acquiring files from the network, you’re writing those files to disk, then you have custom parsing routines, and then you run your data cleaning scripts.

The importance, the significance, and the change we want to see in the world is, instead of people focusing their code, their analysis code, on doing data prep, they should be able to get right to the actual data and start their analysis.

I’m going to show you how we go accomplish this abbreviation here in the demo.

Alright, so let me show you, on the command line, how Quilt works. All the commands I’m going to show you also work directly in Python. You can run these same commands in a Jupyter notebook in a Python native way. To me, it’s most intuitive to show you how this works on the command line because that’s where we use Docker.

The first thing I’m going to do is show you how we inject data into code. So I’ll just fire up IPython. First of all, can everybody read this? Let me blow this up. Is that moderately legible? Interesting, let’s do that. Can you read that on there?

Okay, great. So let’s fire up IPython … well, let’s do something actually with Shell first.

So, first thing I’m going to do is a quilt ls. This will show me all of the packages that are currently resident on this machine. So the same way if I could do a pip list or an npm list, I’ll see which dependencies I have locally. This will give me a sense of what I have. This is essentially the sandbox that we have to play in and now I will fire up IPython and it will do the following.

So, I’ll say from quilt.data.akarve import sales. This really expresses how packages are identified. And that is … you have the author and then you have a handle for the package.

Everything is uniquely attributed to a person and that package has a handle. When I do this import, I now get this virtualized file system that I was talking about here. I’m not doing any disk I/O. No data has been loaded just yet. I can see there are two nodes in this package, README and transactions. If I peek at this transactions item, I’ll see that it’s a DataNode. That becomes interesting when I now add parentheses or add ._data() to this. That will actually do the disk I/O. What I have here is a standard pandas data frame, which is [un]marshalled out through the Apache Arrow libraries.
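The deferred loading described here can be illustrated with a small Python sketch. The `LazyNode` class and the `load_transactions` loader are hypothetical stand-ins, not Quilt's DataNode, and the loader parses an inline CSV string rather than reading Parquet from disk. The idea shown is only that browsing the tree costs nothing and the I/O happens when the node is called.

```python
import csv
import io


class LazyNode:
    """Holds a reference to data and only parses it when called."""

    def __init__(self, loader):
        self._loader = loader
        self._cache = None
        self.loads = 0  # counts how many times the real load happened

    def __call__(self):
        if self._cache is None:
            self._cache = self._loader()
            self.loads += 1
        return self._cache


def load_transactions():
    # Stand-in for reading a serialized fragment from disk.
    raw = "date,sales\n2018-01-01,100\n2018-01-02,250\n"
    return list(csv.DictReader(io.StringIO(raw)))


transactions = LazyNode(load_transactions)
print(transactions.loads)    # nothing has been read yet
rows = transactions()        # first call performs the load
rows_again = transactions()  # second call is served from the cache
print(transactions.loads, rows[0]["sales"])
```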

But the key thing here is that any user or consumer of this package now … instead of saying, “Oh, let me do a pandas.read_excel, let me download this excel file, let me figure out what’s what,” they’re able to import their data dependency and just start working.

The key thing here is not to measure how much data you’re collecting, but measure how quickly you get that data to code, which is the thing we really care about. So that’s the dependency injection cycle.

I want to go a little bit more deeply into how packages are actually created. We try to automate this as much as possible. In the source directory here, I have all of my source files. This could be any number of dependencies, thousands of images, dozens of CSVs, it doesn’t really matter. I have a single dependency here and you can see it’s an Excel file.

The first thing I’m going to do is quilt generate against that directory. And you’ll see, if I now look in the source file, you’ll see it’s generated a “build.yml” file. So YAML is essentially the glue that we use which tells the compiler … So, notably, this is auto-generated, I didn’t have to write this file, but the important thing here is that the YAML file tells the package manager how to convert source files into in-memory data structures. So this is exactly … It’s a little abbreviated but remember, if I went and called this “transactions,” which is what we called it when we imported the data, you can see that you’re now affecting the directory structure that the user actually traverses when they’re using that data. And I have a more finished build YAML file, which I want to show you here because it’s more interesting.
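As a rough illustration of what a generate step does, the Python sketch below maps source file paths to a nested tree of named nodes. It does not reproduce the real build.yml schema; the `generate_manifest` and `node_name` functions, the `contents` and `file` keys, and the sample paths are assumptions chosen for the example.

```python
import re
from pathlib import PurePosixPath


def node_name(text):
    """Turn a file or folder name into a name usable as a Python attribute."""
    name = re.sub(r"\W", "_", text)
    return name if not name[:1].isdigit() else "n" + name


def generate_manifest(paths):
    """Map relative file paths to a nested tree of named nodes."""
    contents = {}
    for path in paths:
        parts = PurePosixPath(path).parts
        node = contents
        for folder in parts[:-1]:
            node = node.setdefault(node_name(folder), {})
        node[node_name(PurePosixPath(path).stem)] = {"file": path}
    return {"contents": contents}


# Hypothetical source directory listing.
manifest = generate_manifest(["transactions.xlsx", "images/store front.png"])
print(manifest)
```

Renaming a node in the manifest changes the path the consumer traverses after import, which is the effect described above for “transactions.”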

Some of these cases that we think a lot about are A) I mentioned auditing, which I’ll show you how that works in just a second, B) compliance and then, C) measuring things like model drift. One of the things that becomes important, one of the things that packages are really useful for, is that packaging is a great choke point for applying constraints around the desired shape or the desired profile of the data.

I’m gonna give you an example here. When you have a machine learning model, for instance, that falls down in production and you have a drop in accuracy, it’s usually because something about the shape of the data has changed. So what this check is actually saying here is that I want to assert, at package construction time, that the mean of the “sales” column should be less than some certain number. So now, if I want to essentially check whether or not my data is in shape, or check whether or not my data is in profile, I simply run quilt build on that YAML file and you’ll see that our checks have passed.

Why is this important? You can use this for things like detecting social security numbers … How many people actually work in compliance? So there are things like OFACs, where you’re trying to blacklist certain vendors. And the important thing here isn’t that the code to actually do these checks is difficult, it’s that the package and the time of packaging is a great time to enforce the integrity constraints that we want.

And where we got this idea from is linting. Every single person who writes code, they’re using linters, you’re using syntax checkers day and night. So we should have an equal set of asserts for data that allows us to assert what profile we want the data to be in.

To give you an example of what that looks like when it fails: I modify this build.yml file and say that the mean needs to be less than 1500. When I run this check, it will go out and load the data, and then you’ll see that the integrity check fails and the corresponding build process fails. So that’s just a way to inject integrity constraints around the point of conception, or the point of creation, of your data.
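A build-time check of this kind can be sketched in a few lines of Python. This is not how Quilt expresses checks in build.yml; the `check_mean_below` function, the `CheckFailed` exception, and the sample rows are hypothetical. Only the idea (assert that the mean of the “sales” column is below a limit, and fail the build otherwise) comes from the demo.

```python
from statistics import mean


class CheckFailed(Exception):
    """Raised when the data does not match the expected profile."""


def check_mean_below(rows, column, limit):
    """Fail unless the column mean is strictly below the limit."""
    observed = mean(float(row[column]) for row in rows)
    if not observed < limit:
        raise CheckFailed(f"mean({column}) = {observed:.2f}, expected < {limit}")
    return observed


# Illustrative rows; the mean of the sales column is 1700.
rows = [{"sales": 1200}, {"sales": 1800}, {"sales": 2100}]

check_mean_below(rows, "sales", 2000)  # within profile, passes

try:
    check_mean_below(rows, "sales", 1500)  # tightened limit, as in the demo
    build_ok = True
except CheckFailed as err:
    build_ok = False
    print(f"build failed: {err}")
```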

Any questions so far?

Adam:
Creation is important because that’s the chokepoint to get to the shared repository for anyone else to use the data.

Aneesh Karve:
That’s really important. Actually, let me jump ahead to that. So, what Adam mentioned is that creation is important because that’s the chokepoint before people actually consume the data. I’m jumping ahead a little bit, but one of the ways that we think about the workflow that you just saw is that there’s this emerging category, something like a data router or a data plane. The responsibility of the data router is not to be a database and not to be a data warehouse but to modularize both of those pieces.

This is how we see a lot of our users using Quilt today. You have a whole series of upstream data stores. You’re going to have NASs, you’re going to have SANs, you’re going to have databases, you’re going to have data warehouses, you’re going to have ETL pipelines… And all of those things terminate–the last node in the transformation DAG–is the data package.

This guarantees a couple of things. First, it guarantees you predictability because the data flow is unidirectional. You don’t have data scientists just snooping around in the data warehouse and looking at tables that aren’t documented. The registry or the data router becomes the way that data engineers can enforce the integrity constraints they want around the data and you get this predictable unidirectional flow, which starts at the ingest and then ends where the data scientists, the analysts, the decision makers in the organization want to consume data. Does that make sense? Okay, cool.

At a very high level, all these components are open-source, except for the catalog, but we’re getting ready to open-source that now. I know the text is a little bit small. I’m going to work in the primitives that we talk about (build, push and install) in the architecture.

The data compiler’s job is to actually build packages. Currently we support Python, we’re working to support R and Spark as well. Those are the three languages that we care about the most. That’s on the compilation side of things. You’ll see, when you have a push or install event, that’s simply the compiler or the client talking to the registry. And again, the registry is actually doing the storage management, permission management. Then the catalog talks to the registry. Let me just show you exactly what this looks like.

Let’s take an example here. These are all packages that were created by Quilt users. This guy here is a professor at UIUC and he’s posted a machine learning data package on Quilt. A couple important things. The package actually has a handle associated with it. One of the neat things we found is that, unlike people putting data in a Dropbox folder or just dumping data into an Oracle database, people actually start to document things when the data’s attributed.

Here he’s documenting the use of the dataset. Then the compiler, as part of its profiling pass, has automatically generated a manifest of the contents. And I’ll talk more deeply in just a moment about where we want to go with that (generating Hive DDLs) so that we can now search through packages using things like PrestoDB.

Importantly, you see one line of code to actually acquire the data, and one line of code to inject it into code. Then I didn’t show you earlier, in the case for auditing, you want to be able to look at the full history of access to the data. And at the very least, now what we’ve implemented is that you can see the full hash history for a given data package. If this Terminal were a little bit wider you’d be able to see that nicely. But the important thing here is that now you have control over … At least you have a full log of the hash history.

So that at a very high level is how the catalog works. It’s just an interface to the registry. And, of course, we index a very limited amount of contents now. We want to work on indexing more and I’ll get to that when we talk about Hive DDLs.

Alright. So that at a very high level is the architecture that we’re playing. Some interesting things that the registry does … Managing permissions we already talked about. How many are familiar with “content aware file systems”?

One of the things that we find is that, for instance in a machine learning example–you may have a system where people are using EC2, or Paperspace, or FloydHub or one of these systems–they traffic the same data over and over and over and over and over again. Only a small number of the fragments change. So what we do as part of the hash tree and the metadata is we only traffic the fragments that have changed. If you already installed … If I do, let’s just see, we’ll do a trivial example here. So I clearly already have a sales data package and if I reinstall, quilt install akarve/sales, the fragment manager, well first of all, it will warn me about overwriting, but you’ll see that it’s all no-ops. It’s no-ops because the registry knows they already have the data. So that’s the real key point of content awareness. Just reducing network, reducing disk traffic. When you look at something like AWS, where EBS is really, really expensive to use and you have a hundred virtual machines in a cluster environment to do distributed machine learning, these cost differences become material. ‘Cause you’re simply using less disk and you’re sending less traffic over the network.
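The content-aware transfer can be sketched as follows in Python. This is an illustration of the general technique, not Quilt's fragment manager; the `plan_transfer` function, the `fragment_id` helper, and the sample fragments are assumptions. Each fragment is identified by the hash of its bytes, and only fragments the other side does not already hold are sent.

```python
import hashlib


def fragment_id(data: bytes) -> str:
    """Identify a fragment by the hash of its contents."""
    return hashlib.sha256(data).hexdigest()


def plan_transfer(fragments, already_have):
    """Split fragments into those that must be sent and those that are no-ops."""
    to_send, skipped = {}, []
    for data in fragments:
        fid = fragment_id(data)
        if fid in already_have or fid in to_send:
            skipped.append(fid)
        else:
            to_send[fid] = data
    return to_send, skipped


# Two versions of a package; only the last fragment differs.
v1 = [b"january rows", b"february rows", b"march rows"]
v2 = [b"january rows", b"february rows", b"march rows (corrected)"]

store = set()                         # hashes the receiving side already holds
sent_v1, _ = plan_transfer(v1, store)
store.update(sent_v1)
sent_v2, skipped_v2 = plan_transfer(v2, store)

print(len(sent_v1), len(sent_v2), len(skipped_v2))
```

Installing the first version moves all three fragments; installing the second moves only the one that changed, and the other two are no-ops.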

How many people are familiar with Parquet? Okay, cool. So columnar file format … I think we’re the only startup in the world that’s converting Excel files into Apache Parquet right now. The reason we think that’s valuable is it gets us compression and it gets us performance. That performance is largely on the I/O side of things. You’ll notice, if we go back to the demo, I was able to browse the tree without actually materializing any data or marshaling any data into memory.

And so we can do on-disk slicing. If you’re doing any kind of columnar analytics, like if you just want to look up a date, you want to grab a certain date range, it’s much more efficient to skim through that column, you do a lot less disk I/O. There is also an index at the head of that column that lets you do block skipping.
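Block skipping can be illustrated with a short Python sketch. It is a simplified model, not the Parquet file layout; the `build_blocks` and `range_scan` functions and the block size are assumptions. Each block of a column carries its minimum and maximum, so a range query can discard whole blocks without reading their values.

```python
def build_blocks(values, block_size):
    """Split a column into blocks and record min/max statistics per block."""
    blocks = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        blocks.append({"min": min(chunk), "max": max(chunk), "values": chunk})
    return blocks


def range_scan(blocks, low, high):
    """Return matching values and how many blocks actually had to be read."""
    hits, blocks_read = [], 0
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # statistics prove no value in this block can match
        blocks_read += 1
        hits.extend(v for v in block["values"] if low <= v <= high)
    return hits, blocks_read


# A date column stored as ISO strings, which sort chronologically.
dates = [f"2018-01-{day:02d}" for day in range(1, 31)]
blocks = build_blocks(dates, block_size=10)
hits, blocks_read = range_scan(blocks, "2018-01-12", "2018-01-14")
print(hits, blocks_read, len(blocks))
```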

Does anybody know what things like Parquet are bad at or which kinds of query are slow?

Audience:
Row-based computations.

Aneesh Karve:
Spot on, yes. So where Parquet will hurt you is if you want to do something like a SELECT *, which is like, “Hey, give me all records in all the columns”, you’re gonna spend a lot of time reconciling all those rows.

So the reason columnar data formats compress really well is you can do really good run-length encoding, ’cause if the cardinality of the column is small, you have really great compression and dictionary encoding. Where it hurts is, let’s say you have a table a hundred columns wide and you want a single row, well, you have to go and unpack a hundred columns. It’s an optimization, it isn’t perfect for all opportunities.
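The two encodings mentioned here can be shown on a toy column in Python. This is a sketch of the general techniques, not Parquet's on-disk encoding; the function names and the sample `region` column are made up for the example.

```python
from itertools import groupby


def run_length_encode(column):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(column)]


def dictionary_encode(column):
    """Replace each value with a small integer code into a dictionary."""
    dictionary = {}
    codes = []
    for value in column:
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return list(dictionary), codes


# A low-cardinality column: nine values, only two distinct.
region = ["west"] * 4 + ["east"] * 3 + ["west"] * 2

runs = run_length_encode(region)
values, codes = dictionary_encode(region)
print(runs)
print(values, codes)
```

The fewer distinct values a column has, the shorter the dictionary and the longer the runs, which is why low-cardinality columns compress so well.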

How many people are familiar with PrestoDB? Okay, cool. We’re huge fans. Facebook wrote the original version of Hive SQL, which was kind of like SQL on top of HDFS. So Presto is its successor. We’ve been super impressed with what it does. The reason it’s significant to Quilt is that, when you build a package archive, you’re essentially building up a data lake of Parquet fragments. The question now becomes, “Well, many data packages, many datasets aren’t installable onto an individual machine, how do you rip through that?” And Presto is kind of our leading candidate for how we do that.

The magic of Presto is actually streaming join, so it tries to land as little to disk as possible. It’s really good … Its SQL compliance is fairly strong. If you have a small table and you want to join it to a large table, it will do really well.

Does anybody know where Presto falls down on its face? So remember, it’s streaming, if you want to join two large tables, you’re kind of in trouble. You need more specific techniques, like distributed hash joins.
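The small-table-to-large-table case can be sketched in Python as a streaming hash join. This shows the general idea only and makes no claim about how Presto implements joins; the `streaming_join` function and the sample tables are hypothetical. The small side is held in memory as a lookup, and the large side is consumed one row at a time without ever being materialized.

```python
def streaming_join(small_rows, large_rows, key):
    """Join a small in-memory table against a large stream of rows."""
    lookup = {}
    for row in small_rows:
        lookup.setdefault(row[key], []).append(row)
    for row in large_rows:  # the large side is never held in memory
        for match in lookup.get(row[key], []):
            yield {**match, **row}


stores = [
    {"store_id": 1, "city": "Seattle"},
    {"store_id": 2, "city": "Austin"},
]


def sales_stream():
    """Stand-in for a large table read row by row."""
    for i in range(6):
        yield {"store_id": i % 3, "amount": 10 * i}


joined = list(streaming_join(stores, sales_stream(), "store_id"))
print(len(joined), joined[0])
```

The memory cost is proportional to the small table. When both sides are large, the lookup no longer fits on one machine, which is where the trouble described above comes from.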

The thing we love about it though is the separation of compute and storage. I can have this huge data lake in S3 and it’s costing me nothing. Compare that with something like Redshift where, as I scale storage, I have to scale compute, and those things are welded together. Here, you can just build up this huge data lake and worry about how to rip through it later. And it’s all schema-on-read. And I’ll show you … Okay, great.

So the secret sauce of how Presto works is it’s designed to query on top of files. So there’s no … I mean, you can index or partition in some sense, but there’s no standing database resource, it’s designed to perform schema-on-read. Here’s how it works: so I mentioned this concept of a DDL, data description language, earlier, which is kind of a bridge to how the Quilt compiler will talk to the Hive ecosystem. The key thing here is that a DDL is actually not that complicated. So when you create a SQL table in Presto, all you’re doing is declaring the column types, which has to do with the profile of the data. You’ve got bigints, you’ve got varchars, you can even parse things like JSON, you can have arrays … And the DDL will become significant in a moment, but the first thing here is that this is the bridge between data that just lives in blob storage and actually being able to query it.

And if you notice, if you remember earlier, if you think about the Quilt compilation pass, we’re already profiling the data. And so the next step, one of the things we want to get to, is to actually write the DDL at the time when you compile it. This means two things. First of all, it means that all the data in this kind of blobby, S3, Parquet data lake is queryable. The second thing that it means is that HDFS isn’t really well permissioned. If you use Hive, your data scientist would just go in there and see everything. Some data engineers don’t want that.
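Writing a DDL from a profiling pass could look roughly like the Python sketch below. It is an assumption-laden illustration, not the Quilt compiler: the `infer_columns` and `make_ddl` functions, the type mapping, the table name, and the `s3://example-bucket/...` location are all hypothetical, and the emitted statement follows the general Hive-style `CREATE EXTERNAL TABLE` shape rather than any output Quilt is known to produce.

```python
# Hypothetical mapping from Python value types to Hive-style column types.
TYPE_MAP = {int: "bigint", float: "double", str: "string", bool: "boolean"}


def infer_columns(rows):
    """Profile rows and record a column type from the first value seen."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, TYPE_MAP[type(value)])
    return columns


def make_ddl(table, rows, location):
    """Emit a Hive-style external table declaration over Parquet files."""
    columns = infer_columns(rows)
    body = ",\n".join(f"  {name} {sql_type}" for name, sql_type in columns.items())
    return (
        f"CREATE EXTERNAL TABLE {table} (\n{body}\n)\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


rows = [{"date": "2018-01-01", "sales": 100, "discount": 0.1}]
ddl = make_ddl("transactions", rows, "s3://example-bucket/akarve/sales/")
print(ddl)
```

Since the compiler already knows the column types after profiling, emitting the declaration is mostly a formatting step.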

The last thing I want to talk about, I’m almost done, is Apache Arrow. How many people heard of Wes McKinney? Okay, so Wes wrote pandas. He’s almost single-handedly responsible for making Python a serious data science language. And he wants to … You can get away with about a one gigabyte data frame in pandas today, he wants to get that to ten gigabytes. Arrow is gonna be part of the magic of what makes that happen. I’m going to talk about why Arrow is important to Quilt and just how it works in general.

It’s designed for columnar in-memory analytics. The first really important thing it gives Quilt is that, as we expand to other platforms, we wanna make sure that we have absolute hash integrity which is not subject to endian-ness. We want to make sure that we have a platform invariant way of generating the hash of a given dataset or given data package. Arrow is the way that we do that because, if we hash the Arrow data frame, that data frame is going to be the same on a variety of platforms. The other thing which we use Arrow for, there’s this term, “SerDes,” which is serialization and deserialization. You’ve got all this Parquet sitting on-disk, we wanna suck that up into this pandas data frame, so the deserialization that actually reads that data and unmarshals it into memory is all through Arrow.
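The endianness concern can be demonstrated with the Python standard library alone. This sketch does not use Arrow; the `hash_int_column` and `canonical_hash` functions are hypothetical, and choosing little-endian as the canonical layout is an assumption for the example. It shows that hashing the same integers in two byte orders gives different digests, and that fixing one byte order before hashing removes the platform dependence.

```python
import hashlib
import struct


def hash_int_column(values, byte_order):
    """Hash 64-bit integers packed with an explicit byte order ('<' or '>')."""
    packed = struct.pack(f"{byte_order}{len(values)}q", *values)
    return hashlib.sha256(packed).hexdigest()


def canonical_hash(values):
    """Always hash the little-endian layout, whatever the host machine uses."""
    return hash_int_column(values, "<")


sales = [100, 250, 75]
little = hash_int_column(sales, "<")
big = hash_int_column(sales, ">")

print(little == big)                    # raw layouts disagree
print(canonical_hash(sales) == little)  # canonical form is stable
```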

One of the good Twitter conversations I had with Wes … How many people touch machine learning? I guess everybody here. Well, maybe not. Okay. So who’s familiar with this concept of feature engineering? Okay. So the idea behind feature engineering is like, “Hey, I’ve got this pile of data, how do I subset that down into a set of features that an algorithm can digest?” One of the things that Arrow will make possible is we think most of the core feature engineering will happen in data frames. But does anybody see the problem? So let’s say I do all my feature engineering in data frames, now I need to train a classifier, what’s my problem? Can I give … Can I be like, “Hi TensorFlow, here’s a data frame.” TensorFlow’s going to be like, “No, go away.”

So the idea here, what Arrow will make possible, is that we can dynamically reshape that data frame. We’ve got this high performance in-memory data frame where data scientists are really comfortable working with it, but now we need to inject that into a classifier. We can be like, boom, here’s a Tensor, boom, here’s a file, boom, here’s a Protobuf. Whatever the format is that’s important. Whatever the native language or native format is to your platform.

This is a huge part of what we want to do with data packages: insulate data scientists from the concern of how their data was stored. That’s the key thing that packages do, is virtualize that data.

The other thing it gives you is zero copy reads. So you’ve all seen this diagram before. So you’ve got Drill, you’ve got pandas, you’ve got Cassandra, Kudu, HBase, Parquet, and all these nodes are really well optimized, they’re high performance. Where do you actually spend your time? You spend 80% of your time shuffling the data around and you lose all those performance optimizations. And you also have a lot of denormalized, repeated logic.

The idea behind Arrow is that we will have this single columnar in-memory format that then we can marshal out with zero copy because the memory format is the same, no matter which platform you’re on, to a variety of different platforms. Quilt is in some sense a service that wraps Arrow and does the cataloging, does the computation.

Let’s see. So a couple of the high level things, and hopefully this will be more interesting for the marketing people in the room, and the people who think about what the value drivers are, for a platform like Domino.

One of our surprising insights, or one of the things we believe that nobody else does, is that you should manage data like source code; that’s what packaging data is all about. You couldn’t do source code management without versioning and packaging. Why do we try and do data management without those things?

And, in that … One of the things Nick and I talked about at Strata was this concept of “return on data.” We know what “return on investment” is. So people are collecting all this data, how much of it are they actually using? So return on data, we break down into five high-level benefits and then I back those out into actual technical features we have on the platform.

So, discoverability, right? So that’s compiling the data and cataloging it. Reproducibility comes from versioning and packaging. And now, when I want to inject a dependency … Let me pull up Jupyter. So let’s say I wanna depend on 300 files or 300 data frames. I mean, this is my dependency injection. That’s it, it’s a one-liner. And now I automatically have all the dependencies that I need for my analysis.

Auditability comes from logging, quilt log. Security is, again, this model here, where you have this data router or data hub. A lot of the services that are upstream of the hub are security model agnostic; Hive is a great example. So you get to enforce permissions at the registry layer. And compliance, like we looked at linting. You might want to ensure there are no social security numbers, you want to make sure you’re not working with any blacklisted vendors. There’s a whole series of hooks that we can provide in managing data like code.

Couple of interesting insights and then I’ll take questions.

The people who benefit from using Quilt, or from using data packages, are data scientists. They also don’t care about Quilt. One of the things we found is that data scientists are not systems engineers, they’re not software engineers, and even though they’re the beneficiaries of the efficiency improvements, they’re too busy trying to get the science to work. So time and time again, we’ve seen that it’s the data engineers who actually think about software engineering and code quality in systems, and they’re the ones that bring the infrastructure goodness to the data scientists.

I showed you earlier how you get better documentation when you have this collaborative semantic, where all the data is attributed and findable by the person sitting next to you. Part of the inevitability that we’re betting on, why there’s a standard unit for compute and why there must be a standard unit for data, is that the behaviors are already in place. People are already building, pushing, and deploying code everyday. There should be this data equivalent happening.

We’ve also heard from customers that Git is an undesirable place to put their data. There are a couple of reasons for that. Even if you use Git LFS, it doesn’t handle serialization. And all the prior art, if you just copy the data and just copy the files, you’re not doing all of the work, you’re missing everything that matters. And we’ve seen performance concerns around Git and Git LFS. And the biggest thing is you wanna be able to articulate your code and your data, independently, you should have separate hashes for those, you generally want different histories for those as well.

And then, of course, this idea of the data router or data plane category. If you check out the HortonWorks blog, they’re talking about that as well.

So a little bit about the future. Generating data descriptors at compilation time, so that we can support the second one, which is server-side queries to work with enormous packages. The top five technologies we wanna work with are Jupyter, I showed you a little bit, I’m happy to show you more about how Quilt makes data dependencies for Jupyter more manageable, more reproducible. TensorFlow for machine learning. Airflow, again, in that diagram where if you had your data router, Airflow is doing a lot of the transformations that lead into packaging. Presto and Docker. So, again, we’re Docker for data, not data for Docker, but it should be easier for the average user to spin up a VM and just get their working set of data. And if you actually try to do that today, just open an EC2 container, you’ll be like, “Oh, let me go into this folder and SSH this set of directories.” It’s really not … It should get a lot better. And then bringing cross-platform support.

So I’m quite sure I talked about a lot. I’d love to hear about what’s interesting. I’d love to hear about how this relates, or not, to the work you’re doing here and I’d be happy to take any questions.

This transcript has been edited to improve readability.
