Wes McKinney, Director of Ursa Labs and creator of the pandas project, presented the keynote "Advancing Data Science Through Open Source" at Rev. McKinney's keynote covered open source's symbiotic relationship with data science and the importance of community-led open source. This blog post includes distilled highlights, the full video, and a transcript of the keynote.
Wes McKinney’s “Advancing Data Science Through Open Source” keynote at Rev provided insights into the development of open source over the past decade, open source's contribution to the modern data science stack, and why community-led open source is at a critical point. McKinney advocates for data scientists, engineers, and organizations to contribute to community-led open source.
A few highlights from the keynote include:
- understanding how code reuse, collaboration, and "permissive licensing" within Python, Hadoop, and Spark open source projects are linked to the growth of modern data science
- delving into "customers" versus individuals and organizations that benefit from open source
- how open source provides additional transparency and increases the expectation that academic, science, and research results be reproducible
- why and how community-led open source projects differ from projects led by a single company or a small consortium of companies
- impacting open source by contributing time, providing developer recognition, or funding developer-oriented contributions
For more insights from this session, watch the video or read through the transcript.
This section provides a written transcript of the keynote presentation as well as the questions and answers that arose after the keynote. This transcript has been edited for readability.
This talk is an important discussion that's been happening in the last two or three years, but it's also very personal for me because I've done many things, but in large part, I've made a career of building open source software. I'm now in my 9th year of building open source. I would say this talk has more questions than answers, but these are the things that I think a great deal about in terms of how we produce open source software and how we can better support open source developers.
A subtitle for the talk would be "Community-led Open Source and You," and I will explain what I mean by community-led open source and we'll dig right in. Having a nearly decade-long perspective on the open source world, it's pretty amazing all that has happened. GitHub is ten years old this year, which is kind of mind-boggling when you think about the way that open source developers collaborated ten years ago. People used to literally send around patch files and then they would send code reviews in the form of comments on patch files in email form. It's become easier than ever to start open source projects, to collaborate, and to make really interesting things happen through the synergy of bringing people together in a very lightweight way on GitHub.
Open source has had a really major role in the development of what we now can call the modern data science stack. It's really interesting to think about what might have been in the field of data science were it not for all of the open source projects that developed in the Python world, the R world, in the Apache Hadoop ecosystem, the Spark World, but it's interesting because one didn't exactly cause the other. Some people are like, "Well, you know, maybe open source projects precipitated the rapid growth of the field of data science or maybe the other way around."
The reality is a great deal more complicated than that.
It was really a symbiotic relationship in that people are collecting and analyzing more and more data. They need software to analyze that data and the open source community miraculously was able to rise to that need at a critical time. A big part of that is the collaboration and code reuse aspect where you can pick your cliché, whether "the best code is the code you don't have to write" or "great artists steal"; I'm sure there are many other such aphorisms about reusing code. The fact that permissive licensing allowed projects to cut and paste code in a safe way, build new projects, and bootstrap themselves off the work of others in a very rapid way helped progress happen very, very fast.
Obviously with the rise of the mobile and social web, companies scrambling to get more people to click on ads and instrument their apps to collect extraordinary amounts of data needed software in order to collect, manage, analyze, and visualize all of that data and produce models to act on it. They just needed to scale so fast that they couldn't wait around for commercial software vendors to build software to solve those problems. That's why you saw Apache Hadoop come out of Yahoo back when Yahoo was still a thing. Facebook was the original creator of Apache Hive for doing SQL on Hadoop. Many of the big internet companies created what became really important open source projects because they had to work really fast to solve the problems that were right in front of them and they felt that releasing those projects as open source would help the progress and innovation happen faster than it would otherwise.
Another factor that I think a lot about is the interplay between closed source licensing models and cloud computing. It used to be that you would have a server room and, whatever is in the server room, you'd have to call up your IT guy and say, "Hey, you know we need to buy some more server racks to increase our ability to run workloads," but with AWS and cloud computing you can very easily go from ten nodes to a hundred or a thousand nodes. Can you imagine if you had to call up the MathWorks or the SAS Institute and buy a bunch of licenses to be able to scale up your analysis in the cloud? Being able to install, of your own free will, as much open source software as you want in a friction-free way was a big factor in driving data science to open source.
Another major factor is just problems with reproducible research results in science. There have been at this point many high-profile cases where there were errors made in analyses, whether it was an Excel error causing some part of the subprime mortgage crisis or different kinds of problems reproducing major results that impacted public policy or impacted decisions in medicine or other areas. In academia and science, in research labs, there's been a push to increase transparency and to make as much of the science reproducible as possible. Whether it's "Here's a Jupyter notebook," which has everything from top to bottom: "Here's how I got the data. Here's the libraries I used to load and preprocess the data. Here's my model. Here's my results and my data visualizations." I do hope we see more and more of that, that there's this social expectation in science that if you want to publish an analysis, where's your Dockerfile? Where are your data files? How can I reproduce your results myself and see maybe how you cherry-picked, or how you have problems with your analysis that make your results less compelling?
Another major trend that's going to affect the growth of open source is the challenge of making open source software work in the enterprise. You could argue this is a major reason why all of us are sitting here in the house of Domino Data Lab, one of many companies who are helping make open source work in big companies. We need all of these tools to enable governance and cooperation and auditing. What if somebody does a bunch of research and then they quit? Where is all their work and how do we keep track of that? There's all these problems that happen in big companies and, in large part, these are not the types of problems that many open source developers would solve on their own in their free time, particularly when you consider where a lot of the open source developers are and what kind of problems they're trying to solve themselves.
With all this in mind, I think a lot and I talk to a lot of companies about just thinking more about where the software comes from. Who builds it? Who are the individuals who wrote the code? Who's maintaining the projects? Who's managing the flow of code from outside contributors, people making pull requests and who decides what code ends up in the master branch? Who's managing releases? Essentially, how is the sausage made? It's one of those things, the more you ... It's kind of like the meat industry. The more you learn about the meat industry, the less you want to eat meat. The more you learn about open source software, well, I'm not sure what would be the right analogy, but the more you learn, the more concerned you get about how it all comes together.
A part of it is who paid for it? If somebody spent time on the project, were they being compensated for that? Was it their nights and weekends? I'm just going to put some air quotes around the paid thing because this ends up being a complex subject. There's many kinds of open source projects. Not every open source project is created equal and I don't know what other right words to describe these projects, but I kind of broadly, and this is an oversimplification, put the projects into two different buckets, industry or corporate-led projects and community-led projects.
Industry or corporate-led projects typically are projects that were started by a single company or perhaps a small consortium of companies. Sometimes the code base started out as a proprietary code base that was open sourced later, whenever there was a market need or maybe an open source project came along that threatened that proprietary project and then they decided to open source it. Maybe the project was conceived as an open source project from the get-go. A great example of this type of project is TensorFlow from Google, where TensorFlow is being used internally at Google as part of Google's plan to rule the world. They felt like having everybody use TensorFlow for machine learning and deep learning and making it easy to use on Google Cloud is part of their development strategy.
Community-led projects are different. Sometimes they have very diverse origin stories. Sometimes they're started by ambitious individuals with an itch to scratch or some kind of vision. I put myself into that category. Some of them are started by people in research labs where they are government funded research labs like scikit-learn for Python has been sponsored by government funded research in France and many other things, but that's where it got started. Matplotlib and a number of other Python projects were sponsored by different government research programs. Sometimes the work has been done in universities. When you look at the R world, the initial R world came out of indirect support from statisticians wanting to build free alternatives to the S programming language, which was proprietary. It's interesting because in any different ecosystem you might have a higher composition of community-led, or projects that came out of organic growth, or projects that were led by a company with a particular agenda.
One of the challenges, and there are many challenges that come up, is when you think about what incentivizes software developers to create high quality software. Open source software projects often will face criticism about the level of professionalism in their projects, whether it's issues with packaging, issues with documentation, issues with testing, or continuous delivery. There's many different problems. I find that in community-led projects, a lot of the kinds of things that people think about in terms of professional enterprise-grade software engineering go by the wayside and the problem there is that the incentive structure is just not there to support that work. When you can choose between adding new features and fixing some critical bugs, there's a lot of different kinds of work that would be really great to do if we had infinite people and infinite resources, but it's work that goes by the wayside.
Whereas if you have a commercial software project or even industry-led ... Let's say you consider TensorFlow versus all of the other competing AI frameworks led by Facebook, Microsoft, the other major tech companies. Google has an incentive with TensorFlow, and Facebook with PyTorch, to make the most professional documentation, the best testing, the software people can put their trust in, but it is not always so easy for a community-led project to do that given its level of resourcing.
Another thing that's complicating this for people who build data science or data processing projects in particular is the way things are going in the world with all the issues around data privacy and just having trust and transparency in the software. More and more, there's the expectation that if you're going to build a data science project, you want all the software that you're using to be open source so that you can, if you need to, look at how it's implemented and understand it.
You can see if your data's being used inappropriately in some way. If you were sending sensitive data into a black box that you don't know anything about, that could create some concerns. So for people like me, people who build open source software for a living, we might like to build more closed source software to be able to generate revenue to support our work, but in many cases people don't want a lot of the software to be closed source or to have a business model associated with it because of these issues around trust and transparency and the freedom to reuse the software and make changes.
No type of project is free of problems. You do see issues in projects that are primarily led by a single company or multiple companies. You will see companies change strategy and basically abandon projects or move the developers to another team. If a startup is building an open source project, you may see that startup get acquired or that startup fail and the project gets similarly abandoned, and there's other problems that can occur there.
The problems that occur in community-led projects are a little bit more insidious and concern issues around developers and maintainers burning out because there isn't enough support to enable them to keep up with the work of building the project. These problems become worse the more popular a project gets. You would think that the more users you have, the more maintainers and core developers you have ... In my experience, this isn't really true. You do get more core developers, but it's not a linear growth. You would think with a project like Pandas, which has millions of users, that we would have 100 core developers or maintainers making the project work, but really it's in the single digits. It's under 10 people who are slaving away making the project work on a week to week, month to month basis.
Part of this is that the circumstances that enable an individual to be a maintainer of a project, or a core developer where they're spending 50% or more of their time, turn out to be very special situations that are difficult to come by. Recently there was a tweet that went viral. This slide's even a little bit out of date, but in analyzing the Pandas project, the NumPy project and the Matplotlib project, if you look at who's actually at the bottom of the funnel, who's managing the code that goes into the project and making the project work, it's a shockingly small number of people.
The caption was "We rely on 15 people to do our science. Without them these projects would basically not be maintained." This is the kind of stuff that keeps us up at night. The long tail of contributors has grown immensely and we've had over 1,000 unique contributors to Pandas alone, but the actual number of people that are really making the project work is shockingly small and it's a number that's not growing very fast.
This is also complicated by the relationship between users and developers, which can also grow quite toxic, particularly the more popular a project becomes. We have found kind of anecdotally that the users stop thinking about the people producing the software as people. They just assume the software exists and it is built to their level of satisfaction. There was an exchange with a user on the Pandas mailing list some time ago and he described himself with the word "customer". I had to kind of stare at it and try not to overreact and I was like, "Well, I'm going to take a break and not reply to this email for a little while." I did end up replying in a more civil tone, but I think my initial reaction was, "Sir, I believe you are mistaken."
There are a number of myths that surround these community-led projects. One is this idea of organic growth and organic contributions, that if you want to see progress and innovation in a project ... Many of you in your companies, with all of these open source projects that you use now, you weren't involved in the creation of them. You started using open source at some point and you're like, "Oh well, what should I use? OK, I'm going to install all of these things," or "install.packages all of these things."
If you want things to get better, maybe all you have to do is wait, and wait for the organic process of open source to make things better, magically spring forth out of the earth. I find the process of building the software is not very random and frankly very repeatable, so the types of people that are gravitating to work in the Python ecosystem or work in the R ecosystem are quite different from the people who were attracted to the ecosystem ten years ago when things were not popular. When there was a feeling of green field development and forging new ground. Building something from nothing. The process that generates software now is quite different.
Another problem that I see is this idea that burnout is an exaggeration or not as big of a problem as it's reported. The idea is that if maintainers or developers burn out, it will be OK, because other maintainers will spring up as part of the random process to replace them. It may be that if some poor maintainers burn out, other people will step up into a leadership role, but more often than not, they would be stepping into a role that was already overburdened and so they will find themselves in the same situation that was causing burnout in the first place. Really what we've got to figure out is how to have five times as many core developers and maintainers in these projects.
Another issue I think comes from the attitude of enterprise software engineering, where engineers are thought of as fungible assets and you would be remiss to plan a software project that has a bus factor of 1, where if this person leaves the company, the whole project is going down the tubes, so we need to be resilient to people entering and leaving a project. It is true that many enterprise projects, compared with open source projects, may be more straightforward, like building a website or building something that has a very well-defined plan on how to build it, whereas open source projects that are building something new and innovative require a different type of engineer to really drive forward.
The sad reality, and this is some data that the company Tidelift collected, is that a lot of open source work today has no funding, or people are being funded maybe 20% of their time to work on community-led projects as part of their work. Even that can be complicated. Another problem here is the breakdown between maintenance work and innovation work in open source projects. Projects start out with this rapid innovation and change and at some point, when they become very popular, the core developers turn into maintainers where they are essentially treading water to maintain the status quo. That also can be a source of overburden and burnout for the developers.
This was brought to the fore a couple of years ago, maybe it was three years ago, when the Ford Foundation wrote this report called “Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure.” It talked about the Heartbleed bug and OpenSSL and a number of other projects. A lot of people were shocked at how under-maintained OpenSSL was, and that such a key component in how the internet works could have shocking bugs causing people's secure information to be exposed to malicious parties on the internet. These were the kinds of bugs that, with more rigorous testing and fuzz testing, could have been identified in the project if sufficient effort were put in.
One thing, I don't hope to convince you of this, but I believe it, is that community-led projects are very important to you. You may use a lot of projects that are produced by one of the major tech companies, but the way community-led projects operate is different and gives you more of an opportunity to impact the roadmap and direction of the project to make meaningful contributions and help drive them forward in a way that's beneficial to your future work.
Developers do try to find ways to fund their work. The major one is to find a corporate benefactor, and that can fail in some ways if that benefactor decides they would rather have you work on a different project and not reward you for your open source work. Developers will also do consulting to support their work, and I've done this as well. It creates a tension between hustling for consulting contracts to pay your bills and spending time on the underlying open source, which is kind of driving the consulting work. My experience has been that you end up not having enough time to focus on the underlying open source project.
I've been very involved in the last couple of years with the Apache Software Foundation. I just became a member earlier this year, which is quite prestigious. I'm a big fan of the Apache Software Foundation because it provides a framework for corporate-led or industry-led projects to bring community governance and community process to projects that might otherwise be dominated by a single organization. Projects that maybe didn't start out as community-led can become that way by adopting the Apache process and joining the foundation.
There's key principles the ASF operates by around how decisions are made in kind of an open and transparent way. We operate on the basis of consensus. We allow people to gain influence in a project through making contributions. There's a very straightforward way if you want to become influential in a particular Apache project. You can do that through making contributions to the project. In many projects, particularly if a project is run by ... If you wanted to go and make major changes in TensorFlow, you would in a lot of cases go work for Google to be able to make major changes to TensorFlow, whereas in a community-led project, if you contribute a great deal and gain influence in the project, you can get a seat at the table and be a first-class developer.
I do worry that maybe we're at this critical point and things could go one way or the other. That maybe all the community-led open source projects will fade away and the new model will be that all of the new open source is produced by the top ten tech companies. It's actually a pretty realistic future possibility that open source developers say, "We can't get anyone to fund our work." Really, it's only Google and Facebook, Microsoft and Netflix and Amazon that are going to be the ones to produce open source software that we use. That prospect does make me a little bit sad. There's the movie Demolition Man from 1993 where all restaurants are now Taco Bell. I hope that doesn't happen to open source software, but I know I'm running out of time, but I'll take maybe two, three more minutes of your time.
There are ways you can help. There are two major ways that you can help make things better for open source developers. The first is giving your time. If you have engineers who have a project that they use, that they want to contribute to, that they're passionate about, give them the time to do that ... Not only give them the time, but reward their work and their contributions to the project equally to their work building projects that are internal. A problem that I've seen is sometimes people say, "Hey, you can work on this open source project, but there's this really important project that's important to the business, so you can make your choice, but this project's going to get you promoted and this open source stuff, well, it's at your own risk." Don't do that.
Another way, if you don't have the engineers to make contributions, is to give money. It turns out giving money to open source projects is complex, and just knowing where to give the money is hard, in the Python world and beyond Python. Starting in the Python world, there's a new organization in the last few years, NumFOCUS, which has provided a financial conduit for donating money to support projects like Pandas, and that's helped us to do quite a lot of things. I've just created a new organization called Ursa Labs to raise money to do development in Apache Arrow oriented at the data science world. I've been working with the Arrow project for the last three years or so. We're at an inflection point where we really need to build a larger, dedicated team to build out better computational infrastructure for data science. I think over the next few years we'll see more organizations like this that provide a more straightforward conduit for money from organizations that have it to teams of people where we can put people to work full time building open source software.
The last thing, since I'm completely out of time and they're going to kick me off the stage, is that there are many success stories and proof that even small amounts of funding, and individuals, can make a great impact. I really love this story from Nathaniel Smith, who's a long-time Python developer. He was funded by Berkeley BIDS for two years to work on NumPy and a number of other Python projects, working by himself for two years with a very modest amount of funding. He could have made more money working for a tech company in downtown San Francisco, but he chose to work for Berkeley BIDS because that was his passion. He alone was able to make big contributions in this space. We dream about what might be possible if we're able to do ten times as much as we've been able to do in the last five years.
Thanks for your attention. Thanks for using open source.
Questions and Answers
Speaker 2: Thank you very much for the talk and the great work. My question is, do you think like, I see a lot of companies now they have developer advocates where they work on open source projects, do you think that's a way to help the open source community, keep maintaining all the software and all the work?
Wes McKinney: I think it does. When companies contribute to open source projects, it goes beyond just charity. I think there are other benefits in building your company's technology brand. It's part kind of feel good, we're making the open source world better, but it's also marketing for your organization, that you're doing good in the world and you're supporting these projects. It's a win-win. I think that developer advocates, just in terms of lobbying for ... because engineers who want to contribute to open source projects may not be the best advocates for themselves in kind of arguing to management about why those contributions matter and it will make them happier and less likely to churn and move to a different company. I think developer advocates do help a lot.
Speaker 3: So, take Pandas as an example, it's been developed for years and I've talked to some people who've expressed interest in doing open source stuff, but they don't actually know how to get started, or how much startup time there might be in things like this. Do you have any advice on how to get started helping with these types of projects?
Wes McKinney: Yeah. It varies from project to project, like how easy it is to get started. I know on Pandas in particular, we tried to curate the issue tracker and identify work that would be accessible to newbies, to help people get started. There are hackathons, so there'll be hackathons or sprints at conferences. That's a great place where you can go meet the developers who are involved in these projects, and often the sprints or hackathons will be oriented at helping onboard new contributors and helping people write their first pull request. Recently we had a worldwide documentation sprint for Pandas which yielded hundreds of pull requests in a 24-hour period. It helps to have some kind of mentorship or at least engage with the development team. I always tell people, if they're not sure where to start, to reach out on the mailing list or, if you find an issue that looks interesting, to just ping the developers and say, "Hey, I'd love to work on this. How can I get started?" It varies from project to project. Maybe the Linux kernel might be a little bit hard, but I think a lot of projects are generally welcoming and would like to help people make their first contribution and get involved, even small contributions. There's no contribution too small.
Speaker 3: Thanks, Wes.
Wes McKinney: Thank you.