Data Science Platform: What is it? Why is it Important?

best practicesdata science platformknowledge management
Share

As more companies recognize the need for a data science platform, more vendors are claiming they have one. Increasingly, we see companies describing their product as a “data science platform” without describing the features that make platforms so valuable. So we wanted to share our vision for the core capabilities a platform should have in order for it to be valuable to data science teams.

A framework: the data science lifecycle

We see “the data science lifecycle” spanning three phases. Each phase has distinct demands that motivate capabilities for a data science platform:

  1. Ideation and exploration
  2. Experimentation & model development
  3. Operationalization / deployment

To some degree, all data science projects go through these phases.

  • Additionally, there is an underlying perspective that spans across this lifecycle: the need to manage, scale, and evolve your analyses and your organization’s capabilities.

Data science platforms and the data science project lifecycle.

We’ll discuss these four lenses, describing the challenges involved in each, and what capabilities a good data science platform should provide.

Ideation and Exploration

Quantitative research starts with exploring the data to understand what you have. This might mean plotting data in different ways, examining different features, looking at the values of different variables, etc.

Ideation and exploration can be time consuming. The data sets can be large and unwieldy, or you may want to try new packages or tools. If you’re working on a team, unless you have ways of seeing work others have already done, you might be redoing work. Other people may have already developed insights, created clean data sets, or determined which features are useful and which are not.

A good data science platform should make it easy to…

  • Find and understand past work, so that data scientists don’t need to start from scratch when asking new questions.
  • Explore data on large machines, without dealing with dev ops / infrastructure setup.
  • Use new packages and tools, safely, i.e., without breaking past work or disrupting environments for colleagues.

Experimentation & model development

Through the process of exploring data, researchers formulate ideas they want to test. At this point, research often shifts from ad hoc work in notebooks to more hardened, batch scripts. People run an experiment, review the results, and make changes based on what they’ve learned.

This phase can be slow when experiments are computationally intensive (e.g., model training tasks). This is also where the “science” part of data science can be especially important: tracking variations in your experiments, ensuring past results are reproducible, getting feedback through a peer review process.

A good data science platform should make it easy to…

  • Scale out compute resources to run many computationally intensive experiments at once.
  • Track your work (i.e., your experiments) so they are reproducible.
  • Share work with peers and non-technical colleagues (with other areas of expertise), to get feedback on evolving research and results

Operationalization

Data science work is only valuable insofar as it creates some impact on business outcomes. That means the work must be operationalized or productionized somehow, i.e., it must be integrated into business processes or decision-making processes. This may be in the form of a predictive model exposed as an API, a web application for people to interact with, or a daily report that shows up in people’s inboxes, for example.

Operationalizing data science work often involves engineer resources, which increase costs and delays time to market. Not only does this affect initial deployment, but if it becomes a constant tax on new iterations, it makes it more likely that models never evolve past “version 1”, or it means that by the time new research makes it into production, it’s already obsolete.

A good data science platform should make it easy to…

  • Publish models as APIs, so software systems in other languages (Java, PHP, C#, Ruby, etc.) can easily consume them without re-implementation.
  • Expose work to non-technical analysts and stakeholders — as self-service tools, dashboards, static reports, etc

Manage, Learn and Scale

In addition to helping researchers develop better models faster, platforms also bring a critical capability to teams and to managers. As companies invest more in quantitative research, they should build institutional knowledge and best practices to make the team even more effective over time.

A core value of a platform is its ability to centralize knowledge and research assets. That gives managers transparency into how people are working; it reduces key-man risk; it makes it easier to onboard people; it improves shared context, thus increasing creativity; and it accelerates the pace of research by making it possible to build upon past work rather than starting from scratch.

A good data science platform should make it easy to…

  • Browse or search for past work (including meta data, code, data sets, results, discussion)
  • Put security boundaries around work, to ensure that it’s accessible to the right people
  • Build upon past work

Open vs Closed Platforms

We believe a platform should facilitate the process we described above, ultimately making it easier to build models or other data science deliverables. A relevant question that comes up comparing platforms: what tools can be used on top of the platform?

Some platforms are closed: that is, they facilitate the processes described above, but only if you are using the platform vendor’s particular programming language, modeling packge, or GUI tool.

We built Domino to be an open platform, committed to the goal that you should be able to choose the language(s) and packages you want to use, and that a good platform will give you that flexibilty.

In our experience, an open platform is valuble for two reasons: First, it creates agility for your data scientists, making it easier for them to use the right tool for the job in different circumstances, and to experiment with new packages or languages as desired. Second, it reduces friction to adoption, because data scientists are free to keep using the same tools they’re accustomed to.

Conclusion

We hope this provided a helpful framework to readers who are investigating data science platforms. We plan to write more about our observations on this emerging landscape in the next couple of months. Stay tuned!