The “Joel Test” for Data Science

It's the sixteenth anniversary of Joel Spolsky's "Joel Test," which he described as a "highly irresponsible, sloppy test to rate the quality of a software team."

Back then (the late 1990s), software development was:

  • Being recognized across industries as an invaluable capability for improving business outcomes;
  • Undergoing a change from solo practitioners and small teams to large collaborative teams;
  • Undergoing a rapid evolution of best practices and tooling to support its practitioners;
  • Heavily in demand, creating lucrative job opportunities for competent practitioners;
  • Eating the world.

We think data science is going through a similar phase of evolution and maturation, so we thought it would be helpful to write something like the Joel Test for assessing the maturity of your data science program. It's our "highly irresponsible, sloppy test to rate the quality of a data science team."

Here's our first draft; let us know what you think:

The "Joel Test" for Data Science

  1. Can new hires get set up in the environment to run analyses on their first day?
  2. Can data scientists utilize the latest tools/packages without help from IT?
  3. Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?
  4. Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?
  5. Does collaboration happen through a system other than email?
  6. Can predictive models be deployed to production without custom engineering or infrastructure work?
  7. Is there a single place to search for past research and reusable data sets, code, etc.?
  8. Do your data scientists use the best tools money can buy?

These are not the only factors that will determine the success of your data science program. For example, the questions above don't cover anything related to the connection between data science work and business drivers ("do all your data science projects have a clear business goal and engaged business stakeholders?"). And you still need great people on your team.

However, if you answer "yes" to all or most of the questions above, then you're working in a way that makes good outcomes much more likely.

1. Can new hires get set up in the environment to run analyses on their first day?

We've seen organizations where it takes over a month for a new data scientist to even begin contributing. Onboarding can be delayed because new hires spend time getting the right software installed on their computers, finding and getting access to the right versions of internal resources (code, data sets), and learning how to follow internal processes.

2. Can data scientists utilize the latest tools/packages without help from IT?

There is a flourishing ecosystem of open-source tools for data science. No single tool will be a panacea; rather, organizations will be most effective when they are agile enough to experiment with new tools and techniques. To that end, trying a new package should happen at the speed of your natural research process, rather than requiring a bureaucratic IT approval process.
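
To make that concrete, here is a minimal sketch of self-service package experimentation, assuming a Python workflow with the standard-library venv module and pip; the package name (NLTK) and the environment directory are only examples.

    # A minimal sketch: create an isolated, per-project environment and try a new
    # package without touching the system install or waiting on an IT ticket.
    # Assumes Python 3 with venv and pip available; "nltk" is just an example.
    import subprocess
    import sys

    # Create the project-local environment (directory name is illustrative).
    subprocess.run([sys.executable, "-m", "venv", ".venv"], check=True)

    # Install the package to try out (on Windows the path is .venv\Scripts\pip).
    subprocess.run([".venv/bin/pip", "install", "nltk"], check=True)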

3. Can data scientists use on-demand and scalable compute resources without help from IT/dev ops?

As data volumes grow and data science algorithms become more computationally intensive, it's critical to have access to scalable compute resources. As with the point about packages above, research will progress faster if IT or dev ops processes aren't a bottleneck for data scientists.
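
As one illustration of what "on demand" can mean in practice, the sketch below assumes an AWS account, the boto3 library, and credentials already configured; the AMI ID and instance type are placeholders, and other cloud providers or internal platforms would serve the same purpose.

    # A minimal sketch, assuming AWS, boto3, and pre-configured credentials.
    # The AMI ID and instance type are placeholders, not recommendations; the
    # point is that acquiring compute is a self-service call, not a ticket.
    import boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",   # placeholder AMI
        InstanceType="m5.2xlarge",         # size to match the workload
        MinCount=1,
        MaxCount=1,
    )
    instance_id = response["Instances"][0]["InstanceId"]
    print(f"Launched {instance_id}; terminate it when the job is done.")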

4. Can data scientists find and reproduce past experiments and results, using the original code, data, parameters, and software versions?

The first question of the original Joel Test is "do you use source control?" In our experience, source control is necessary but insufficient for robust data science, because source code alone is not enough to reproduce past work. Rather, we think it's important to have a record of experiments, including the results, parameters, data sources, and the code used to produce them. The most mature organizations will also be able to re-instantiate the underlying software environment (e.g., the language and package versions) to reproduce a past result.
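
Even without a dedicated platform, a lightweight version of such a record can be captured with each run. The sketch below is one possible approach, assuming the project lives in a Git repository; the field names, log file name, and example data path are all illustrative.

    # A minimal sketch of recording an experiment alongside its results.
    # Assumes a Git repository; field names and paths are illustrative.
    import datetime
    import json
    import platform
    import subprocess
    import sys

    def record_experiment(params, metrics, data_source, log_path="experiments.jsonl"):
        record = {
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "git_commit": subprocess.run(
                ["git", "rev-parse", "HEAD"], capture_output=True, text=True
            ).stdout.strip(),
            "python_version": sys.version.split()[0],
            "platform": platform.platform(),
            "data_source": data_source,
            "params": params,
            "metrics": metrics,
        }
        # Append one JSON line per run so past results stay discoverable.
        with open(log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

    # Example usage with hypothetical values:
    # record_experiment({"alpha": 0.5}, {"rmse": 1.23}, "s3://analytics/train.csv")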

5. Does collaboration happen through a system other than email?

Data science is a team sport. During the course of a project, you'll likely get feedback from both technical colleagues and non-technical stakeholders. How are you sharing results and recording feedback and conversations? If it's happening over email, there's a good chance that those conversations, and the organizational knowledge you're accumulating, will be lost: they won't be available to new people who look back at the work later, they'll disappear if project members leave the organization, and they won't be searchable or discoverable.

A good data science collaboration platform will keep work and discussion centralized, make it searchable, etc. There are plenty of ways to do it, and email is a convenient way to get work into such a platform, but email should not be the primary way that collaboration happens.

6. Can predictive models be deployed to production without custom engineering or infrastructure work?

If engineers must be involved to integrate data science output into business processes, you are delaying your time-to-market and thus reducing the value of your data science work. Infrastructure and platforms can empower data scientists to quickly "productionize" their work without an extra, and sometimes very long, step.
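
For contrast, the sketch below shows roughly what that extra step can look like when done by hand, assuming Flask and a pickled scikit-learn-style model; the file name and route are hypothetical. A platform that automates this kind of plumbing takes it off the data scientist's critical path.

    # A minimal sketch of hand-rolling a prediction service, assuming Flask and
    # a pickled scikit-learn-style model on disk; "model.pkl" and the /predict
    # route are illustrative names only.
    import pickle

    from flask import Flask, jsonify, request

    app = Flask(__name__)

    with open("model.pkl", "rb") as f:
        model = pickle.load(f)

    @app.route("/predict", methods=["POST"])
    def predict():
        features = request.get_json()["features"]  # e.g. [[5.1, 3.5, 1.4, 0.2]]
        return jsonify({"prediction": model.predict(features).tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)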

7. Is there a single place to search for past research and reusable data sets, code, etc.?

Many data scientists believe they make their biggest impact when they answer a question, produce a model, or create a report. Actually, the longer-lasting, more leveraged impact is made when their work contributes to the collective knowledge of the organization in a way that can be built upon in the future. Therefore it is important that, as research progresses, it's persisted in a way that can be discovered and reused later—and the other side of that coin is that people have an easy way to find and reuse that past work.

Searching across dozens of network folders, SharePoint sites, and repositories is not an effective way to preserve organizational knowledge. There should be a single system of record, even if its search results link out to auxiliary systems.

8. Do your data scientists use the best tools money can buy?

We took this one straight from Joel's list. Data scientists are expensive, value-adding people—equipping them well is a great investment.

Banner image titled "Graffiti & Street Art At Portobello (Dublin)" by William Murphy. Licensed under CC BY-SA 2.0

  • Nitin A

    I struggle with some of the points, for instance "Can data scientists utilize the latest tools/packages without help from IT?". Are data scientists developers? It largely depends on how you define a data scientist. If he/she doesn't have a development background, I highly doubt he/she is using proper fundamentals in Python/Scala/R to do analysis, create models, etc. and building something that can be shared and re-used by other data scientists in the future. Are they using Maven or Gradle for their build environment? Do they write tests? For this point to be valid, data scientists need to show some maturity when doing "data science" work and not treat problems as one-off, non-reproducible exercises. That said, if your data science team is technical enough (they don't have to be hardcore developers) and analytic gurus, this point is easy to digest.

    • daftmath

      Anyone who calls themselves a data scientist should be able to write good code, but what I take this point to mean is that a data scientist can install an R package or a Python library without putting in a request to IT. For example, if the data scientist wants to try a free open-source tool like Python's NLTK, they should be able to grab it and go without a week-long request process; otherwise they'll spend more time battling IT and bureaucratic processes than doing data science. This is important because a good data scientist will experiment with a number of tools and models to find what works best.

      • Nitin A

        Agreed. The IT barriers need to be dropped, not just for data scientists but for all developers, so we can learn, fail, learn, repeat. That said, in my experience, data scientists are not really coders, forget "good" coders. IMHO, one of the points should be about data scientists being developers: understanding version control, coding practices, continuous integration, etc. It's very easy to blame IT for their strict, non-innovative processes, but to assume that data scientists actually follow some sort of SDLC is not realistic. Either way, I see where you are going with this; it's a good read and I hope it catches on!

  • Kevin K

    I am coming into data science from a background as a developer, and trying to help others do the same. That perspective leads me to suggest two others: (9) do the prospective members of your data science team have enough knowledge of statistics to understand the modeling that they are doing? and similar to your #5, I would add (10) does the organization that produces the data make the data available to the analysts without requesting and waiting for extracts from their production systems?

  • Cody Peeples

    9. (should really be #1) Does your data science culture guide everyone touching sensitive data to an understanding of the weight of the responsibility they are entrusted with? Do your data scientists consider the potential privacy, security, and safety consequences of mishandling raw data, finished reports, and anything in between?

    This post represents my own personal opinions and not necessarily those of my employer, Cisco Systems, Inc.

  • There is a lot that data scientists might consider from Six Sigma Process Excellence concepts like Define, Measure, Analyze, Improve, and Control. While we are at it, has your local Black Belt been engaged as a coach?