Data Science != Software Engineering

  • Domino

Domino’s guide, “What Engineering Leaders Need to Know About Data Science”, provides insights to help engineering leaders increase data science productivity and decrease engineering time spent on avoidable tickets. This post covers the differences between data science and engineering, because understanding those differences is an initial step toward more efficient data science workflows, tooling, and infrastructure. For more in-depth coverage and insights, the entire guide is available for download.

Why understanding key differences between data science and engineering matters

As data science becomes more mature within an organization, engineering leaders are often pulled into leading, enabling, and collaborating with data science team members. While there are similarities between data science and software development (e.g., both include code), well-intentioned engineering leaders may make assumptions about data science that lead to avoidable conflict and unproductive workflows, which those same leaders are then tasked with resolving.

Unlike software development, data science is more similar to research, has unique computing demands, and involves close work with business stakeholders with whom engineering teams don't typically engage.

Data science is more like research than engineering

Engineering involves building something that is already understood ahead of time. This allows engineering teams to track, monitor, predict, and control the engineering process. However, data science projects are often centered around answering a question that may turn into an insight or model. This focus on answering a question is what makes data science an exploratory and experimental research process. This also results in the need for more flexibility and agility around data science infrastructure and tooling than what is needed within engineering.

Variable computing demands

Engineering teams build software that may run on high-performance architecture. The engineering team uses infrastructure for testing and QA, and the infrastructure needs are static and predictable. Individual engineers often work on a single machine with 16-32 GB of RAM and four to eight cores. In contrast, the compute capacity that data science projects require is neither predictable nor constant. Data science work involves computationally intensive experiments, and memory and CPU can become bottlenecks. For example, it could take 30 minutes to write code for an experiment and then eight hours to run that experiment on a laptop. To avoid this type of bottleneck, the data scientist may use large machines to parallelize work across cores or load more data into memory.
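To make the eight-hour example concrete, here is a rough back-of-the-envelope sketch in Python. It is not from the guide: it assumes Amdahl's law as a simple model of parallel speedup, and the function name `estimated_hours` and the `parallel_fraction` parameter are hypothetical names introduced only for illustration. Real experiments rarely scale this cleanly, so treat the numbers as an upper bound on the benefit.

```python
def estimated_hours(serial_hours: float, cores: int,
                    parallel_fraction: float = 1.0) -> float:
    """Estimate wall-clock hours for a job using Amdahl's law.

    serial_hours:      run time on a single core (e.g. the laptop run).
    cores:             number of cores the work is spread across.
    parallel_fraction: assumed share of the work that can run in
                       parallel (1.0 means fully independent runs).
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    # The serial share of the work does not shrink with more cores.
    serial_part = serial_hours * (1.0 - parallel_fraction)
    # The parallel share is divided evenly across the cores.
    parallel_part = serial_hours * parallel_fraction / cores
    return serial_part + parallel_part


if __name__ == "__main__":
    # Assumed scenario: an 8-hour experiment, 90% of which parallelizes.
    for cores in (1, 8, 32, 64):
        hours = estimated_hours(8.0, cores, parallel_fraction=0.9)
        print(f"{cores:>2} cores: about {hours:.2f} hours")
```

Under these assumptions, adding cores helps a great deal at first and then levels off, because the part of the experiment that cannot be parallelized sets a floor on run time. That is one reason data science compute needs are hard to size in advance.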

Integration with other parts of the organization

While engineering is aligned with the organization's overall priorities, engineering teams are often independent and their work does not require close integration with finance, marketing, or HR teams. Data science projects are often focused on answering a question for a business stakeholder. For example, a data science team would work very closely with the HR team when building models for employee retention.

In this post, we discussed data science’s similarity to research, data science’s variable computing demands, and how data scientists often work closely with business stakeholders with whom engineering teams do not typically engage. If you are interested in reading more about how to enable data science within your organization as an engineering leader, then download Domino’s guide.