Collaboration Between Data Science and Data Engineering: True or False?

November 19, 2018

This blog post includes candid insights about addressing tension points that arise when people collaborate on developing and deploying models. Domino’s Head of Content sat down with Don Miner and Marshall Presser to discuss the state of collaboration between data science and data engineering. The blog post provides distilled insights, audio clips, excerpted quotes as well as the full audio and written transcript. Additional content on this topic will be forthcoming from additional industry experts.

Introduction

Over the past five years, we have heard many stories from data science teams about their successes and challenges when building, deploying, and monitoring models. Unfortunately, we have also heard that many companies have internalized the model myth, or the misconception that data science should be treated like software development or data assets. This misconception is completely understandable. Data science involves code and data. Yet, people leverage data science to discover answers to previously unsolvable questions. As a result, data science work is more experimental, iterative, and exploratory than software development. Data science work involves computationally intensive algorithms that benefit from scalable compute and sometimes requires specialized hardware like GPUs. Data science work also requires data, a lot more data than typical software products require. All of these needs (and more) highlight how data science work differs from software development. These needs also highlight the vital importance of collaboration between data science and engineering, particularly for innovative model-driven companies seeking to maintain or grow their competitive advantage.

Yet, collaboration between data science and engineering is a known challenge.

“Seek Truth, Speak Truth” is one of Domino’s core values. In that spirit, Domino’s Head of Content sat down with Don Miner and Marshall Presser to have a respectful and candid conversation about differing priorities, known collaboration challenges, and potential ways to address these challenges. Both Miner and Presser have extensive practical experience within data science and engineering. Miner, a founding partner of a data science and AI firm, represents the data science perspective. Presser, who is on the data engineering team at Pivotal, represents the data engineering perspective. This blog post covers distilled highlights, key excerpted quotes, audio clips, as well as a full playback and transcript from the conversation. Additional content on this topic from other industry experts is forthcoming. The purpose of this blog post and future content is to contribute to the public dialogue around the collaboration challenge, a dialogue that has lacked in-depth analytical discourse from multiple perspectives.

Data Science vs Data Engineering: How did we even get here?

The candid discussion kicked off with an examination of the current state and how we, within data science, arrived at it. There is a seemingly endless list of terms to describe people who interact with models. Just a few of the terms currently in use include researchers, data scientists, machine learning researchers, machine learning engineers, data engineers, infrastructure engineers, DataOps, and DevOps. Both Miner and Presser agreed that the work itself existed before any of these terms were assigned to it. Presser defines data engineering as embodying the skills to obtain data, build data stores, manage data flows including ETL, and provide the data to data scientists for analysis. Presser also indicated that data engineers at large enterprise organizations have to be well versed in “cajoling” data from departments that may not, at first, be willing to provide it. Miner agreed and indicated that there is more thought leadership around the definition of data science than around data engineering, which contributes to the ambiguity within the market.

Marshall Presser: “We started with data engineers before we had data scientists, I think. And data engineers did such things as build data warehouses from which people did kind of rudimentary business intelligence, and slice and dice, and kind of analysis of past state, and what the business, whatever that business is, looked like yesterday with some kind of minimum analysis of a future state. And then people came to the bright conclusion that we could actually do more with data than just report on the past. And so from my perspective, data scientists kind of entered into modern analytic thinking, if you like. 10, 15 years ago I don’t think I even heard the term data scientist. I can’t remember when I first heard it, but it was a while back. I’m not even sure I heard the word data engineer back then. We evolved these two different specialties and, Don, I’m going to pass to you in a second, but my sense of what the difference is, is that data engineers have the job of acquiring data from various sources. Massaging it, getting it into a place where then data scientists can do interesting machine learning with it. So, that’s what I think the current state looks like.”

Don Miner: “I agree with pretty much everything that Marshall said…just because we coin these terms, data engineer, data scientist onto these things doesn’t mean it didn’t exist before. [These terms reached] a critical mass at a certain point where people were, “You know, we should probably call that something.” There’s enough data scientists running around, “Oh, you know what? That should have a name.” Or there’s enough data engineers running around now that, that should have a name…. people spend a lot more time defining what data science means and not so much defining what data engineering means. All the way from university curriculums… I’ve never heard of anybody having a data engineering undergrad class, but you’re starting to hear data science classes pop up. … I have some ideas about why that is, but I think where we’re at right now is data science is a pretty fairly well defined career path and profession. People generally know what that means.…there’s a lot of impact from hype still that’s starting to wear down a little bit. But on the data engineering side….has really been left alone from the typical opinionated people that would be helping define these things, talking about it at conferences. That has still left a lot of ambiguity in the market there. So, I think that’s where we’re at right now.”

How do these differences translate in real life? e.g., recruiting Data Scientists and Data Engineers?

As both Miner’s and Presser’s perspectives are grounded in practical experience, the discussion turned to how the differences between data scientists and data engineers translate into which skills are prioritized in hiring and recruiting. Miner relayed that when he recruits data scientists, he looks for technical ability (e.g., machine learning) as well as potential domain expertise. By contrast, when he recruits data engineers, he often looks for software engineers who happen to have database experience, signals of technical versatility (e.g., working with Kafka), and a “certain type of attitude”:

“This is different for different organizations but… data engineers need to be really versatile, they need to have the ability to work in lots of different kinds of roles. They need to be able to write software, they need to be able to work with databases, they need to be able to do DBA things, they need to care about security, they need to care about networking. It’s a very interdisciplinary role, and so really my number one facet when I’m looking for a good data engineer is flexibility and versatility in their technical skills. And also, like you kind of mentioned, you need to have a certain type of attitude in order to succeed in working in the bowels of a data organization. They need to be very resilient in dealing with frustrating issues. Meanwhile, a data scientist, typically I’m looking for technical skills like machine learning experience. A specific skillset that I’m looking for, maybe certain domains that they’ve worked in, in the past. So, actually I would say right now for data scientists usually I’m looking for specific technical abilities. With a data engineer it’s more about attitude and versatility than it is about their specific technical skills.”

Both Presser and Miner agreed that the function of data engineering is important, particularly the navigation skills to obtain data. Miner, in particular, noted:

“in our consulting engagements, and also two other data science consulting companies that I know and work with, if we have a pure play data science project, meaning that the data engineering’s not in scope, the customer said that they were going to take care of it, we won’t start work until we have proof that the data’s been loaded. We’ve been burned so many times by them saying like, “Oh, you know what? You guys can start on Monday. We’ll get the data loaded sometime next week.” We’re not even going to start until that data’s there….that’s the other issue too with the data engineer. I actually ran into this issue….on the younger side of the data engineers, one of the issues that we run into is that they don’t have the seniority to stand up to some ancient Oracle DBA that’s not willing to play nice. …it’s a really hard role to fill because, you’re right,… the interpersonal skills, and the political navigation skills are really important for the data engineer.”

Current state of collaboration: candid insights

After exploring the differences in skills, technical abilities, and workflow priorities, the conversation moved toward very candid insights about collaboration between data science and data engineering. The challenges that surfaced include poor communication in general, a lack of two-way respect, a potential lack of good project management, and the expectation that a data science workflow should look like a software development workflow. When asked “What is the current state of collaboration, given that aspects are emerging and may differ depending on the organization?”, Miner indicated:

“I have two answers to this. One is that I don’t think that data scientists and data engineers at most organizations that I’m working with have figured out how to communicate with anybody. So, not even with each other, but how does a data scientist and a data engineer fit into, a modern one, that’s building some new systems, how are they interacting with different lines of business? How are they interacting with marketing, sales? How are they interacting with product design? ….even this at a fundamental level, there’s major problems in the industry. And how they’re interacting with each other? … it’s hard to say because I can’t really say that, at least in the past couple of years that I’ve had very many interactions with like, that guy’s a data scientist, that guy’s a data engineer. Our roles are clearly defined and they’re communicating. So I guess I’m going to give a non answer and say that, I don’t know, it’s too early to tell…..[from] my perspective, I can say some things about different people playing different roles in different scenarios and how they’re communicating.

But overall, I don’t think the roles are very clearly defined yet to be able to really say how they’re communicating…

In a couple of places where I have seen it be pretty functional, and you have had a functional data engineer that had responsibility for the data, and you have had the data scientist…In a lot of cases what’s not being seen enough is respect in both directions…the data engineer is like, “This data scientist doesn’t know what he’s doing. He doesn’t know how to work with data. The data scientist doesn’t know how hard this data engineering stuff is.” And on the same side, the data scientist is frustrated that the data engineer is not getting things done fast enough. Not getting it done in the format that they want it in…. the best data people I’ve worked with in both directions have had empathy for the other person’s situation. The data engineer has intuition about what the data scientist is looking for, and what they need. And the data scientist has intuition about what’s hard for the data engineer, and what’s unreasonable for that person to do…. that’s the best scenario that I’ve seen. The worst scenario, which is the one that I see typically, is the data engineers are just processing data and not being worried about things like duplicates, or like things encoded in the wrong way, and tables being laid out in ways that aren’t appropriate for data science. And then the data scientists see this stuff and they’re just like, “This is garbage. What are you doing? I’m just going to do it myself now.” And they’re going to run into a whole bunch of problems because they don’t know how to access the data and stuff. I think really what it comes down to is understanding each other’s situations and understanding that they are both hard, and working through that.”

Presser added that aligning people at the beginning of a project is an important way to build empathy and address collaboration tensions, and that the scenario Miner described

“is not the least bit uncommon, [it] is a symptom of really bad project management. It seems to me that the way to solve this problem is to have everybody in the room when the project is being designed … It’s sort of like life insurance. You know, you don’t really need it until you need it, but you’ve got to keep having it, even when you don’t need it. The projects that I’ve seen that have been most successful are the projects in which the data scientists, the data engineers, and… the application developers are all there in the room from the beginning, with the customer talking about what the problem is they want to solve, what a minimal product is, what the final solution should be, what the users expect out of this. And if you start from that place you’re much more likely to get empathy. …That’s the first thing.

The second thing is that, I find such difficulties that Don described don’t exist, at least in many of the projects I’ve been working on, between the data scientists and the data engineers as much as between the data scientists and the data engineers and the applications developers. Because the application developers have, I don’t want to say contempt for data, that’s way too strong, but what I would say is they don’t have as much experience and love of data that Don and I do.

To them, a database is a database, data is data, oil is oil. You know, it’s all the same. They’re not interested in thinking about, in general, the kinds of data collection and issues that they’re going to need to solve the problem. They’re sort of, “Let me come out with a minimal, viable application really quickly.” And, by the way, I’ve actually heard a project manager say, “You know, any line of code that my developers write to audit what they’re doing, to put stuff in a database, is a line of code that they’re not putting in developing the application.” And so they frequently encourage a huge technical debt as they’ve got this great application now, but when it comes time for phase two of the project, to do something interesting with the data that this application should have stored somewhere but didn’t, we’re kind of left holding the bag because the application developers were kind of short sighted. And to my mind this is the kind of short term thinking that hinders really good data science.”

Another potential point of tension is organizations treating data science like software development. Miner noted:

“something that we advise our clients on all the time, and is a major portion that I think takes people by surprise sometimes, is that most organizations is that their default is to treat their data science projects like software engineering projects that they’re currently running at the organization. So if they want their data scientists to be filling out Jira tickets and have Sprints. Not only the data scientists, but data engineering is not a similar task like that either. And the platform architecture too, is similar. They all share something in common. In data science, data engineering, and platform architecture, it’s one of those things where you can spend forever on something and it won’t be done. So, it’s all about, “When do I feel like stopping?” Or, “When do I run out of money?” Rather than, “Okay, this application is done. I’ll ship it, it’s in a box. It’s all good to go. We release it to the world and we sell it. It’s great.” On the data science side it’s hard to tell how long something’s going to take until you do it. So there’s this chicken and egg problem. I can’t write the Jira ticket it’s going to take two weeks, until I actually spend the two weeks to do it, and realize it’s actually going to take four weeks. And so when you try to apply these traditional software engineering project management things on these projects it doesn’t work. It actually causes harm in a lot of cases….there’s actually a new discipline that needs to arise.”

Addressing the collaboration challenges: in-person communication, collaboration tools, and a Data Liaison

Collaboration between data science and data engineering is a hard problem to solve. While there was consensus that the difficulty of the problem has contributed to a lack of extensive public discourse, Miner and Presser dove into aspects that have the potential to ease the tension points around collaboration. Earlier in the conversation, early stakeholder alignment, mutual respect, and intuition about each other’s responsibilities arose naturally as ways to support collaboration. When asked directly for potential ways to address the collaboration tension points that create barriers to developing and deploying models, they offered additional suggestions about corporate culture, collaboration tools, and a “data liaison”.

Presser noted that corporate culture contributes to collaboration:

“I think it’s, in many ways, a corporate culture kind of thing. There are organizations that work well together, and there are others that don’t, and I work a lot in the federal government space where this project consists of people from various organizations that are not part of the federal government. Outsourced project management, outsourced database management, outsourced this, outsourced that. And there’s a little fighting over fiefdom here, and a customer either can’t do anything about it for contractual reasons, or chooses not to do anything about it. But that’s the opposite of what Don was talking about in terms of empathy and respect, and it’s driven by, in many ways, where the revenue dollars are coming from. So, I find that some organizations I like working with, some organizations I don’t like working with because the corporate culture is not one of sharing, of empathy and respect. So choose your partners well.”

Miner agreed that corporate culture is a contributing factor:

“I think the best organizations that I’ve worked for have been the ones that fostered open communication, no competition within. Not very many egos…you can get away with it in a lot of other things, and data projects are not one of them. That’s the issue, … an organization that has maybe found that [many egos] successful for other types of work that they’ve done, in this case it’s not very successful…my answer to the question of what I’ve seen work well, I think one of the big ones to me is, everybody having a good energy, and knowing what the goals are. And I think that also ties into corporate culture as well. A corporate culture that has very clear goals, or a leader that has very clear goals, that’s being very transparent about what those goals are, allows everybody to align themselves, their little micro interactions throughout the day, to be part of those goals. Also, goals in data science are often weird. Sometimes they’re not straightforward.”

While Presser advocates prioritizing in-person collaboration to accelerate work and address collaboration tensions, Miner advocates for a “data liaison” role as well as collaboration tools, due to the nature of data science work:

“The other thing that I wanted to add onto what Marshall said about fundamental communication, because I do agree that too often not all the stakeholders and not all the people are going to be the different identifies are going to be involved early on in the discussions. This is actually where a lead data science liaison type role fits in a company where you don’t necessarily need your data scientist, like at a large organization, being involved in every decision, but having a data science leader, that’s a chief data officer, or chief data scientist, or whatever the title is, I don’t think it’s really nailed down, is involved in these scoping meetings. We’ve seen that be successful. Maybe another thing too on the communication standpoint, I’m actually going to provide a vote for real time remote collaboration tools in working…..I agree that in the beginning of the project it’s really good to get everybody in the room due to the amount of communication that needs to happen. But also, too, email feels almost too slow for these projects. Data scientists are kind of trickling in on insights, and data engineers too, are running into different problems in an ad hoc way as they’re actively working. So we use Slack a lot, I think a lot of people do right now and it’s been pretty successful, because you don’t have to bunch up a bunch of stuff to put into an email like, “Here’s my list of problems today.” Maybe you may have two data scientists talking about an issue and the data engineer is eavesdropping and saying, “Oh hey, by the way, this is how I designed it,” or like, “Oh hey, yeah I can fix that for you real quick. Not going to take me much time at all.” So this more real time communication is good, and I think also too, it’s almost better than in a physical office in some cases too. Even if you’re sitting at a desk, three desks away from the data engineer, you still have to get up and go bother that person. 
Here, I think I’m actually making the argument that I think Slack and other things like it, may actually be one of the best tools for this thing right now, as the project’s going on.”

When asked to unpack the idea of a “data liaison” and to clarify whether this person could be a “project manager”, Miner indicated:

“…in a consulting construct, that both myself and Niels [co-founder] provides in some of our larger projects. And it’s a really necessary role and some of the other customers we work with, we’ve made this recommendation for them to do this, it’s actually two reasons. One is that, data science requires a lot of focus. When you’re working on a data science problem and you’re fumbling with some machine learning thing, you’re messing with the data, an interruption can break down a house of cards in your head that you’ve been building for multiple hours and if you’re responsible for going around to random meetings to discuss use cases and things, you’re never going to get anything done…what you need to do, is you need to kind of pick somebody. I mean honestly, these are some personality types that are better than others, but really it needs to be somebody that could do it if they had to, that understands the real problems, that can represent the data scientists that are actually going to do the work in these meetings. But due to the focus requirement you kind of need to pick somebody to be the sacrificial person to do it, that’s okay going around and talking from experience so that the others can focus. It’s a really important role… in a large organization with a large team.”

Reflections on the potential future state

After problem solving for potential aspects that may help ease collaboration tensions, the discussion moved to what the potential future state of collaboration could look like. Potential future state scenarios discussed include increasing specialization of roles as well as the need for a discipline or process to help manage collaboration.

Marshall Presser: “…from a future state perspective, I think specialization of roles is only going to increase. We’re going to get people who are purely data scientists, people who are purely application developers, people who are purely data engineers, people who are purely platform architects, people who are purely liaison, people who are purely project management that may tie in to the liaison role, and keeping these people coordinated and so that they can, one, speak a common language and they have sympathy and respect for one another, I think that’s a challenge going forward. But once we solve that problem, it’ll be great.”

Don Miner: “I think on Marshall’s point, … the biggest problem here about this lack of process around management, around data engineering, the communication between data engineering and data science, this lack of management, if you want to specialize, you want to have a data liaison…do you want to have a data engineer specialist, because the earliest data science project, like the smallest one, data scientist is doing the data engineering work too. And probably the platform architecture work too, and the application development.

Once you start specializing, which is why we have data engineers and data scientists now, these two people need to have a process to communicate.

When you have an application developer, now they need a process to communicate and work together.

You have the platform architecture, you got management, you got the advisory liaison person, you got the rest of the business, all is about process and, honestly, I don’t think anybody really knows what they’re doing. I think the number one thing that’s holding us back in this industry, is building large data science teams and organization. The most successful data science teams I see right now are like three people… it could be a massive organization, but those three people are getting a lot of work done, and if they wanted to scale up to 20 people, 40 people, it’s not going to work. I actually have a specific anomaly that I saw the other day, where I’m hiring a new data scientist in Denver. Particularly wanted a senior data scientist in Denver, so I posted a job opening on LinkedIn for a Denver data scientist. I got something like 30 applications in a few days. 11 were from one company…. I ask some of my colleagues that are in Denver, saying “What’s wrong with company X? I just got 11 applications from data scientists from this company.” First of all I didn’t even know they had a lot of data scientists, and they said … because [they] are data scientists, and they [said] “Yeah, their job openings are all over the place. They hired a crazy number of … hundreds of data scientists over the past two years.” … now obviously they’re hemorrhaging, because they probably didn’t actually think about how to communicate. I think that’s where I would like to see the world go, is if we had better processes, just like we got through on the software engineering side, continuous integration and testing, good UX principles and things like that. We can build really scalable software teams now.

Data science isn’t there yet…… the topic of the data engineer and data science thing though, is the tip of that spear.”

Conclusion: managing data science: hard, but not impossible

Don Miner: “There’s not really very many practitioners out there saying, “How do I manage a data science project well?”…. Somebody’s going to have to talk about it at some point.”
Ann Spencer: “Why do you think that is? Why do you think that people aren’t talking about it, or aren’t addressing it?”
Marshall Presser: “Well, for one, it’s hard.”

One of Domino’s core values is “Seek Truth, Speak Truth”. We leverage this core value in our content to support people tackling hard, and perhaps previously unsolvable, problems within data science. This blog post covered distilled insights, audio clips, and excerpted quotes from a candid discussion about tension points that arise when people collaborate around the development and deployment of models. For more in-depth insights, consider listening to the audio recording, which runs over 45 minutes, or reading through the full transcript. Both are provided below. We also realize that there are additional situations, nuances, and textures regarding collaboration that were not covered in this blog post, and we are working with additional industry experts to amplify different perspectives. We will continue to provide content that covers collaboration between data science and engineering. If you are interested in contributing to this public discourse, contact us at writeforus(at)dominodatalab(dot)com.

Full audio recording

For those interested, this section provides the full audio recording, over 45 minutes of the discussion.

Full text transcript of audio recording

This section provides the full text transcript of the audio recording. The text has been edited for readability.

Ann Spencer, Head of Content, Domino: For the discussion today we’re going to start off with a baseline in terms of what the current state is, and then we’re going to go into everyone’s perspectives regarding the current state, as well as perspectives on the future state. Just as a broad overview. To kick off the discussion, why don’t we talk about what’s going on. Let’s talk about the emerging roles within data science and data engineering. We hear many titles and roles being discussed that touch models. These titles or roles have terms like “data scientists”, “researchers”, “machine learning engineers”, “machine learning researchers”, “data engineers”, etc. So, what is the current state and how did we get here?

Marshall Presser, Data Engineering, Pivotal: Okay, well, from my perspective as a data engineer. We started with data engineers before we had data scientists, I think. And data engineers did such things as build data warehouses from which people did kind of rudimentary business intelligence, and slice and dice, and kind of analysis of past state, and what the business, whatever that business is, looked like yesterday with some kind of minimum analysis of a future state. And then people came to the bright conclusion that we could actually do more with data than just report on the past. And so from my perspective, data scientists kind of entered into modern analytic thinking, if you like. 10, 15 years ago I don’t think I even heard the term data scientist. I can’t remember when I first heard it, but it was a while back. I’m not even sure I heard the word data engineer back then. We evolved these two different specialties and, Don, I’m going to pass to you in a second, but my sense of what the difference is, is that data engineers have the job of acquiring data from various sources. Massaging it, getting it into a place where then data scientists can do interesting machine learning with it. So, that’s what I think the current state looks like.

Don Miner, Founding Partner, Miner & Kasch: Yeah. I agree with pretty much everything that Marshall said…just because we coin these terms, data engineer, data scientist onto these things doesn’t mean it didn’t exist before. I think they just got to a critical mass at a certain point where people were like, “You know, we should probably call that something.” There’s enough data scientists running around like, “Oh, you know what? That should have a name.” Or there’s enough data engineers running around now that, that should have a name…people spend a lot more time defining what data science means and not so much defining what data engineering means. All the way from university curriculums… I’ve never heard of anybody having a data engineering undergrad class, but you’re starting to hear data science classes pop up. So, you know, I have some ideas about why that is, but I think where we’re at right now is data science is a pretty fairly well defined career path and profession. People generally know what that means. Though I think there’s a lot of impact from hype still that’s starting to wear down a little bit. But on the data engineering side, I think has really been left alone from the typical opinionated people that would be helping define these things, talking about it at conferences. That has still left a lot of ambiguity in the market there. So, I think that’s where we’re at right now.

Marshall Presser: Certainly you’re quite right. I’ve never seen anything that calls itself a data engineering curriculum. Yet, the stuff that we do has been done for years, but it’s just never been glorified, to your point, with a name until recently.

Don Miner: I’ll echo this too. When I’m recruiting for the company I work for, if I’m looking for a data scientist on LinkedIn and I use my recruiter tool on there, and I search for data scientists, I find a bunch of data scientists. When I’m looking for a data engineer I can’t type in data engineer. There are people that put it as their title, but a lot of times I’m looking for a software engineer, for example, their title is software engineer and they have like database expertise, and Hadoop expertise on their resume. Or I’m looking for like a DBA that has picked up Kafka and things, right? So it’s not as stamped as data science is.

Marshall Presser: So, when you look for a data engineer, I mean in a perfect world where you could examine everyone’s resume in gory detail and talk to them, what are the skillsets that you’re looking for in general, and do you find them?

Don Miner: Finding a good data engineer actually, I think, is harder than finding a good data scientist. …The reason is the answer I’m about to give, which is, I think when I’m looking for a data engineer, and this is different for different organizations, but data engineers need to be really versatile, they need to have the ability to work in lots of different kinds of roles. They need to be able to write software, they need to be able to work with databases, they need to be able to do DBA things, they need to care about security, they need to care about networking. It’s a very interdisciplinary role, and so really my number one facet when I’m looking for a good data engineer is flexibility and versatility in their technical skills. And also, like you kind of mentioned, you need to have a certain type of attitude in order to succeed in working in the bowels of a data organization. They need to be very resilient in dealing with frustrating issues. Meanwhile, a data scientist, typically I’m looking for technical skills like machine learning experience. A specific skillset that I’m looking for, maybe certain domains that they’ve worked in, in the past. So, actually I would say right now for data scientists usually I’m looking for specific technical abilities. With a data engineer it’s more about attitude and versatility than it is about their specific technical skills.

Marshall Presser: One of the things I find I do as a data engineer that I think data scientists don’t do is, in large organizations that are separated into fiefdoms or domains, when the data scientists and I sit down and talk about the data that we need to solve a problem, it comes from lots of different places. Part of the job of the data engineer, with the help of others, is cajoling the people who own that data to give it up, to “spend valuable machine cycles” sucking that data out of, let’s say, mainframe databases so that it can get put into a more analytic framework someplace. And dealing with the sort of political issues of getting data rather than the truly technical issues. I’m not trying to minimize the technical issues. But you guys [data scientists], in general, and I hate to make this an us versus you, but for the moment you guys in general don’t deal with that as much as I think data engineers do.

Don Miner: Yeah, absolutely. So in our consulting engagements, and also two other data science consulting companies that I know and work with, if we have a pure play data science project, meaning that the data engineering’s not in scope, the customer said that they were going to take care of it, we won’t start work until we have proof that the data’s been loaded. We’ve been burned so many times by them saying like, “Oh, you know what? You guys can start on Monday. We’ll get the data loaded sometime next week.” We’re not even going to start until that data’s there. That’s the other issue too with the data engineer. I actually ran into this issue… on the younger side of the data engineers, one of the issues that we run into is that they don’t have the seniority to stand up to some ancient Oracle DBA that’s not willing to play nice. So I think it’s a really hard role to fill because, you’re right, … the interpersonal skills, and the political navigation skills are really important for the data engineer.

Don Miner: Every single day a data engineer could probably find a million excuses of why they don’t have to do their job that day because there’s something else that’s somebody else’s fault, right? “That guy’s not giving me data, so I guess I’m just going to take a long lunch.” Or, “The data’s dirty, it doesn’t work so I’m just going to give up and go work on my next problem.” Right? There’s like a million excuses and frustrations and so the data engineer needs to be really flexible and resilient for sure.

Ann Spencer: Given all these different aspects that both of you have brought up when it comes to political navigation, the data itself, working with individual stakeholders, what is the relationship like between data science and data engineering? Given that it’s still emerging, or people have different definitions of it, or it depends on the situation? From both of your perspectives, what is that relationship like? Or what does the collaboration look like?

Don Miner: I have two answers to this. One is that I don’t think that data scientists and data engineers at most organizations that I’m working with have figured out how to communicate with anybody. So, not even with each other, but how does a data scientist and a data engineer fit into [an organization], a modern one, that’s building some new systems, how are they interacting with different lines of business? How are they interacting with marketing, sales? How are they interacting with product design? Like, even at this fundamental level there’s major problems in the industry. And how they’re interacting with each other? I think it’s hard to say because I can’t really say that, at least in the past couple of years that I’ve had very many interactions with like, that guy’s a data scientist, that guy’s a data engineer. Our roles are clearly defined and they’re communicating. So I guess I’m going to give a non answer and say that, I don’t know, it’s too early to tell. Like in my perspective, I can say some things about different people playing different roles in different scenarios and how they’re communicating. But overall, I don’t think the roles are very clearly defined yet to be able to really say how they’re communicating…In a couple of places where I have seen it be pretty functional, and you have had a functional data engineer that had responsibility for the data, and you have had the data scientist, I think in a lot of cases what’s not being seen enough is respect in both directions. I think the data engineer is like, “This data scientist doesn’t know what he’s doing. He doesn’t know how to work with data. The data scientist doesn’t know how hard this data engineering stuff is.” And on the same side, the data scientist is frustrated that the data engineer is not getting things done fast enough. Not getting it done in the format that they want it in…the best data people I’ve worked with in both directions have had empathy for the other person’s situation.
The data engineer has intuition about what the data scientist is looking for, and what they need. And the data scientist has intuition about what’s hard for the data engineer, and what’s unreasonable for that person to do…that’s the best scenario that I’ve seen. The worst scenario, which is the one that I see typically, is the data engineers are just processing data and not being worried about things like duplicates, or like things encoded in the wrong way, and tables being laid out in ways that aren’t appropriate for data science. And then the data scientists see this stuff and they’re just like, “This is garbage. What are you doing? I’m just going to do it myself now.” And they’re going to run into a whole bunch of problems because they don’t know how to access the data and stuff. I think really what it comes down to is understanding each other’s situations and understanding that they are both hard, and working through that.

Marshall Presser: So, I think what you’ve just described, which is not the least bit uncommon, is a symptom of really bad project management. It seems to me that the way to solve this problem is to have everybody in the room when the project is being designed even though, for 80% of the time the person … it’s sort of like life insurance. You know, you don’t really need it until you need it, but you’ve got to keep having it, even when you don’t need it. The projects that I’ve seen that have been most successful are the projects in which the data scientists, the data engineers, and god help us, the application developers are all there in the room from the beginning, with the customer talking about what the problem is they want to solve, what a minimal product is, what the final solution should be, what the users expect out of this. And if you start from that place you’re much more likely to get empathy. Because people who work together, and physically in the same room, while I love remote conferencing, people in the same room for a project, scoping, inception, get things done better than people spread out across the globe. That’s the first thing. The second thing is that, I find such difficulties that Don described don’t exist, at least in many of the projects I’ve been working on, between the data scientists and the data engineers as much as between the data scientists and the data engineers and the applications developers. Because the application developers have, I don’t want to say contempt for data, that’s way too strong, but what I would say is they don’t have as much experience and love of data as Don and I do. To them, a database is a database, data is data, oil is oil. You know, it’s all the same. They’re not interested in thinking about, in general, the kinds of data collection and issues that they’re going to need to solve the problem.
They’re sort of, “Let me come out with a minimal, viable application really quickly.” And, by the way, I’ve actually heard a project manager say, “You know, any line of code that my developers write to audit what they’re doing, to put stuff in a database, is a line of code that they’re not putting in developing the application.” And so they frequently incur a huge technical debt as they’ve got this great application now, but when it comes time for phase two of the project, to do something interesting with the data that this application should have stored somewhere but didn’t, we’re kind of left holding the bag because the application developers were kind of short sighted. And to my mind this is the kind of short term thinking that hinders really good data science.

Don Miner: Yeah, I think your bringing up the application developer is a really good point. And it’s something that we started doing on our projects when we have complete control. There’s also another type of person, which is the platform architect. Our dream team, if the customer gives us everything we want, which at the end of the day we usually consolidate some roles, is the data scientist, or data scientists, the data engineers, the application developers, and the platform architects that are building the underlying system that thing’s running on. And they all have to have this empathy that I brought up. The platform architects need to build the platform that the data scientists are going to use and data engineers can utilize. It needs to be able to host the application. The application needs to be able to do things in a scalable manner and visualize data, and all these things. I think Marshall’s point, I chuckle pretty hard at this, is that good project management of managing those four personalities is really hard. There’s always a shortage of good project managers. But I think also, this is something that I think a lot about, that I wish I had more time to write down. I bet Ann wishes I had more time to write down. The philosophy around managing a data science project, or not the data science project but a data project in general. I think the industry knows pretty well how to build software at this point, at least in comparison to when I started, which isn’t too long ago. I think data just hasn’t gotten there yet. Also, I don’t really see the industry moving in that direction either. There’s not really very many practitioners out there saying, “How do I manage a data science project well?” I don’t see very many people talking about that. Somebody’s going to have to talk about it at some point.

Ann Spencer: So, why do you think that is? Why do you think that people aren’t talking about it, or aren’t addressing it?

Marshall Presser: Well, for one, it’s hard. And, two, as Don points out, just as there is no such program called The Master’s in Data Engineering, there’s no such program called The Master’s of Product Management of Data Science. So we don’t have a lot of people with the skillset, or the experience of managing these projects. And Don is absolutely right, there is a dearth of really good project managers, and in many cases because of financial and revenue constraints they’re spread too thin, managing multiple projects simultaneously. So, when you don’t get adult supervision, so to speak, between all the people working together that Don mentioned on this project then things tend to fray at the edges, and that’s why we see a lot of failures or successes that are not as good as they might be.

Don Miner: I want to add to that, and this is something that we advise our clients on all the time, and is a major portion that takes people by surprise sometimes, is that most organizations’ default is to treat their data science projects like software engineering projects that they’re currently running at the organization. So they want their data scientists to be filling out JIRA tickets and have Sprints. Not only the data scientists, but data engineering is not a similar task like that either. And the platform architecture too, is similar. They all share something in common. In data science, data engineering, and platform architecture, it’s one of those things where you can spend forever on something and it won’t be done. So, it’s all about, “When do I feel like stopping?” Or, “When do I run out of money?” Rather than, “Okay, this application is done. I’ll ship it, it’s in a box. It’s all good to go. We release it to the world and we sell it. It’s great.” On the data science side it’s hard to tell how long something’s going to take until you do it. So there’s this chicken and egg problem. I can’t write the JIRA ticket saying it’s going to take two weeks, until I actually spend the two weeks to do it, and realize it’s actually going to take four weeks. And so when you try to apply these traditional software engineering project management things on these projects it doesn’t work. It actually causes harm in a lot of cases. I think there’s actually a new discipline that needs to arise. You know, I have some ideas about how we like to manage projects successfully but it’s not like I can find a project manager that shares these views with me out there so it’s pretty challenging to find somebody who actually thinks about it in this way.

Marshall Presser: And I think one of the points that Don makes is that data science, data engineering, platform architecture are much more sort of waterfall than application development. I mean, if you’re going to build a platform, yes, you can build version one of the platform, and then version two, and then version three. But building version one of the platform, and building version one of the data takes considerably more time than building version one of the application, whatever the application does. And similar to Don’s point, you don’t know how long it’s going to take until you spend some time investigating the problem. So, they’re at odds with each other, and there’s a tension in this world. And until people sort of come to grips and understand this, and the customer comes to grips and understands this, I think we’re going to continue to have this problem.

Don Miner: Yeah, I agree. And also, on the waterfall thing, I want to mention one of our key tenets that we like to follow, which is something we call end-to-end data science, which basically means that when we build a project our first goal is to get from data extract to some sort of data result in a UI as quickly as possible. When I say as quickly as possible I’m saying like a couple of weeks. So, instead of using a real data platform we may just use like a file share and some prod scripts. Instead of real data science we may just pile something into a logistic regression and just shoot out results. And then for an application maybe it’s just like a straight HTML page that just shoots out the table, right? It’s hard because you want to do it waterfall…that’s the natural thing to do, but you can take your raft down the first two waterfalls, and then like fall off a cliff on the third one, right? And you want to kind of like scout it out a little bit first. And all that stuff that you build in the first iteration probably within a month is going to be all scrapped. But there’s that exploration aspect to the project that’s really hard to encode in traditional project management where we want to build this thing end-to-end and see how it works, and we’re not going to keep any of this code. And there aren’t JIRA tickets for this either. And so I think that’s really hard. And then, once you kind of get things going you can start moving more towards the waterfall approach. The other thing I wanted to mention about waterfall as well, especially on the platform architecture side, and the data engineering side, is I think like AWS, Google Cloud, and Azure, and the adoption of that in major organizations, and also then you have the other stack, like parts of the stack, like Cloud Foundry and Kubernetes as well, which is really helping from an infrastructure side, build a more agile infrastructure.
So, I mean, 10 years ago when Marshall and I were going around working on Hadoop clusters, they had to build the Hadoop cluster first. Like, that was not trivial. They had to go buy hardware, they had to network it, rack and stack it. I have physically helped rack and stack things just so I could get to my project faster. And cloud has really shortened the waterfalls to be like smaller waterfalls at least. And so you can kind of, a lot of times, get away with this hybrid like, do everything at once, and waterfall kind of at the same time, which is really what you want to try to strive for, I think. I’m actually not sure. It’s kind of a mess no matter what way we look at it.
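
[Editor’s illustration: the “end-to-end data science” first pass that Don describes (a raw extract, a quick logistic regression, a bare HTML table) might look roughly like the Python sketch below. It is not code from Miner & Kasch. The inline CSV stands in for a file on a file share, and the column names, the toy churn data, and every function name are hypothetical. The regression is hand-rolled with plain gradient descent only so that the sketch runs without third-party libraries; a real first pass would more likely reach for an existing library.]

```python
import csv
import io
import math
from html import escape

# Hypothetical extract: in a real first pass this might be a CSV on a file share.
RAW_EXTRACT = """customer_id,visits,churned
a1,1,1
a2,2,1
a3,3,1
a4,7,0
a5,8,0
a6,9,0
"""


def load_rows(text):
    """Extract step: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Model step: one-feature logistic regression fit by plain gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


def render_html(rows, scores):
    """UI step: a straight HTML table and nothing else."""
    body = "".join(
        f"<tr><td>{escape(r['customer_id'])}</td><td>{s:.2f}</td></tr>"
        for r, s in zip(rows, scores)
    )
    return f"<table><tr><th>customer</th><th>churn score</th></tr>{body}</table>"


def run_pipeline(text):
    """Run extract, model, and UI back to back; expect to throw all of it away."""
    rows = load_rows(text)
    xs = [float(r["visits"]) for r in rows]
    ys = [int(r["churned"]) for r in rows]
    w, b = fit_logistic(xs, ys)
    scores = [sigmoid(w * x + b) for x in xs]
    return rows, scores, render_html(rows, scores)


if __name__ == "__main__":
    print(run_pipeline(RAW_EXTRACT)[2])
```

[The point of the sketch is the shape rather than the model: all three stages exist and connect within one short script, so problems with data access, modeling, or presentation surface early, before any one stage is built properly.]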

Marshall Presser: Now, I think you’re spot on there. I think you’ve really got it because the ability in an open cloud platform to build something in a half an hour, where it took me more time to cable up the switch, much less do anything interesting. The ability to spin up an environment that, as Don points out, you can play with and you decide, “Nah, this is not what I really need. But it was great getting me to the point where I knew what it is I really needed.” I mean, that’s absolutely critical, and that’s made my job an order of magnitude easier. I’m a database guy. For me, I go over to my favorite cloud platform and run a cloud formation script and, depending upon how big, and what I want, by the time I go to the coffee machine, fill it up, drink my cup of coffee, my infrastructure’s there ready for me. And frequently it’s built with a whole set of tools that people other than me need in the project. That the application people need, that the data scientist people need. So there’s a whole ability. And if we decide it’s no good what does it cost us? It costs us, what is it? 20 bucks an hour for a week? Then we build something that’s useful.

Don Miner: Until you forget to turn it off, and then a year later you realize you owe Amazon $20,000.

Marshall Presser: Yeah. No, I’ve actually got scripts, a cron job, that shut down my infrastructure every night and on the weekends. So, I’ve been there. I’ve been there running something for six months because I forgot to turn it off.

Don Miner: Yep.

Ann Spencer: There have been a lot of things mentioned so far about why this particular problem hasn’t been addressed. Because it’s hard, because there’s a lack of project management, or product management skills, or things that are emerging. I have heard various things such as, there needs to be a higher degree of respect, of empathy, grown ups, project management, etc. I’ve heard a lot of things about what needs to be in place to address it as there’s no Master’s of Product Management for data science projects, or data projects. Are there any other aspects that either of you can think of, in addition to being aware of workflows, or doing more experimentation, or things like that, that could address some of the collaboration issues? Or address moving things along, or that you’ve seen that has worked really well in terms of having a project move along with more ease, or anything like that?

Marshall Presser: Well, from my perspective, having everyone in the room on day one is hugely important because decisions will get made without relevant players in the room. And involving the customer as much as you possibly can is also a huge win. There’s nothing worse than building something, spending weeks on it, months on it, and then the customer says, “You know, this isn’t really what I wanted.” So, those two things can help. I think it’s, in many ways, a corporate culture kind of thing. There are organizations that work well together, and there are others, and I work a lot in the federal government space where this project consists of people from various organizations that are not part of the federal government. Outsourced project management, outsourced database management, outsourced this, outsourced that. And there’s a little fighting over fiefdom here, and a customer either can’t do anything about it for contractual reasons, or chooses not to do anything about it. But that’s the opposite of what Don was talking about in terms of empathy and respect, and it’s driven by, in many ways, where the revenue dollars are coming from. So, I find that some organizations I like working with, some organizations I don’t like working with because the corporate culture is not one of sharing, of empathy and respect. So choose your partners well.

Don Miner: Yeah, I think that’s a good point. The best organizations that I’ve worked for have been the ones that fostered open communication, no competition within. Not very many egos…you can get away with it in a lot of other things, and data projects are not one of them. That’s the issue, is that I think in an organization that has maybe found that successful for other types of work that they’ve done, in this case it’s not very successful… my answer to the question of what I’ve seen work well, I think one of the big ones to me is, everybody having a good energy, and knowing what the goals are. And I think that also ties into corporate culture as well. A corporate culture that has very clear goals, or a leader that has very clear goals, that’s being very transparent about what those goals are, allows everybody to align themselves, their little micro interactions throughout the day, to be part of those goals. Also, goals in data science are often weird. Sometimes they’re not straightforward. The other thing that I wanted to add onto what Marshall said about fundamental communication, because I do agree that too often not all the stakeholders and not all the people who are going to be the different identities are going to be involved early on in the discussions. This is actually where a lead data science liaison type role fits in a company where you don’t necessarily need your data scientist, like at a large organization, being involved in every decision, but having a data science leader, that’s a chief data officer, or chief data scientist, or whatever the title is, I don’t think it’s really nailed down, is involved in these scoping meetings. We’ve seen that be successful. Maybe another thing too on the communication standpoint, I’m actually going to provide a vote for real remote collaboration tools in working, so I agree that in the beginning of the project it’s really good to get everybody in the room due to the amount of communication that needs to happen.
But also, too, email feels almost too slow for these projects. Data scientists are kind of trickling in on insights, and data engineers too, are running into different problems in an ad hoc way as they’re actively working. So we use Slack a lot, I think a lot of people do right now and it’s been pretty successful, because you don’t have to bunch up a bunch of stuff to put into an email like, “Here’s my list of problems today.” Maybe you may have two data scientists talking about an issue and the data engineer is eavesdropping and saying, “Oh hey, by the way, this is how I designed it,” or like, “Oh hey, yeah I can fix that for you real quick. Not going to take me much time at all.” So this more real time communication is good, and I think also too, it’s almost better than in a physical office in some cases too. Even if you’re sitting at a desk, three desks away from the data engineer, you still have to get up and go bother that person. Here, I think I’m actually making the argument that I think Slack and other things like it, may actually be one of the best tools for this thing right now, as the project’s going on.

Marshall Presser: Yeah I agree with you. One thing you said, I’d sort of like to elaborate on or get some more information on. The notion of a data science liaison person as part of the project, does this person need to be a working data scientist or does this person … can this person be, I don’t want to say a project manager, but someone who understands data scientists, but couldn’t necessarily code it themselves?

Don Miner: Yeah, so this is a great question and I think this is a role that, in a consulting construct, both myself and Niels [co-founder] provide in some of our larger projects. And it’s a really necessary role and some of the other customers we work with, we’ve made this recommendation for them to do this, it’s actually two reasons. One is that, data science requires a lot of focus. When you’re working on a data science problem and you’re fumbling with some machine learning thing, you’re messing with the data, an interruption can break down a house of cards in your head that you’ve been building for multiple hours and if you’re responsible for going around to random meetings to discuss use cases and things, you’re never going to get anything done. So I think what you need to do, is you need to kind of pick somebody. I mean honestly, there are some personality types that are better than others, but really it needs to be somebody that could do it if they had to, that understands the real problems, that can represent the data scientists that are actually going to do the work in these meetings. But due to the focus requirement you kind of need to pick somebody to be the sacrificial person to do it, that’s okay going around and talking from experience so that the others can focus. It’s a really important role in a large organization with a large team.

Marshall Presser: That’s a good point about the focus you need. Sometimes, even as a data engineer, sometimes you just need to sit there and not be interrupted by meetings, by discussions, by anything else. I know sometimes my wife will come in while I’m working, and I’m slightly rude to her, and I say, “Honey, can this wait for 15, 20 minutes? I got to work this out.” And I suspect that’s true of all technical people, not just data scientists.

Don Miner: Absolutely, I just think data science in particular I think is harder to start and stop for me personally in terms of other types of technical work I’ve done in the past. And actually, kind of adding on to the fact, we’re a consulting company and we actually insist on the ability to work away from their offices for at least three days a week. Early on in the company when we were just sending our guys to the site every day, the customer would go bother them with other stuff, they’re like, “Hey, can you come talk to us about this use case?” Or like, “Can you teach us how this thing works?” And we never got anything done and then at the end of the project they’re like, “Why didn’t you guys get anything done?” Well you guys kept bothering us. It’s really important to think about the people on this project, all of them. This is, like Marshall’s point, this is true for a lot of the technical people. I just think, kind of like a similar theme we were talking about earlier, data science just makes it worse because it’s a little bit harder. So you need to allow them to focus at some point.

Marshall Presser: That’s an interesting point and let’s get back to Slack on that, which I like Slack sometimes, but I find it too intrusive. So when I’m working on something really hard, I basically turn off Slack so I don’t get interrupted by it. There’s this balancing, like everything else in life, there’s a balancing between too much and too little, and the wisdom is finding the golden mean right there in the center.

Don Miner: Yeah, absolutely. I actually tell consultants to turn their Slack off if they want to. We could probably have a whole discussion on how to use Slack effectively. Plus, we have a lot of opinions about that.

Ann Spencer: Do you think that it would help workflows if there was someone who was dedicated to be on the client site to address some of the questions whereas the other two data scientists, or the other two people, would be offsite, right? Going back to the whole project manager, or product manager, or data liaison is the one that’s onsite, but then everyone else is offsite.

Don Miner: Yup.

Marshall Presser: But do they need to be onsite? Do they need to be onsite or do they just need to be available?

Don Miner: I think it depends on the client. You got some people that are more … some corporations are so large that people have remote meetings … they have Webex’s just because they don’t feel like walking across the building, right? I think there’s some corporate companies that are just capable of working remotely. Culturally, whatever, some of them are not. We have a client that really needs hands-on attention. I think, and actually from my perspective, which I think is unique as a consultant, is when we pitch our largest project that we have, we separate out the advisory services as a separate engagement from the actual data science project that we’re doing. We identify a separate individual for that as well. And that’s part of the package of what we’re usually doing, and we’ve identified that as something that needs to be there. And if a client says, “You know what, we don’t want to pay for advisory services. I think we’ll be okay with just the project.” I mean, we usually talk them into it because I think it’s really important as part of being successful.

Marshall Presser: Which gets me to a point that I actually have that I didn’t talk about, which is that from a future state perspective, I think specialization of roles is only going to increase. We’re going to get people who are purely data scientists, people who are purely application developers, people who are purely data engineers, people who are purely platform architects, people who are purely liaison, people who are purely project management that may tie in to the liaison role, and keeping these people coordinated so that they can, one, speak a common language and they have sympathy and respect for one another, I think that’s a challenge going forward. But once we solve that problem, it’ll be great.

Don Miner: Well, actually, I think this … Ann, do you mind if I make a closing thought here?

Ann Spencer: Oh go ahead. Feel free, because I think we’re definitely at the future state.

Don Miner: I think on Marshall’s point, I think the biggest problem here about this lack of process around management, around data engineering, the communication between data engineering and data science, this lack of management, if you want to specialize, you want to have a data liaison, you want to have a data engineer specialist, because the earliest data science project, like the smallest one, the data scientist is doing the data engineering work too. And probably the platform architecture work too, and the application development. Once you start specializing, which is why we have data engineers and data scientists now, these two people need to have a process to communicate. When you have an application developer, now they need a process to communicate and work together. You have the platform architecture, you got management, you got the advisory liaison person, you got the rest of the business, it all is about process and, honestly, I don’t think anybody really knows what they’re doing. I think the number one thing that’s holding us back in this industry is building large data science teams and organizations. The most successful data science teams I see right now are like three people. I mean, it could be a massive organization, but those three people are getting a lot of work done, and if they wanted to scale up to 20 people, 40 people, it’s not going to work. I actually have a specific anomaly that I saw the other day, where I’m hiring a new data scientist in Denver. Particularly wanted a senior data scientist in Denver, so I posted a job opening on LinkedIn for a Denver data scientist. I got something like 30 applications in a few days. 11 were from one company.

Marshall Presser: Oof.

Don Miner: I was like, “What the hell is going on there?” So I asked some of my colleagues that are in Denver, saying like, “What’s wrong with company X? I just got 11 applications from data scientists from this company.” First of all I didn’t even know they had a lot of data scientists, and they said … because my guys are data scientists, and they were saying, “Yeah, their job openings are all over the place. They hired a crazy number of … hundreds of data scientists over the past two years.” And I mean, now obviously they’re hemorrhaging, because they probably didn’t actually think about how to communicate. I think that’s where I would like to see the world go, is if we had better processes, just like we got through on the software engineering side, continuous integration and testing, good UX principles and things like that. We can build really scalable software teams now. Data science isn’t there yet.

Marshall Presser: No, it’s not.

Don Miner: And I think data engineers … the topic of the data engineer and data science thing though, is the tip of that spear.

Marshall Presser: Good point. Well taken.

Ann Spencer: I know that folks have to go, so I appreciate the time.

 
