‘Lean’ Data Science

data popupdata science

In this Data Science Popup session, Noelle Sio, Principal Data Scientist at Pivotal, explains how to apply Lean methodology to data science.


Video Transcript

I'm Noelle. I'm a data scientist at Pivotal. I actually run the data science team on the west coast. And I've been a data scientist—I think I've had the title for about seven years or so—but have been a data scientist for about a decade.

What I do at Pivotal, so for those who are unfamiliar with Pivotal or haven't heard about Pivotal Labs, we basically are consultants. Over the past several years, I've worked with a lot of different companies trying to get started on data science or helping them where they don't have a data science team.

In that time, I've had varying degrees of success in terms of data science projects. What I learned very quickly about success in data science projects is not simply a result of how good your data is, how well the problem is set up, and what kind of magic math I can do in the background.

But really, the way that I measure success is how effective the change was. Did we actually solve the business problem that they actually asked us to do?

My title of my talk is about lean data science. As part of Pivotal, we have people at Pivotal Labs who embody the lean methodology.

So has everyone heard of lean? Yes? OK, good.

We wondered; we do data science this way. There are people over there who have best practices for building software, and doing things, and disrupting the way people build software.

I wonder if there's anything that we could learn and do data science differently. Not just machine learning, not just statistics and all the stuff that gets the glory, but the actual practice of doing data science—could we disrupt that a little bit? Are there things to learn?

This is a work in progress. I'm not claiming to have figured this out. But probably about three years ago or so, we decided like let's experiment with this. Let's see what we could do and we can learn.

So, some level setting—I think everyone in this room will agree with this statement. Data science is awesome. It's great. And why not? Just because we all believe the marketing hype and we're all unicorns.

But this statement—data science can make blank better. It kind of is magic a little bit that you hear like data science, if you just apply some data, can make something that you're already doing better. If you're just running things on assumptions or the finger in the wind, like if you apply data to it, you probably could make it better. Or it could revolutionize and start new things that you didn't know were possible. Think of like where would we be without Netflix, right?

Tons and tons of things are now available to us because we use data. And by better, things that are better, is that they're actually more efficient and they're more effective.

If data science is so amazing, why isn't everybody doing it? It's a lot of work, right? If we've learned anything from these talks here today at this conference, it takes a lot to make data science successful.

There's a huge laundry list and there's a ton of things out there like if you want to get started on your data science team, here's what you need. And it comes with this huge checklist.

You need data. You need data scientists, which are expensive and hard to find. You need technology. You need a data-driven culture. You need organizational alignment. You need a partridge in a pear tree. You need all of these things in order for data science to be successful in your organization.

Even if you have all the things or a subset of them, you can still struggle with making data science successful. I don't know if some of you have seen the articles that have come out lately where people have said, oh why am I not getting ROI?

Like “I did the thing. I get all these things and I hired a data scientist. But nothing's happening.” Like “Why don't I have any ROI for my data scientist?”

It's because we have to set these things up properly. And there's a lot of challenges that we didn't think of before. Isn't there a better way to do this so that we don't just build up this hype for all these data scientists? And suddenly they're not effective and people don't get ROI.

I think I saw it flashed earlier. But they know that people have subscribed to—everybody heard of CRISP? CRISP-DM, a way of looking at or solving data mining problems. I think a lot of places do data science this way.

It's very waterfall. You have a project. You set it up. You put the smart people on. And a couple months, you come back later and you're like, “Oh. I expect a solution. Ta da.”

But can we disrupt that process? Can we define what that is and then disrupt it? I had to think about this. What if there's something that we can do better? Because we're data scientists. We experiment. This is what we do all day. Can we do data science on doing science? Can we do it better?

First, let's think about what data science is. What is it that I'm making better? In your organization, where does data science sit?

I think me, as a data scientist, I've reported to pretty much everybody. You name a part of the organization from R&D to marketing to product to finance to sales. You name it, I've been a data scientist in that part of the organization.

When you think about what a data scientist does day to day, what is it more like? Is it more like solving R&D, or is it more like building software? In R&D, the way I think about this is they said, “I have a hypothesis I want to test.” Then the whole project is based off of how well you can prove or disprove that hypothesis. Your results are very much focused on the actual model that you're building.

How well did I get my model to work? How well did my algorithm actually do? What's my AUC? Those are the numbers. But the problem with this—and I think actually that part is a key part to pretty much all of our projects—is that there is no path to implementation. It just kind of lives and dies by that.

Who do you report those results to? Do you tell the CMO, “Hey look. I did a great thing. Look at my AUC.” You don't tell the CMO that. So when you have R&D, this was the most frustrating part for me for being in R&D is that's where a lot of your projects just die in PowerPoint or on your desktop.

Is data science really more like building software? The success here is measured by the impact where the focus instead is on the decisions that are being made. Are we changing how businesses run?

I've come to the conclusion that data science is more like building software. Is there a way that we could be doing this better, especially since a lot of data scientists actually aren't software engineers?

As I mentioned, I'm part of Pivotal. I looked over at my colleagues who are in the app development side of the house, and dev, designers, and product managers. And I thought, “They know what they're doing.” And what is this hype about this book, The Lean Startup.

Could we do lean data science? Could we disrupt the way that we're doing things? And so, in the past three years, there's a lot of buzz words. There's already a lot of buzz words when it comes to data science. There's a lot when it comes to lean. And so there's a huge laundry list.

Maybe you're doing a disservice to claims, then the next 10, 15 minutes I'd go through all of them. So you probably heard things about pair programming, test-driven development, CICD, iteration planning meetings, backlogs, et cetera, et cetera.

I don't want to go through all of those. And I certainly—as I said earlier, we're not done experimenting with this. But I want to examine instead how experimenting with this has helped with a few key pain points that I think everybody has.

One is this, the silver bullet. Again, who's been the data scientist that you've hired? And you said, OK, all we need to do is hire data scientists. The data scientists will come up with this magic answer. Ta-da, that's all I need. I need a data scientist.

Doesn't work that way at all. I want to dispel the myth that data science is the same as having a magic silver bullet. One of the lessons—and again, I think a lot of people are probably familiar with this cycle of build, measure, and learn.

This is something that I think everyone's kind of doing all—everybody is doing already when it comes to the mathematics of the algorithms itself. You don't ever do build a model once. Nobody ever does that, unless you are completely out of time and you literally only can run it once. You never do it once.

You build something. You measure it. You start over. You iterate again. We've seen a lot of people do this. But I think what happens here is people tend to be very limited in terms of what they tell outside of their organization.

How often have you run a model and gone off, done something, come back, and the business stakeholder says, “I can't use those results. They're based off of blank.”

Or, you come back and find out, oh, there's something wrong with your data. Turns out you spent all this time cleaning it and doing all this stuff downstream, but, hey, we're working off of the wrong data set. That's because you're not getting feedback early enough or often, right?

So build, measure, learn is about—for me was about, OK, how do we expose our process earlier and often? How do we actually get feedback?

I want to start this notion of what is a minimum viable model. What is the—the way that I define it, what is the smallest thing, smallest model that you can build on the data that you have and the question you're trying to answer to just get something out the door, right, and not necessarily to push it out to production, but just to get some learnings, to socialize it with all the people and the stakeholders outside of your group. What is the smallest thing that you can build? All right.

Especially if you're building something into software, do they need a really, really accurate model to build that software? Think about it.

If you want to say if Amazon wants to sell books, so they said, “OK, what are the books that—I just need to show a list of books to recommend people.”

They can just recommend Harry Potter to everybody. That would be an MVM. It would be terrible, but it would be a model. Or, they could just get the New York Times bestseller, right? They can do something in there that just gets proof of concept going. And you could start through the cycle again.

I think what data scientists tend to do, especially the younger ones, tend to go for the most complicated solutions first. They get caught in all these traps trying to do that. And so, instead, let's start with the minimum viable model and then build up for something a little bit more complicated. Overall, the benefits of sharing early and often will outweigh the costs, I think.

What I think is making data science successful is a shared responsibility. It's about teamwork.

The first pain point I mentioned was about the framing of data science problems itself. The second is a little bit about people.

Now, this guy right here. Everybody in the organization has this kind of person or at least has dealt with this kind of person. Everybody has a Sheldon Cooper type.

I like to dub them like the alpha nerd problem, right? And so, you know them. And so, they're off the charts really, really smart. And they can whip up really, really crazy, innovative solutions by themselves. And then they're also just a tiny bit eccentric or maybe a little bit, or a lot bit eccentric depending on who they are.

But there's a few problems with having alpha nerds. One is they tend to make solutions that are so complex that they're the only ones who understand them in a language that they only know and that nobody else can touch them.

Then they like—and they like to be lone wolves making that crazy solution that nobody else understands. And by doing that, it means they don't take feedback into consideration. They don't really work with others. They don't tend to teach. That behavior tends to alienate other parts of the organization.

When you're trying to put things into production if you have this alpha nerd that nobody else wants to work with, guess what? It's not going to work. That model is going to die because nobody wants to work with that person.

This guy is the opposite of being collaborative and communicative, which at the end of the day is what makes, I think, the lean methods work so well is having that communication and having collaboration because no person is an island.

Again, so I'm going to walk through a few more tactical things that we've tried to be more communicative and collaborative. Someone once told me—I hear often actually that the methods that lean and agile are about are turning what you do that are good and turning it up to an 11.

So if you think about code reviews, hey, code reviews are great. People should look at your code, more than one person should read your code. What happens if code reviews get turned up to an 11? So you're actually reviewing your code all the time. And that's what started pair programming.

Pair programming—who's done pair programming? Oh, old news for everyone here. But if you haven't—so you can imagine one computer, two people, two monitors, two keyboards, two mice, working on the exact same thing at the exact same time. And you can pair in different ways.

I've done it as a consultant, so myself and then the subject matter expert. I've had it as a data scientist with another data scientist or even a data scientist and something like an engineer or a designer, et cetera, as those mentioned in another talk earlier.

Some advice on how to do this, right? So this is great. It sounds great. Oh, yeah, we should all do data science. And we should all pair with it. The vast majority of data scientists have probably never paired. They've probably not even gone through a code review. That probably mortifies most of them, if not all of them. I know because I've asked. And they've looked at me like, no, I'm not going to do that.

Our organization had a lot of resistance, but eventually everybody tried it. And eventually, they realized, “Oh, there's so many things that you can do when you pair with someone who knows something that you don't.”

You learn more. You learn quickly. You can get information. You can transfer knowledge. And you can divide and conquer even.

What doesn't work trying to do pairing? One, first of all, is the blind leading the blind. So two people who've never done this before said, we should just start doing this, won't necessarily work.

Set up some guidelines in terms of how you do it. Ease into it. Maybe don't pair all the time. Pair for one day or a few hours at a time. Don't try pairing in a conference room. Not the same. I know this seems kind of trivial, actually on your keyboard in a desk with a pairing station, works way better when people actually feel comfortable and not just projecting their code on a screen, I think is a little scary. And then mobbing.

I learned this term, they're, oh, that's great. Pairing is great. Why don't we just make this better by adding more data scientist to it or more people who want to be data scientists? So instead of one person, let's have five people look at a monitor at the same time. No. Don't try it. It's way better to rotate through your people than it is to try to scale the problem that way.

The second is adding more communication, not necessarily adding more meetings, but communicating more. So a daily scrum—so for those who are unfamiliar, it's a daily check-in meeting where you can go over what you did yesterday, what you did today, and if you're having any blockers or need any help.

Then a retrospective will be at the end of the week where people can go over for an hour what went well, what didn't go well, and what's going eh, and are there any action items coming out of it. So that seems like a lot of meetings. I just added six more meetings to an already busy schedule.

But what this does, it replaces the meetings that people typically have. So usually there's like checkpoints every few days or every week or every couple of weeks. But what tends to happen is that's the only time people will communicate progress. And then if you have a problem, it'll just fester, right?

So it's like, oh, I had a problem. Oh, I'm not going to—they don't respond to email. And I'm not going hit I'm going to see them in a couple of days anyway, so it can wait. But those things really accumulate over time and slow down the progress. So add these to your routine. But you don't have to do it all at once.

Again, like stand ups, you could have—and I use scrum and stand up interchangeably. You can have stand ups two times a week or three times a week, see what works with your team.

Some advice though with the retrospectives, which are great because people can air out the things that are going wrong and knowing the stuff that's going well boosts morale. Don't let it turn just into a venting session. That helps nobody. It feels really good at the time, but helps nobody if there's no action items coming out of it.

Empathy—I know that they talk about empathy a lot in our product organization. And I think they—and this is usually intended for having empathy for external users like, oh, who's going to use the app? Have empathy for them. But I think empathy goes a long way internally.

I think it's, in fact, key to making collaboration work. And a colleague of mine had said, we are scientists and not experts. And I think that really is at the heart of what data science actually is.

We don't come and have all the answers up front. We don't. We're not going to have a magical thing in the back of our pocket. We do know how to run tests and experiments and the best way to get there, right?

People always ask, “How can you guarantee data science will have success?” I'm like, “Well, there's no guarantee in life really, but we know the best ways to get there.”

We can tell you what the data says. And data scientists aren't an island. As we saw, there's a lot of other people who help data science projects get off the ground in order to be effective. So have some empathy.

It's interesting too I've noticed that a lot of other professionals in organizations like BI analysts, business analysts, will say, “Oh, data scientists are over there. I don't even talk to them.” Or, “I'm afraid that they're going to take my job.”

You come into the room as a data scientist. They don't want to work with you because suddenly you're making them irrelevant. Or, I have data, but it's not that big. I don't want to bother you with my problems because it doesn't warrant—the data scientist doesn't need to look at it.

I think honestly we could all do a little bit better in organizations of not being this elite, unicorn data science that nobody can approach, right, really have some empathy. Some ways to do that is break down the data—the parts of your data science project that are relevant to people, so they can understand it.

Other people in the organization are your fans as well. They want to be able to say, hey, this model did really great. I love this product. I love how it works.

I can tell you sort of the moving pieces. I don't need to know all the math, but being able to actually say what data science the models are doing, it actually goes a long way. Data science, again, is a team sport, not quite—that's not the only part that should get the glory.

Some advice like what doesn't work here is only promoting that part of it. I think there's a lot of myth around like, oh, it's just about the algorithms. It's machine learning. It's this. It's that. And the data science is really more about the fact that we solve problems with data. I know it's kind of a fine line, but I think helps dispel that unicorn silver bullet myth.

Some final takeaways, problems can all have simpler solutions, right? Don't go for the crazy stuff first. Start small, iterate, work on it.

Two, more communication is a good thing, not more meetings, more communication. So make sure that you're always getting feedback in terms of what you're doing. It's way easier to correct your mistakes up front when they're small before they snowball into big things. And three is that we're all in this together.

By that, I mean not only is data science like a team sport internally, it also means that we're all in this together in this room. We're all pushing the boundaries of what data science means and does and how awesome it is.

I love coming to the events and hearing about what everybody is doing. And I heard—one of my team members told me this great quote yesterday by E. F. Schumacher that said, "Any intelligent fool can make tiny things bigger and more complex. It takes a touch of genius and a lot of courage to move in the opposite direction."

If you think about the way that data science is going or the types of solutions that people are building like you don't need to build this huge, crazy, monolithic thing.

Sometimes simplicity is a little bit better. So reality check, remember change isn't easy. Start small. Don't try to boil the ocean. It's still an ongoing learning process for myself included. And a huge shout out to my team who have allowed these experiments to run and are willing to roll with the punches. All right. Thank you very much.

I think we have a few minutes for questions if anyone has any.

Q: Do you often have to deal with the fact that people expect data scientists to come up with really complicated models and if something is really brief, they get disappointed]?

Noelle: Yeah. I think so. I think if people are disappointed with, oh, this isn't crazy. This isn't complex. It's just regression. Well, regression gets you a long way. In the way that the advanced machine learning gets a lot of hype, I think things like regression get the shit end of the stick, right?

Because people are like, oh, it's just regression. Hey, man, regression is still cool. Regression can still solve a lot of our problems 80/20. And something to also say too is that the more complex the algorithms are and the more esoteric it is, the harder it becomes to support that, right? How well tested is that code? Who else can run this algorithm, right?

If it is the MVM, the first thing you run, give it time. Dude, you had one week. Don't expect me to come up with some crazy algorithm, just in a week.

Q: My question is about the feedback. How do you filter which ones to implement and which ones to throw away? Because people can give tons of ideas. I mean, I just started—I just finished a project with data science and during my project I used to get tons of ideas. And it was really hard for me to discard them.

Noelle: Sure. So how do you decide on feedback? Yeah, So first of all, it depends on who are the sources coming from, right? And so, usually two things, one, it's always helpful to have another data scientist or somebody on the product side help with that, help you prioritize.

Is this something that is actually going to make the model better or the product better? Or, is this something that they're just picky about? And if it's something that they're just picky about, is it the person who is the main sponsor of the project, right?

So there's all of these vectors along which you can prioritize. So there's no one right answer to that. I think all of those things come into play as well as feasibility. So it's always kind of a push and pull between all of those things.

Q: Question for you, this one is since you mentioned like, you tried to follow scrum. How long are the sprints?

Noelle: So typically, we try to do one week, one to two week, one to two week sprints.

Q: When Pivotal goes into an organization and setup the organization and data science and implements scrum agile lean methodology—I asked a question early this morning and I was told that it's really hard to do that with data science. So my question to you is as opposed to doing a typical ERP or CRM application or something like that, what makes it harder or more challenging in data science to setup springs and pair programming and all the other methodical aspects of agile?

Noelle: Yeah. So again, this is an iterative process. So whatever the state that their organization is in, we don't do all of it at once, right?

For our experimentation, depending on where they are, we can add one or two things. Like, “Oh, hey you've never done pairing before.” Let's try that. Or, we need to—you're already doing sprints, two week sprints, this is the way that we do sprints. So it's not going in and for the data science side completely overhauling how they do data science already because that's not going to help with the adoption process.

But if we're going in there and they are already doing agile and lean for their engineering and we're building something where we have data scientists and devs together, it's easier to do that. I don't know if that necessarily answers your question.

Q: So a little bit of a continuation from the last question. As you're looking at data science, is there any key aspects of agile and lean that you see that don't work or that you or that you have to modify?

Noelle: Yeah. I think the most challenging ones are the ones that are very code driven, so CICD, TDD, those are ones that we've really—our test-driven development, continuous integration, continuous deployment are harder because not all data science projects are setup to be part of a product or are already set up to have that sort of data pipeline—or to have that pipeline.

Those are the ones. We also ask like, “What test do we even write?” How do you write a test? How do you write a story? There's a lot of things that we struggled with.

And part of it is are we trying to adhere too closely to agile in a way that's actually not beneficial? I can write a story in a backlog that says, I make a model that predicts x. And I've now summarized my entire 12-week project in one line. That doesn't work, right?

But then at the same time if you write stories so small that it's literally like I need to run this iteration this, I need around this iteration that, then all you're doing is opening and closing stories all day off of your backlog. And then that doesn't work either.

So, as I said, it's totally a work in progress for all of the individual and parts. But I think ultimately trying to hard is also what makes is the part that fails.

If you stick to closely, I need to do agile data science, which means I have to do everything in the agile manifesto to a T and just insert data science where I see software engineering that's the part that I think people have to realize it won't fit that way.

Any other questions? All right. Well, thank you very much.