In this Data Science Popup panel led by Michael Manapat, Product and Machine Learning at Stripe, we learn about practical applications of data science at Gusto, and instruction of practical data science at General Assembly. The panel members are Daniel Sternberg, Data Science and Engineering Lead at Gusto, and Kiefer Katovich, Department Chair for Data Courses at General Assembly.
Daniel: I’m Daniel Sternberg, and I lead the Data Science and Data Engineering team at Gusto. Gusto is an all-in-one payroll, benefits, and HR platform. Started with payroll, grew out to offering health insurance in 10 states as well as a full suite of benefits and HR tools for small and medium businesses.
I joined Gusto about eight months ago to lead that team. Before that—I actually got my start in data science. I did a PhD in cognitive psychology. I was studying human learning processes, a mix of behavioral experiments on undergraduates and computational modeling of the results in those experiments as well as neural networks for studying human learning processes.
I was finishing up my time in grad school, I really started to think about what I liked doing in the research process. What parts of my job at that point in time, you can call it, I really enjoyed and what parts were less interesting to me.
What I found was the parts I really enjoyed really aligned well with this new field that was sort of growing up around this time, which was about 2010, 2011, which was doing data science.
I joined a company called Lumos Labs. They made brain training games, Lumosity. I was the first data scientist hired there. It was a great opportunity for me because it was give me the chance to leverage my background in domain expertise for their product. But also to then get into a pretty fast paced startup, which had about 45, 50 employees at the time, and learn about more doing data science on the business side, in the product.
There were a bunch of different areas that I was able to execute in there. Then eventually to grow a small team there. That was a really great experience. I did that for about 4, 4.5 years.
Then I found myself wanting to work on a different set of problems and also work on probably a product that was very—I guess one thing that drew me to Gusto was a really concrete product that I wanted to work on, where they were very specific. It was serving a core need that lots of small businesses across the country have. The data set we were growing was really interesting.
It was an opportunity for—in this particular space, there’s a lot of opportunity for innovation and using data that you get from this employer-employee relationship, which we have a lot of data about, in a way that people don’t really have expectations about how data is going to be used in this type of product And we want to build. We want to really build new features that really help our customers be data informed. That was a really exciting opportunity for me.
Kiefer: My name is Kiefer. Daniel and I have a surprisingly similar trajectory. I got a master’s degree in psychology and neuroscience. While getting that, I was also working as a lab manager for about three years.
My original intention was to get into neuroscience, do a PhD further. But I realized, similarly to Daniel over the course of working in the lab that the parts I really enjoyed were the programming, the data manipulation, data science aspects. Funny enough, there existed this new domain called data science, and I was drawn to it.
I was the second data scientist hired at Lumos Labs. Daniel used to be my boss. Here we are at the same panel together. I worked there for two years before shifting over to working in education at General Assembly.
I’ve been at General Assembly for about a year as a lead data science instructor. I am just now the Department Chair of Data Science. That will be a different role where I’m more in charge of the curriculum management globally and the strategy going forward for what the data science immersive curriculum means.
Michael: Great. Let’s dive into what you actually work on. Daniel, Gusto and Stripe have a lot of similarities. We’re both doing analysis on event streams of financial data or event data. Can you tell me more about what you work on, what your big data science and engineering problems are?
Daniel: Yes, for sure. One thing, just to set some context. You know we’re pretty early at doing data science at Gusto, which is an interesting place to be, especially having worked at another company previously and built out a team there.
There’s a lot of opportunity to use data, because this company has four years now—has been around since early 2012, it’s actually five years now—worth of transactional data from payroll. Now we’ve grown out benefits over the last year and a half, and so there’s going to be some really interesting data there as well.
One thing we have in common with Stripe is there’s a fraud risk vector in our business. This is specifically on the payroll side.
So most of our small business customers use Automated Clearinghouse or ACH to do payroll. This is what you standard—how you get paid via direct deposit on your paycheck. It’s also how you pay your credit card bill online. You put it in your routing number and your bank account number. It’s a very low fee transaction. We can actually eat that fee.
Now the one thing that’s really interesting about this is we have, through our bank partners, the ability to transfer debit money from our customers’ accounts and then credit money to their employees through our account. What this means is that, if a bad actor who has a stolen bank account wants to steal money from that stolen bank account, they can try to use Gusto as a platform to do that. From very early on we had built out a risk engineering and risk operations team to make sure that the customers who are onboarding onto Gusto were legitimate companies.
And so we actually have a number of years’ worth of this data at this point. We have humans who actually have been in the loop on this for many years now, looking at all of these signals from our own data as well as from third party data sources that we pull. We get a lot of information when someone signs up as a payroll customer from social security numbers to EINs to a variety of other sources of information. We also pull all sorts of third party data.
So one of the projects that we’re working on right now is really to try to start automating part of that process. Also, even for companies that are sort of on the edge, and we’re not totally clear, there’s a lot of companies that are very clearly legitimate. There’s are a few, a small number of very obviously fraudulent cases. But there’s also a lot of stuff in between where we do want to human in the loop to really dive in and look into details and maybe ask for additional information from the company. But we also want to help guide them.
And so for us, using machine learning is helpful there to automate part of that. But it’s also helpful to help guide the decisions that the analysts are making as well. Because it is a complex problem that requires some sort of forensic investigation in some cases.
So that’s one area that we’re really focused on. Then another area that I’m really excited about longer term is what we can do with the data that we’re gathering to actually help our customers make better decisions.
So a really great example of this that we’re really excited about is how can we help small businesses, as they grow, think about what benefits they should add for their employees and how to stay competitive. This is something where we actually have a lot of information about this. Because we have, in addition to our benefits products, we actually have all of the deductions that are on all of the payrolls that our customers are running. And so we can actually see what companies of different sizes tend to offer to their employees at different points in time. That gives us a really unique opportunity to give this kind of information back to our customers.
And this isn’t something that people really expect in the HR space. But it’s something that really fits with our mission, which is about helping make work empower a better life for employees and employers and really own that employer-employee relationship. That’s something that makes a lot of sense for us.
Michael: Awesome. Kiefer, something we discussed before together is, if you think about machine learning data science and technology from 10 years ago, we were working on things like recommendation systems or click prediction. In recent years there’s been more of a trend towards using these techniques in finance, so things like time series analysis, forecasting. I’m curious what would you do in General Assembly, if the increased use of data science and finance is reflected in your curriculum, and how you’re adjusting sort of the changing trends, and how GA has applied it.
Kiefer: So we don’t specifically focus on finance in the curriculum as it currently stands. I do think that financial data is some of the most interesting time series and forecasting type of data. I think there’s a lot of interesting directions currently in it, particularly with, for example, recurrent neural networks to do forecasting better than classic techniques like ARIMA.
I think that forecasting is an interesting, almost historical type of analysis. When we talk about ARIMA and autoregressive analysis, you can start with the fundamentals and go all the way down to neural networks in that single domain. I think that it brings up a different topic perhaps.
I mean I don’t have financial data analysis experience in industry. But I do want to talk about this idea of how you need to teach fundamentals before you dive into really advanced methods. This comes up a lot in forecasting, where you need the students to understand statistical analysis, the basics of correlation, all of this stuff before they can dive into something like neural networks, which is the hottest trend. And understand how to validate what they find.
This is particularly important in financial data, where you may be making decisions that really have big impacts on people’s lives or on your company’s bottom line. When we train data scientists, you have to train them—and Daniel I’m sure will speak to this at length—to think like scientists and to be rigorous in the way they think. And to make sure that they understand and repetitively practice fundamental topics in data science.
I think going forward, I will focus even more on fundamentals of statistics and programming, so that students going forward coming out of the course will be more confident and more capable when they tackle more advanced projects in data science, like forecasting financial data with a neural network.
Michael: I want to unpack two things there. One is about the foundations and the second thing is about neural networks. But first, on the foundational piece, I have a lot of people at Stripe to work on machine learning on my team, for example. They’ll always say something like, I want to work on machine learning, but I don’t really have a math background. My response is always, I actually don’t care about that. At least, at Stripe machine learning is much more about engineering and about building production systems, because pipeline to future engineering, scoring production, these are all harder problems for us than the core of how do you optimize hyperparameters for random forest.
So I’m curious for both of you, Daniel, what kind of statistical techniques are you using everyday? What kind of foundations do you need? And what exactly are the skills that you want your data scientist engineers to have?
Daniel: That’s a great question. I definitely agree that there’s a base level of statistics that I think is important, especially if you’re at a company that also has an experimentation component to doing data science there.
So when we both worked together at Lumosity, we did a lot of A/B testing and multi-armed testing and things like that. In that domain, being able to set up tools so that PMs and other people in the company can understand that in a rigorous way and doing that correctly does depend on having a certain level of expertise, if that’s the type of area you want to go into.
But on the predictive modeling side, I definitely align with the idea that the training of the model and figuring out what those hyperparameters should be is not really that complicated once you get into it. But I think what’s actually more important is, and I agree that especially at a fast moving start up, where you may not have the resources to have dedicated engineering support to reimplement everything you do as a data scientist, having a solid programming background and ability to contribute to actual production code is really important. I think that’s going to depend on the type of company you’re working at and the stage that they’re at.
But I think one thing that you definitely do need is this—and Kiefer already alluded to it—is this sort of scientific thinking that I think honestly can only really come with experience. That’s what helps you, on the feature engineering side or even just the feature discovery side. Where you have a model you’ve built. You have some information about what’s important in that model. The next question is, maybe the question is how can I make the model better? What other data sources could I find that might be relevant for this problem? And so there’s a scientific thinking process I think that really goes into that in sort of having good intuitions about what the results of your model mean, especially when you’re using things like random forest.
Now neural networks, I honestly haven’t worked with neural networks since I was in grad school. I think part of the reason for that is that a lot of the problems I’ve worked on are ones where being able to go back to someone on a product team or another stakeholder, I need to be able to explain something to them about what I’ve found and what I’ve done and how they can use it.
So in that fraud example is a really good example of this, where we’re probably not going to fully automate that process today. But we’re going to do some automation, and we’re also going to give some insights back to them that they can use to make better decisions and maybe figure out what they need to investigate about a given company.
And so that’s something where using models that have some—that are slightly less black boxes. I wouldn’t call random forest not a black box at all. But you can get some insight out of it, certainly more than our current neural network—is really important in that case.
Kiefer: I think that this communication aspect is very important. I think that it’s a flawed sort of perception that is commonly popular these days to think that data science is all about optimization.
In fact, when you actually are doing data science work for a company a huge part of it is communicating what you found to other people. As Daniel’s saying, these black box approaches make it much harder for you to do that, and make your findings a little bit less believable to other people at the company.
Daniel: And I want to step in and just say that this isn’t to say that we don’t like the idea of neural networks. There’s huge problem spaces where they’re critical in showing huge value for a lot of different types of businesses and a lot of different types of problems. But especially when you’re building data science from the ground up in an operating company where that’s not the core business, the core business is not machine learning in and of itself, that’s sort of where it becomes harder to start with that type of model, I would say.
Kiefer: I like to call that Kaggle mindset, which is just like this focus on marginal optimization of some like target thing, like an R-squared, right? It’s like, oh, I can get like 0.01 better R-squared if I make my model a neural network instead of a random forest.
I think that sites like Kaggle, while they’re great, they promote this kind of thinking through leaderboards. I think that it’s, again, it’s a flawed perception of what data science really is.
You have to weigh is my model that much better that it warrants something that I may not be able to understand. In most cases, that’s not warranted.
And so I think that a big part of especially teaching data science is reinforcing that belief that you need to be able to talk to people about what you’ve done, right? It’s not just about reaching the first place.
Because when you’re working as a data scientist, it’s not a leaderboard. It’s being able to go to your CEO or the board and say this is what’s actually happening right now, and here’s my evidence for it. You’re not going to be talking about the design of your neural network and how many layers are in it, right? You want to talk about who is buying the product or what is causing profit to decline, that kind of thing.
Michael: Yeah, the Kaggle story reminds me of, I think, the Netflix competition. Where the winner of that recommendation system competition, the model was so complicated if I recall it correctly that they couldn’t even productionize it. Kaggle and Netflix, in the beginning these problems that are very distilled and are not actually sort of amenable to production solutions.
So one thing I ask people when I interview at Stripe for data scientist roles is where they are on the spectrum. The spectrum is on one end a pure data scientist, trained statistically, lives in R and Jupyter, but doesn’t touch production. On the other hand you have the engineer who’s very familiar with pipelines, knows how to do jobs and so forth, but maybe isn’t familiar with data science and the math. Needless to say, for the first few years of our existence, we were aiming for people who are right in the middle, who could do both jobs. I’m curious what your ideal profile is, how you’re hiring for your roles. Are you looking for hybrids or some of each of these two types?
Daniel: That’s a great question. We’re in the really early phases right now in terms of data science specifically at Gusto. We have a small data engineering team. I was hired as the lead for that team as well as the first data scientist. It’s early days in terms of that type of hiring.
I think right now, it depends on—it can depend on a few things in terms of hiring. The right answer is not the same for different types of businesses.
So for us at Gusto, because we’re part of the technical organization, also because we want to get a lot of these projects productionized relatively quickly, there is more of an emphasis on hiring people who can contribute to that production code base.
Now the scale of our data is such that there’s more flexibility, I would say, in the tools we use to do that than there might even be at Stripe. Because it’s big, but it’s not huge. That’s helpful.
But I would say, you know somewhere in the middle there is what we’re looking for right now. In my role at Lumos Labs, when I was hiring people there, there was a core R&D component and scientific component to that job. There was more of an emphasis on the statistics side than there was on the production coding side there. Because we had a large data engineering team as well, and so we could collaborate with them. It’s going to depend on where you are as a business.
I think that some of the much larger companies in the space now who have really rolled out data science at scale, so I would say like an Uber or an Airbnb, they have very specialized data science roles and those roles are deployed on to specific teams. I think that’s a point that most companies will get to eventually if they decide to invest heavily in data science, and they grow to be 1,000-, 2,000-, 3,000-person company. I think early on you’re going to get a lot further by finding people who are hybrids and strike a balance there, and coding skill is going to be super important.
Daniel: So a recurring theme that’s been coming up recently is that there’s actually a glut of people who have pure data science skills. All the PhDs who aren’t going to postdocs or academic roles, who are trained in statistics and math and not in engineering. That there are actually more of them than there are data science jobs.
I’m curious, Kiefer, if your curriculum is adjusted to focus more on engineering because of this hyperabundance of data scientists.
Kiefer: As a pure data scientist or at least closer to that, I am biased and have had to overcome that bias I think. The curriculum has definitely veered more towards rigor in programming and rigor in big data technologies. I think that will only be true even more so going forward. That the integration of the engineering side into the curriculum and maybe even replacing more advanced statistical methods is probably going to happen.
For example, my hiring is a little different situation, but when I am interviewing instructors to come on, I look for someone more similar to me who would teach a statistical side and someone more similar to, I guess my old co-instructor who is more of a strictly data engineering person and understands the intricacies of AWS and Spark.
I think that those jobs, those data engineering jobs are only going to grow going forward. I think that that kind of infrastructure work is going to be in high demand by all kinds of companies, big and small. I think a lot of the more pure data science work will be able to be handled, perhaps by a smaller team.
This is speculation, but it is a big reason why we want people to be so strong on the engineering component coming out of it. That you shouldn’t just go in and be like, oh, I know everything about statistics, but I can’t program a Python class.
Michael: Do you find that there are material differences in success rates after graduation from your program for people who have primarily engineering background versus a primarily quantitative background?
Kiefer: So I originally thought when I started that the people who would be most successful going to jobs would be the ones with heavy math backgrounds. It turns out that people who come in with really strong programming ability are able to absorb more, because it’s not an obstacle for them.
I think that’s also a facet of just the course being an intense 12 week boot camp, right, where you have all kinds of obstacles. If you can get rid of the programming one, you have more mental space to absorb the other material.
But I think if you’re strong in either one of those, you will show a greater success rate than someone who comes in with neither of them. That’s something I need to think about going forward for admissions, for example. Like, what are the requirements? What should we be looking for as a baseline for people coming in.
Michael: Awesome, thank you for both. We have about five minutes left. I wanted to open it up to questions from all of you, for either Kiefer or Daniel.
Q: What would it take to make your data science function 20 or 50% more effective? What would that look like?
Michael: The question was what would it take for your data science function to be 20 or 50% more effective?
Daniel: Well, right now, one would be more people. But in addition, I would say, and this is something that we’ve talked about a bit before in just earlier conversations as well. We sequenced this the right way at Gusto, I think, and this is really important. But it could be even better. Which is having a really strong data platform foundation in place before you invest heavily in data science is a really big enabler for your company. You can do that through like third party vendors and platforms like Domino. You can also do it through building your own tools internally.
It’s hard, it’s a tough balance to strike. Because a lot of the time, I think, with data science, you get a lot of the value of data from having the data science. But if you bring in a bunch of data scientists to start with, you’re going to be very inefficient. Because they’re going to be redoing each other’s work over and over again and writing random scripts to clean data. It’s a tough thing to sequence how you do it right.
So I think we sequenced it pretty well, where we got to a pretty stable place and started investing in data science. But there’s still a bunch of work there in terms of making it easier to automate the deployment of models and things like that, where having more data engineering support for that would probably make us much more effective.
Q: I’m curious how do the needs of your data science team change as you grow out to scale beyond just one or two people, and what are the problems that you’re looking to address as you approach the next stage?
Michael: So the question was how do the needs of your data scientists changed as you grow from with two people to the next stage and how you’re adapting to that.
Daniel: Yeah, it’s a great question. I’m lucky that I’ve worked, well, lucky—but I’ve worked mostly with small teams to date. This is something I think a lot about. This came up in my last period at my last role at Luminosity.
I think one of the big things that changes is you really want to have data scientists start working very closely with core stakeholders. There tends to be a move, once you scale toward this sort of deployed model of data science to—it’s a pretty standard approach. There are companies that don’t do it. I mean Netflix is a very different model, actually. They’d be a counterexample to this. But in general, I think one of the things that really changes is one of the reasons you want to work very closely with core product stakeholders, especially if you want to use data science in your product, is because there’s a lot of brainstorming that needs to happen with product managers about how can we really start?
Early days is like, OK, there are these core problems. This is sort of where we are right now. The core problems we’re trying to identify with—like help out with data science—are known core problems in the business like risk, like growth, where there’s a clear problem that we have a bunch of data about, and we could leverage.
Then the next phase is, OK, I have all of this data that I’ve gathered from building this really great business and building this really great product. How can I give that value back to my customers? And there’s a real sort of product brainstorming component you have to do there, to figure out what is the value of data in this business? And how can we actually find new markets to go into based on this? And that’s a real challenge I think you start to encounter once you get a little bit further along. It’s what I see us dealing with maybe next year.
Kiefer: Yeah, and the last thing you want is the data science team to be an island. Because you need buy in from the rest of the company. You need to maintain constant communication with all these other groups and understand what they need. They need to trust your work. In order to do that, you need to constantly be communicating with them and asking them what they need and what their problems are.
Michael: Kiefer, Daniel, thank you for being here. Thanks a lot. Thank you.