We recently caught up with Claudia Perlich, Chief Scientist at Dstillery. Prior to joining Dstillery, she won a number of data mining competitions, as well as several awards from industry and academia.
Claudia, tell me a little bit about yourself and also how you got interested in working with data.
I grew up in East Germany, which is interesting because there was no advertising there at all. So I didn’t see my first ad until I was about 15 years old. Early on I was very interested in math and decided to study computer science in Germany for a few years, after which I came to Boulder, Colorado as an exchange student and that’s where I took the first course on what is now known as data science. The course was on artificial neural networks and that’s when I got hooked. I really enjoyed looking at fascinating parts of data and stories, attempting to model it all. From then on I continued on that path; finished my degree in Germany and eventually came back to New York City for business school. I got my PhD in information systems, while primarily still pursuing my interest in machine learning, predictive modeling and data mining.
Afterwards I joined the IBM research facilities up in the Watson Center, known for building the Watson computer, which recently won a Jeopardy game. There I was part of the predictive modeling group and worked on a variety of things using data and predictive modeling, either internally for IBM or externally for their clients who came in through consulting engagements and had very specific data related problems. I also spent a lot of time participating in data mining competitions and I won a lot of those :). I’ve always enjoyed the exposure to very different data sets and challenging myself to build the best possible solution.
How did you get started with Dstillery?
In 2010 I was approached by Media6Degrees, now called Dstillery, a company in the digital advertising space. They found me through my academic adviser at NYU, who recommended some of the techniques I worked on during my PhD for use in ad targeting. They asked me to join their team and I have been Chief Scientist at Dstillery ever since. It’s a great playground, and it reminds me why I love working in advertising: it’s really the ultimate opportunity to try different things and find out what works in data science and what doesn’t. We have incredible access to data and it has been both challenging and very rewarding. I find that I can disseminate some of the things that I learn and help bring my findings to medicine and other life-touching fields.
What was the first data set you remember working with? What did you do with it?
The first data set I worked on wasn’t very inspiring for me personally, but it was very inspiring as a mathematical puzzle. This was probably around 1995, the heyday of artificial neural networks, which have since been reworked into “deep learning”, but back then we didn’t have nearly the computing power. The project used educational data; we tried to see if we could predict student success based on grades and test scores. My most memorable lesson was that we couldn’t beat a linear model even though we really wanted to. We tried hard to make neural networks look good, and we just couldn’t beat a relatively simple model that performed at least on par, if not better. In my industrial career I have noticed that it’s always the simpler models that have an incredible amount of power if you use them well. After this experiment, I realized that good old statistics from the 50s could carry you a long way.
What specific problem does Dstillery solve? How would you describe it to someone not familiar with it?
Dstillery was built around the principle of using machine learning and predictive modeling to help zoom in on the right set of people and expose them to the right ads. When I joined Dstillery I didn’t have any experience in marketing, although having gone to business school I knew the basics. However, the fundamental shift that data has brought to marketing is the ability to leave many “coarse notions” behind. Marketers tend to think and talk about their audience in a certain way, for example: “Middle-aged soccer mom.” And the reality is, as much as I probably technically qualify, I hate to be classified in a group as a middle-aged soccer mom…! There’s a lot more to me. The facets of my interests are so broad that these very “coarse” descriptions of audiences, which marketers have used for a very long time for lack of better data, are extremely inadequate.
The modern way of looking at demographics is figuring out: who do you really want to engage? Who do you really want to show an ad to? Can we have machines help us reflect the full complexity of human existence? We can find people who are clicking on ads, but how do we figure out how to engage with them and also change their behavior? This is what Dstillery is trying to do: look at fine-grained information (the URL you looked at, the app you’re using, maybe the physical location where I saw you yesterday), parse all of this data in an anonymous way, feed it into an algorithm and then ask: “Given what I’ve observed about you at this very granular level, what is the probability that you are interested in this particular offer from this particular marketer?” Data and machine learning also give us the ability to say, “I’m only going to show ads to the top 1% of the population based on this ranking of interest in the product.”
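As a rough illustration of the "score everyone, then show ads only to the top 1%" idea, here is a minimal Python sketch. It is not Dstillery's system: the feature names, weights, bias, and user ids are all hypothetical, and the scoring function is a plain logistic model chosen only to keep the example small.

```python
import math

def logistic_score(weights, bias, features):
    """Estimated probability of interest: a weighted sum passed through a sigmoid."""
    z = bias + sum(weights.get(name, 0.0) * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))

def top_fraction(scores, fraction=0.01):
    """Return the user ids in the top `fraction` of scores (always at least one)."""
    k = max(1, int(round(len(scores) * fraction)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]

# Hypothetical model for one marketer's offer (made-up features and weights).
weights = {"visited_running_site": 2.0, "uses_fitness_app": 1.5}
bias = -3.0

# Hypothetical anonymous users with binary behavioral features.
users = {
    f"user{i}": {
        "visited_running_site": float(i % 2 == 1),
        "uses_fitness_app": float(i % 5 == 0),
    }
    for i in range(200)
}

scores = {uid: logistic_score(weights, bias, feats) for uid, feats in users.items()}
targets = top_fraction(scores, fraction=0.01)  # only these users would see the ad
```

The point of the sketch is the shape of the process: every user gets an individual score from granular signals, and the audience is defined by rank rather than by a coarse label.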
What are the biggest areas of opportunity / questions you would like to tackle?
Another great area to develop technology in data science is around measuring causality. We need to go beyond the correlation vs. causation debate and say that we can actually assess whether a certain interaction truly has a certain effect, and why. If these technologies turn out to work, we can bring them over to medicine and apply them in testing environments on almost an individual basis. If you think about personalized medicine, it’s very similar. There used to be just one medication for one diagnosis, and if you look at what people are doing in cancer research today, there are specific medicines being developed for you personally: the medication that will work best for YOU. I see this across many different industries. We’re moving away from broad consumer descriptions and focusing in on the individual and the best that we can do for them. This is reflected in our ability to work with a lot more data. Dstillery does this for advertising, but the implications of the technologies we’re creating have much broader applications.
What are your favorite tools / applications to work with?
We use a lot of NoSQL solutions such as Cassandra. On the other end, where the data science lives, we have a Hadoop stack with Hive on top of it, which gives the data team access to the data. We record every single piece of data that hits our servers; the order of magnitude is around a terabyte. We want to give our data science team easy access to this incoming data so they can build models on top of it. Our whole system is pretty much automated. The modeling part is logistic regression with new techniques added. We usually pull from academic research and implement our own version of that in the product. The contribution of using this in an industrial setting is to really achieve robustness and figure out how we can automate the whole process. We’re building thousands of these models a day. In the earlier days, when I was a data scientist at IBM, I would have around three weeks to build one model, and now we’re building thousands of models per week. I’m designing a process that builds solutions based on machine learning. Our data science team uses Python, R and Perl when prototyping, and Hive and Cassandra to pull the data out of Hadoop.
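To make "logistic regression, built automatically, many models at a time" a little more concrete, here is a minimal Python sketch that fits one small logistic regression per data set in a loop. It is only an illustration under stated assumptions: the campaign names and toy data are invented, the fitting method is plain batch gradient descent, and none of the "new techniques" mentioned above are represented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(rows, labels, epochs=200, lr=0.5):
    """Fit weights and a bias by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(rows, labels):
            # Prediction error for this example: predicted probability minus label.
            error = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias

# Hypothetical toy data: one tiny training set per (made-up) campaign.
campaigns = {
    "campaign_a": ([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]], [1, 1, 0, 0]),
    "campaign_b": ([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]], [0, 1, 0, 1]),
}

# The "automation" here is just a loop: the same procedure applied to every data set.
models = {name: train_logistic(rows, labels) for name, (rows, labels) in campaigns.items()}
```

In a production setting the loop body would also need validation and monitoring for each model, which is where the robustness concerns described above come in.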
What publications, websites, blogs, conferences and/or books are helpful to your work?
I personally enjoy face-to-face interaction, so I really enjoy conferences. My staple conference has been KDD (kdd.org), a 20-year-old academic conference with a very strong applied track. They have papers coming in from medicine, manufacturing, energy and advertising, and I find it very rewarding to keep up with what the academic world is doing and what the state of applications is in the industry. If you were looking for something more theoretical, you would go to NIPS, ICML or ECML. Industry conferences such as O’Reilly’s Strata and Predictive Analytics World are great for networking and for the state of implementation in the industry. There’s a new publication called the Journal of Big Data, which is trying to bridge the academic world around algorithms and analytics while also looking at the application side. The Machine Learning journal, the Journal of Machine Learning Research, KDnuggets and Data Science Central all have some interesting articles.
Outside of Dstillery, what other interesting projects have you been working on most recently?
I share my knowledge by teaching at NYU and also by speaking at many industry conferences. Last year I organized a data mining conference in New York City with 2500 participants around the notion of data science for social good. We brought together nonprofits and matched them with individuals with analytical skill sets to help them solve important problems. As you know, there’s a great demand for people with data science skills and, increasingly, a lot of supply. However, the big problem is actually matching those skills with the right jobs. As long as the title “data scientist” is not clearly defined, hiring managers aren’t really sure what skill set they are looking for and companies struggle to express what they want. There are very different models of the role. For example, there are the Facebooks and Googles who want engineers with data understanding, and on the opposite side, at Dstillery we’re looking for mostly data understanding with some coding ability. A very different type of person would qualify for each of these two types of roles. I’m passionate about connecting people to the right positions where they can thrive and be influential.
Any words of wisdom for Machine Learning students or practitioners starting out?
Do something you really love to do.
There’s no point in doing something just because it’s popular at a certain time. Especially with data, there are so many things you can specialize in; it’s really up to you to find out what gets you excited. For me it’s the detective game of finding out what’s going on in my data and the competitiveness of being able to build the best possible model. You need to find a good balance between what you’re really good at and what you love to do, and that excitement is what actually convinces me in an interview to hire you.
Claudia – Thank you so much for your time! Really enjoyed learning about the work you’re doing at Dstillery.