Perspective

Data Science at The New York Times

Ann Spencer | 2019-07-09 | 40 min read


Chris Wiggins, Chief Data Scientist at The New York Times, presented "Data Science at The New York Times" at Rev. Wiggins advocated that data scientists find problems that impact the business; re-frame the problem as a machine learning (ML) task; execute on the ML task; and communicate the results back to the business in an impactful way. He covered examples of how his team addressed business problems with descriptive, predictive, and prescriptive ML solutions. This post provides distilled highlights, a transcript, and a video of the session. Many thanks to Chris Wiggins for providing feedback on this post prior to publication.

Session Summary

In the Rev session, "Data Science at The New York Times", Chris Wiggins provided insights into how the Data Science group at The New York Times helps the newsroom and the business stay economically strong by developing and deploying ML solutions. Wiggins advised that data scientists ingest business problems, re-frame them as ML tasks, execute on the ML tasks, and then clearly and concisely communicate the results back to the organization. He advocated that an impactful ML solution does not end with Google Slides but becomes "a working API that is hosted or a GUI or some piece of working code that people can put to work". Wiggins dove into examples of applying unsupervised, supervised, and reinforcement learning to address business problems, and noted that data science, data engineering, and data analysis are distinct groups at The New York Times. The data science group, in particular, includes people from a "wide variety of intellectual trainings" including cognitive science, physics, finance, applied math, and more. Wiggins closed the session by noting that he looks forward to hiring from an even more diverse pool of applicants.

A few highlights from the session include:

  • Defining the data scientist mindset and toolset within historical context
  • Seeing data science as a craft where data scientists apply ML to a real world problem
  • The importance of data scientists having analytical technical skills coupled with the ability to clearly and concisely communicate with non-technical stakeholders
  • Assessing whether a business stakeholder is trying to solve a problem that is descriptive, predictive, or prescriptive, and then re-framing the problem as unsupervised learning, supervised learning, or reinforcement learning, respectively
  • Diving into examples of building and deploying ML models at The New York Times, including the descriptive, topic modeling-oriented Readerscope (an audience insights engine); a predictive model of who is likely to subscribe or cancel their subscription; and a prescriptive example via recommendations of highly curated editorial content

For more insights from this session, watch the video or read through the transcript.

Video

Transcript

Chris Wiggins:

I have about 30 minutes with you. I'm going to try to tell you all about data science at the New York Times, and in case I run out of time my email address and my Twitter are here. Feel free to email me. If you don't remember anything else, just remember we're hiring. No, I'm just kidding. Well, we are hiring but... my talk is going to be sort of Pete [Skomoroch]'s talk and Paco [Nathan]'s talk sort of crammed into applications. Many of the things that I'm going to talk about today hopefully resonate with things that you heard either in Paco Nathan's talk or in Pete's talk earlier today.

As advertised, I split my time between Columbia and The New York Times where I'm leading the data science group. Oh, by the way, if you tweet at me I should warn you now I have sort of a 'Dark Forest' relationship with Twitter which means that I read but I don't write so if you're going to tweet at me tell me your email address so I can contact you on some other channel. Data science. Data science at The New York Times.

First, I have to tell you what we do when we do "data science" at The New York Times. Different companies mean different things by "data science". For example, Facebook a few years ago just relabeled all their "data analysts" to be "data scientists". When somebody says they "do data science" it could mean a lot of different things. Here is how we think about the mindset and the toolset of data science at The New York Times. Because I'm an academic I like to look at the original founding documents. Here is an ancient document about data science, specifically a piece by Jeff Hammerbacher. How many people here know Jeff Hammerbacher? Only a few people have heard of him? How many people have heard of Jeff Hammerbacher? All right. Jeff Hammerbacher started data science or data infrastructure at Facebook. How many people have heard of Facebook? Good. He went to college with a guy named Mark Zuckerberg. He went into finance for a year, then he was super bored. He called up Mark and said, "Mark, you got anything going on?" And Mark said, "Yes, we've got a lot of data. Please help us make sense of it." He went to Facebook for about four years, then he retired. When he retired in 2009 he had some time on his hands. He put together this lovely collection of essays. Therein he says,

"At Facebook we felt like different job titles like research scientist, business analyst didn't quite cut it for the diversity of things that you might do in my group. A "data scientist" might build a multistage processing pipeline in Python, design a hypothesis test, perform a regression analysis over data samples with R, design and implement an algorithm in Hadoop, or communicate the results of our analyses to other members of the organization in a clear and concise fashion. To capture this, we came up with the title 'data scientist.'"

I really love this paragraph.

It's a really fun paragraph to look at a decade later. Python has gotten sufficiently weapons grade that we don't descend into R anymore. Sorry, R people. I used to be one of you but we no longer descend into R. The Hadoop is definitely happening but it's Google's problem because, after building our own Hadoop-on-iron solution and dealing with Redshift for a while, we now just gave it all to BigQuery. There's a distributed MapReduce happening in the background but we don't need to know. As far as we know, we feel like we're typing SQL. But communicating the results to everybody else in a clear and concise fashion, that is definitely something I require of people that I hire in the data science group. I want data scientists that I can drop in a room with somebody from product or marketing or somebody who does not speak calculus, and that speaks to a lot of the things that Pete said. I do not want a product manager being the information bottleneck between people who are supposed to do some research and develop a product that is useful and somebody who's going to be the end user. We want people who can communicate in a clear and concise fashion to the rest of the org.

The other ancient and canonical document for trying to explain to people above me what we do in the data science group is this one. How many people have seen this Venn diagram? Good, okay. This is a very data sciency crowd. You might even know Drew Conway from back when he was merely a graduate student. He was trying to explain what he was hearing at a Strata planning meeting when people were talking about data science.

I still believe that data science is the craft of trying to apply machine learning to some real world problem.

Again, the person from marketing or product does not come to you and say, "Please minimize this hinge loss with a polynomial kernel." The person who's a real live person has a real life problem like "I'm trying to figure out how to get more people to subscribe to my product" or "I'm trying to get people not to churn and leave my product". They come to you with some real world problem. It is your job as the data scientician to figure out how to re-frame the real life problem as a machine learning task, execute that machine learning task, and communicate the results to them in a way that is useful to them. Even better if you are what Eric Colson would call a "full stack" data scientist and you ship to prod, which means you actually push code that results in a live API, a live GUI, something that somebody else can put to work, and that is how data scientists really can add value in my experience. We may feel like those are the ancient documents of data science.

As Paco introduced this morning, there are some even earlier ancient documents in mindset and even with the phrase "data science". One document is from the heretical statistician, Bill Cleveland. Like many heretical statisticians of his time he was at Bell Labs. Bell Labs really evidenced the mindset of John Tukey, as Paco introduced you to this morning, and now I have several pairs of Tukey swag socks that I can give to my friends. Bill Cleveland and many other statisticians at Bell Labs were really living in the Google and Facebook of the day. They were a government-tolerated monopoly. They had all of the nerds, all of the computers, and all of the data about what all of you are saying to each other all the time. They had already figured out a lot of the lessons about how you can take data exhaust --- sort of a common phrase; Pete was alluding to the way that a product produces a bunch of useful data --- and put it to work. They had all of the telecommunications for this country, which means plenty of other data, but that's another talk. Anyway. Bell Labs basically was the "data science" of the day. In 2001, Bill Cleveland writes this article saying, "You are doing it wrong." The people to whom he was speaking of course were not other people in industry. He was speaking to academics, getting back to Paco's earlier point about data science really being a mindset invented by industry. This was one of several such articles, but that's another talk.

We also try to help our executives understand the difference between words that are used interchangeably. People use the words AI, ML and "deep" interchangeably. We try to use them in different ways. You can of course have AI without ML, right? If you write a chatbot like Eliza there's plenty of AI here. "AI" is the aspiration. I want to make something that imitates a real intelligence. But it's not necessarily something that's going to learn from data. If you're into this kind of history, and hopefully Paco whetted your appetite for some of the history, I encourage you to look at other foundational documents like this great piece by Herb Simon. Herb Simon wrote this piece in '83 saying, "Who are all these people getting in the way of my AI and putting their ML in my AI?" Herb Simon of course was no fool. He's one of the few people to have a Nobel Prize and a Turing Award, and his idea was, "All of you people are being lazy and you should just program it. Don't try to get the computer to program it for you." Anywho. AI and ML are not necessarily the same. AI is an aspiration, ML is a method to try to get you some 'I'.

Let's get back to data science. Now that I've told you a little about how I think about data science and how we think about data science at The New York Times, you might wonder what any of this has to do with The New York Times. That is a reasonable question to ask, particularly if this is what you think of when you think of The New York Times. Here is a picture of The New York Times on its birthday in 1851, and for the vast majority of its lifespan this is pretty much what the user experience of interacting with The New York Times looked like. You may wonder where you would put a data science group in The New York Times.

To try to explain to you where my group sits, I have to show you the org chart of The New York Times. Here's the org chart of The New York Times.

Church. Church are the people who possess and defend the craft of journalism. State is everything else.

I am on state.

I want to make clear that I'm leading a team that does not produce those awesome infographics. Those people are great. We're friends, we're buds. Props. Not my people. We're not doing things with data journalism. We consult with the journalists, we try to be useful when we can be helpful but those people have deadlines, and I'm already submitting papers to NeurIPS and other stuff. I've got enough deadlines in my life. I don't want to write journalism. Instead we're doing things that you do not see.

We're doing things to try to help the business be economically strong. We are solidly on state and in particular, we are in a subgroup called data. When I showed up I was reporting to the head of BI, who reported to the CTO, who reported to the CIO, who reported to the EVP of digital products, who reported to the CEO. At this point data is a function that reports directly to the CEO. In the five years that I've been at The New York Times...six, five and a half...data has become closer to a first-class citizen and in fact reports to the CEO.

Part of the goal of the data science group is to be useful throughout church and state, in that there are ways that data, in particular, can be useful to editors, to people in advertising, and to all sorts of different projects. I'm going to show you some examples of projects where we've put data science to work. A tagline for the group is "to develop and deploy machine learning solutions to solve newsroom and business problems." By that I mean you might have to expand slightly the machine learning method beyond what you find in scikit-learn. It's useful to know machine learning deeply enough that you know how to go beyond what's in scikit, and you might deploy it. Well, let's hope you deploy it, right?

Every now and then, we produce something that ends with Google Slides, but our goal is to produce something that becomes a working API that is hosted or a GUI or some piece of working code that people can put to work. That's how I think about the tagline when I'm talking to people outside the group. When I'm talking to people outside the group I also try to explain what things my group does and what other things the much larger group of data analysts do. One of the ways I frame that is, "Are you looking to build a predictive model? or a prescriptive model? or a descriptive model?" Different problems that I will show you today...I think about in those different ways. I've found it a useful framing when I talk to people outside my group, people from product or marketing or editorial: "are you looking for a description? a prediction of what's going to happen in the absence of treatment? or are you looking for me to help you decide on the optimal treatment in order to get the outcome you want?" Inside my group I say, "Actually these are unsupervised learning, supervised learning, and reinforcement learning, but don't tell anyone outside the group because then their eyes will glaze over and they'll think we're speaking some sort of Star Treky language." But this is something that people can sort of understand.

The other way to think about these problems is in the lifecycle of some sort of feature, marketing campaign, or something. Sometimes people will come to me and they'll give me a data set and they say, "Find me the NASCAR moms or some other segment." And I'll say, "Okay. You're really asking me to describe this data set back to you, but why do you want to find the NASCAR moms?" "Well, I really want to know which ones are going to churn or which ones are going to subscribe." "Okay, then you really don't want a description. You really want a prediction. You really want to know which one of these people is going to become subscribers." And they say, "Well, no. I want to know which ones I should give this offer to and which ones I show the blue button to." And I say, "Okay. What you really want is a prescriptive model. You really want...you really already know what actions you want to perform and you want me to help you figure out what's the optimal treatment, not what's going to happen in the absence of treatment." Getting people to go through and think through those things often relates to how far a business partner has thought through what they really want to do.

Now of course a lot of this looks like science.

If you spend any time in science you know the same rules apply, which is you build something, you build a prediction, and then you put it in the wild. It doesn't work and then that helps you re-understand what was wrong about your mental model of the data or what was wrong about your mental model of the people. Good. What are some examples of description? Let me show you some examples of problems where we've been able to build descriptions of data that have been useful.

One is this working GUI called Readerscope. Readerscope is a tool that tells you who reads what where. The "what" is a topic model. The "who" involves third party data that claims this person might be a decision maker or somebody who's a NASCAR mom or something like that. The "where" is geolocation. For us, you know, this started out as like a science project with some fancy topic modeling and then we showed it to our friends in advertising and they said, "Sweet. We can monetize that." Suddenly we had two project managers, two front end developers and a backend developer and it became this beautiful thing that no longer looks like a science project. Instead we show it to people from like the marketing department of some other company that's considering paying for the pixels of The New York Times and they say, "Who even are your readers and what even do they read?" And we say, "We can show you that.

"Here is a tool for helping you understand who the people are and what topics they over-index on. You can mouse over these various topics and see words associated with the topics." It helps people understand who our readers are. Again, that is a description, though, rather than a prediction about what's going to happen, for example.

Then we can drill down and say what are the individual articles that over-index for that group or for that topic. We saw that you can understand people by the who, the what, or the where. Okay. That is an example of a descriptive tool.
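Readerscope's "what" is a topic model over article text. As a rough illustration of that descriptive piece, here is a minimal topic-modeling sketch with scikit-learn; the toy corpus, the number of topics, and every parameter here are illustrative assumptions, not The Times's actual pipeline.

```python
# Minimal sketch of the "what do they read" piece: fit a topic model to article
# text and list the top words per topic (the mouse-over view described above).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

articles = [
    "senate passes budget bill after long debate",
    "new ramen shops open across brooklyn this fall",
    "quarterback injury shakes up the playoff picture",
    # ...in practice, the full article corpus
]

vectorizer = CountVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(articles)

lda = LatentDirichletAllocation(n_components=3, random_state=0)
doc_topics = lda.fit_transform(X)  # per-article topic mixtures

terms = vectorizer.get_feature_names_out()
for k, component in enumerate(lda.components_):
    top_words = [terms[i] for i in component.argsort()[-5:][::-1]]
    print(f"topic {k}: {top_words}")
```

The "who" and "where" views then come from joining those per-article topic mixtures against the third-party audience attributes and geolocation data Wiggins describes.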

What is an example of a predictive tool? Well, if you work with any digital subscription service the first thing you can imagine building is a model that predicts which of your users are going to become subscribers and which of your subscribers are going to cancel their subscription. And in fact, my very first day -- I ended up being at The New York Times by way of a sabbatical -- the very first day I was there...I still have the email that I mailed at 3:00 p.m. that afternoon to my then boss saying, "We really need a model that predicts which of our subscribers are going to cancel." It's just table stakes to have a model that predicts which subscribers are going to cancel. Of course if you get something that has high predictive power, that's nice. You can sleep at night as a data scientician and you know you're not building a random number generator, but the people from product, they don't want to know just that you can predict who's going to be at risk. They want to know what the risky behaviors are. Of course we strive to build models that are not only predictive but also interpretable.

Hand to God, I had this slide before Paco's talk today but this is a point well made by Leo Breiman. If you don't know about Leo Breiman: he's no longer with us, but Leo was a good example of another heretical statistician, a West Coast heretical statistician who showed his bona fides as a proper mathematical probabilist, wrote a fancy book and then walked away from a tenured job at UCLA to walk the Earth for a while doing consulting for a variety of clients including the brand-new EPA and a bunch of other people who were trying to figure out the difference between different foreign submarines and things like that. En route he built random forests and a bunch of other things. And near the end of his time with us, he went back into academia to Berkeley to try to explain to his fellow statisticians, "You're doing it wrong." This is one of his papers about "you're doing it wrong", where he talked about the algorithmic culture that he was observing in the machine learning community versus the generative model community that was more traditional in statistics. He also, because he had done time in the real world, knew that sometimes people want to know which customers are at risk and sometimes they want to know what the risky behaviors are. Sometimes people want a prediction and sometimes they want an interpretation. As he said in this section on the Occam dilemma, "forests are A+ predictors," but he goes on to say "They're rated an F for interpretability," whereas decision trees rate an A+ on interpretability. Long before the recent resurgence of interest in interpretable machine learning, Leo and other people who understood the balance between prediction and what people want were on it. That's a good example of a time when the client wants prediction but also wants some interpretability.
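To make Breiman's point concrete, here is a small sketch on a synthetic churn dataset: a random forest for predictive power next to a shallow decision tree that can be read back to a product partner as a list of "risky behaviors." Every feature, coefficient, and label below is made up for illustration; none of it is The Times's data or model.

```python
# Sketch of the prediction-vs-interpretability trade-off on synthetic churn data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 5000
# Hypothetical behavioral features: visits per week, days since last visit,
# articles read last month, and whether the reader hit the paywall.
X = np.column_stack([
    rng.poisson(3, n),
    rng.exponential(10.0, n),
    rng.poisson(8, n),
    rng.integers(0, 2, n),
])
logits = 0.12 * X[:, 1] - 0.5 * X[:, 0] - 0.08 * X[:, 2] + 0.9 * X[:, 3] - 1.0
churned = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))

X_tr, X_te, y_tr, y_te = train_test_split(X, churned, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Breiman's point: the forest is typically the stronger predictor, while the
# shallow tree is the one you can explain to a non-technical stakeholder.
print("forest accuracy:", round(forest.score(X_te, y_te), 3))
print("tree accuracy:  ", round(tree.score(X_te, y_te), 3))
print(export_text(tree, feature_names=[
    "visits_per_week", "days_since_last_visit", "articles_read", "hit_paywall"]))
```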

The New York Times, unlike many digital media companies, still distributes the news in the form of dead trees. We ship dead trees to different stores and somebody has to decide how many copies of the dead trees should go to store number 1066 tomorrow morning, Starbucks number 137 or something like that. There are many ways you could do that. Hypothetically, you could use about 10,000 lines of heuristics encoded in COBOL on an AS400 machine from the 1990s. Or you could use science. We are now using science for that. Specifically, we have machine learning models that infer the likely demand distribution at different stores. Once you know the distribution, then it is an old problem sometimes called the newsvendor problem. If you take a first-year graduate operations research class, getting the most profitable allocation given a demand distribution is called the newsvendor problem. We are a newsvendor, so we have that problem. Moreover, we can use science in the sense of careful A/B tests, where we give different stores allocations from different algorithms, and then I can scientifically prove to the CEO how many data scientist salaries I am saving every year by using science. Way at the deploy end are somebody's hands throwing a stack of dead trees into a store, but way, way up high beyond the BigQuery is scikit-learn. Somewhere up there Python is allocating some number of papers that eventually gets deployed as a product in the form of somebody throwing dead trees. Good. Other fancy prediction problems.
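The newsvendor step itself is compact once a model has inferred a store's demand distribution: stock up to the critical fractile of that distribution. A minimal sketch with made-up costs and a made-up Poisson demand estimate (not NYT figures):

```python
# Classic newsvendor solution: choose the stock level at the critical fractile
# of the inferred demand distribution. All numbers here are illustrative.
from scipy.stats import poisson

unit_cost = 0.50    # cost of printing and delivering one copy
unit_price = 2.00   # revenue from selling one copy

underage = unit_price - unit_cost   # profit lost per copy of unmet demand
overage = unit_cost                 # loss per unsold copy
critical_fractile = underage / (underage + overage)

# Suppose the ML model infers demand at store number 1066 is roughly Poisson(40).
demand = poisson(mu=40)
stock = int(demand.ppf(critical_fractile))
print(f"critical fractile = {critical_fractile:.2f}, send {stock} copies")
```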

Another problem The New York Times has...also associated with dead trees...is putting ink onto dead trees. When a photojournalist takes a picture and then sends it to the editors of The New York Times, which happens thousands of times a day, some of those pictures are going to end up becoming things that are in print. The file comes out of a camera or a phone or what have you or a battery-free surveillance device built into the wall. Who knows? Then a very careful and patient editor has to go through and re-balance all of the color histograms, all the CMYK, until it comes out to exactly the right photo balance, because otherwise you get just a black square when it eventually becomes ink and it goes to the printing press.

En route, in the form of exhaust, we have an awesome dataset of before and after a patient editor did that, and now we can basically give them a warm start. We can say, "Look. Here's the picture...here's the file that came out of the photojournalist's camera. Here's our suggestion for how an infinitely patient editor is going to do it." And using deep learning we can even go beyond that and we can say, "Here's how editor number 12 is probably going to re-balance it," versus, "Here's..." You can actually learn different editor styles if you have a large enough data set. The awesome astrophysicist who did this has recently moved on and joined Google's AI group, unfortunately for me, but he did a great job using Google's tools to put this thing to work. And it works. It's a visual problem so it works both in our MSE and it works by your eyeballs. But here's another one that works and it works and it works.
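The framing here is image-to-image regression on before/after pairs. Purely as an illustrative sketch (not the actual model, architecture, or data), a tiny convolutional network in TensorFlow trained with the MSE loss mentioned above might look like this:

```python
# Illustrative sketch: learn a mapping from raw photo files to the editor's
# adjusted versions, given paired before/after images. Shapes, layer sizes,
# and variable names are hypothetical.
import tensorflow as tf

def build_toning_model(height=256, width=256):
    inputs = tf.keras.Input(shape=(height, width, 3))
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
    x = tf.keras.layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    outputs = tf.keras.layers.Conv2D(3, 1, padding="same", activation="sigmoid")(x)
    return tf.keras.Model(inputs, outputs)

model = build_toning_model()
model.compile(optimizer="adam", loss="mse")  # trained on before/after pairs
# model.fit(raw_photos, editor_adjusted_photos, epochs=..., batch_size=...)
```

One way to learn per-editor styles, as described above, would be to condition the model on an editor identity as an extra input.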

Another project we built started as a summer intern project with an undergraduate who had worked with me at Columbia and then joined as a data science intern. It was sort of a science project to see whether you could actually predict how people would feel when they read different stories in The New York Times. The original thinking was maybe you could build a better recommendation engine or maybe you could do more sophisticated analytics or something if you knew not only the section of the story but also something like, "Well, this was a serious story or this was a really happy story or something like that." At first I was thinking, "Well, I fight with the data I have, not the data I want, so it would be nice to have that but I'm not going to make editors sit down and label stories as feeling happy, sad or what have you." But then I realized I could crowdsource it. We crowdsourced it. We actually got people to label a bunch of stories and say that they feel different feels, and then we used some deep learning models that would actually predict what feelings people would feel. For a story like this, "Cher has never been a huge fan but she loves being Cher," you can say, "Well, this is, you know, 'inspired' and 'happy' and 'amused' and a bunch of other feelings." It was sort of a funny summer project, you know, like a summer intern project for an intern who already had a full-time job waiting for her at Chartbeat, so it was just sort of a lark to see if we could do it. And then I showed it to my friends from advertising and they said, "Sweet. We can monetize that."
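The modeling here amounts to multi-label text classification over crowdsourced labels. Wiggins's team used deep learning; the sketch below uses a much simpler scikit-learn pipeline on a made-up toy training set, just to show the framing rather than the actual system.

```python
# "What will readers feel?" framed as multi-label text classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

headlines = [
    "Cher has never been a huge fan, but she loves being Cher",
    "Doctors misdiagnosed her illness for years",
    "A tiny Minnesota town rebuilds its beloved library",
]
labels = [["inspired", "happy", "amused"], ["sad"], ["happy", "hopeful"]]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)  # one binary column per feeling

model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(headlines, Y)

# Predicted probability of each feeling for a new headline.
probs = model.predict_proba(["She loves being Cher"])[0]
print(dict(zip(mlb.classes_, probs.round(2))))
```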

Suddenly we had a full on backend team that was going to hit our API and then turn it into an ad strategy, and now you can go sell premium ads based on how an article makes you feel. The logic being instead of saying, "I want my brand to appear next to an article from, you know, travel," it could be that you want your brand to appear next to an article that makes people feel 'inspired' or 'hopeful' or 'adventurous' or something like that. And we can do that.

When I say "it works and it works and it works", what I mean is it works in the sense that we get accurate predictions on test data so I can sleep at night. It works in the sense that people pay for it. It actually works as a business thing, we are making data scientist salaries by doing this thing. It works in the sense that advertisers who use it actually see more lift when they target things more strategically. Like, you have an image that's associated with some particular curiosity, you should put it next to articles that are going to make people feel 'curious.'

The other cool thing about that is it doesn't involve anything about people. There is no PII involved here. I'm not doing any analytics about people and claiming that somebody is a NASCAR mom or a business decision maker who's interested in God knows what. I'm just doing analytics of the text. It's better contextual-relevance advertising. It works and it works and it works. It's kind of fun. I hadn't crowdsourced before, but you get funny effects, like people would actually email you and tell you how they feel about the experiment. There's some fun computational social science to be done by figuring out, you know, how the different feelings relate to each other. Then the other thing that was weird for me is, you know, in none of my research at Columbia has somebody produced a pitch deck for the project afterwards, but now people actually...in The New York Times there's a group that goes and makes this pitch deck that shows how things actually work. We just won an award for it last week at an advertising group. None of my papers that I wrote at Columbia had somebody make a little moving pitch deck like this for me before.

This is a conference where people...we also talk about how we ship things, so I'll say a little bit about how we ship things. When I showed up in 2013...there was pain. If you wanted data it meant that you wrote Java MapReduce and you hit buckets of JSON sitting in S3. Then when things got really civilized you could speak Hive or Pig. Then eventually you could fight with the people who were maintaining the Amazon Redshift cluster. Then somebody decided that a good idea would be for us to buy our own machines and put Hadoop on those machines, which failed silently and required you to get closer to the metal and understand Java error logs. At this point, all of it is Google's problem; it's all sitting in GCP. The developers in the group, they write in Python; they leverage scikit-learn heavily. We have a continuous integration solution. For visualization we're not building our own dashboards. We're using Chartio. Everybody's happy with it. Then eventually the scheduled jobs are managed by Airflow and Cloud Composer. As an example, for this project in particular where we're predicting feelings, we also use TensorFlow...and heavily leverage scikit-learn as our solution. The feelings work in the sense that you can go and look and say, "Well, the story about homes in Minnesota made people feel happy; something about doctors making a misdiagnosis made people feel sad." It was kind of a reasonable set of solutions. Everything goes from the code base managed in GitHub via Docker to GCP, and then it becomes GCP's problem --- Google's problem --- which, again, like we're not going to out-google Google when it comes to that. So good.
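As a rough picture of what "scheduled jobs are managed by Airflow and Cloud Composer" can look like in practice, here is a hypothetical DAG for a model-refresh job. The DAG name, schedule, and task bodies are assumptions for illustration, not The Times's actual pipeline.

```python
# Hypothetical Airflow DAG: pull training data from BigQuery, retrain a model,
# publish scores. Runs daily under Cloud Composer in a setup like the one above.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_from_bigquery():
    ...  # e.g., run a SQL query via the BigQuery client and stage the results

def retrain_model():
    ...  # e.g., fit the scikit-learn / TensorFlow model on the staged data

def publish_predictions():
    ...  # e.g., write scores back to BigQuery or push them to a serving API

with DAG(
    dag_id="feelings_model_refresh",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_bigquery)
    train = PythonOperator(task_id="train", python_callable=retrain_model)
    publish = PythonOperator(task_id="publish", python_callable=publish_predictions)

    extract >> train >> publish
```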

Those are examples of description and prediction. What's an example of prescription? A lot of times people don't actually want to know what's going to happen in the absence of treatment. They want to know what the optimal treatment is. One good example of that is the audience development challenge. Audience development means when you're an editor and you hit publish, your relationship with the story is not over, because then there's still an opportunity to make sure people get the story. Again, for about 150 years, the user experience was a dead tree spread out on a table, and now people are experiencing the news in the palm of their hand. There's some strategy to be thought about in terms of how we push these things out on social channels. We built some fancy machine learning that would predict different levels of engagement --- for different definitions of engagement --- based on whether you promoted or did not promote on different social channels. That was great, but then how do I get that in the hands of the editor? Well, it turns out the editor is not going to fire up Python. We spent a while thinking about what sort of web app we could build, but from previous experience editors are not going to stop what they are doing and open up a new web app despite our awesome development skills in Flask.

However, in 2014, 2015 the editors were falling in love with Slack. Josh [Wills] from Slack introduced you to Slack yesterday. Then those of us who are not particularly good at interfaces other than the command line are set, because we just need to think about how to make our machine learning readable, legible that is. Editors can interact with this bot. The software engineer became a product person, descended into the newsroom, watched them do what they do, did all that sort of product research, and then built this bot that people could interact with. You can ask it questions like "what should I be posting on Facebook?" or you can ask it to interrupt you and say, "Seriously, you should be posting this on Facebook."

Another example of a prescriptive problem: for the parts of the content where there's personalization, how could we do this? Well, we probably don't want our editors deciding for every single election story, for example, is this election story of interest only in Ohio or is it of interest in Ohio but also Alabama? We'd rather just let go and let data...we built a contextual bandit solution for a small section of content: content that had been blessed by the editors as relevant to the elections, content that was going to be put in a controlled widget that was like an election box, and then we could let go and let contextual bandits. In the interest of time, I'm not going to teach you contextual bandits. It's like doing A/B testing except you never have to take a meeting afterwards. So contextual...like just think of bandits, multi-armed bandits, as the meeting killer. Instead of having a meeting afterwards to say, "Oh, I liked B but this other person likes C, and then the CEO likes option D," you just let go and let bandits, and you never need to take another meeting because the code dynamically upweights the thing that's winning for the KPI that you're interested in.
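The "dynamically upweights the thing that's winning" loop can be sketched in a few lines. The sketch below uses Thompson sampling, which is the 1933 technique Wiggins mentions next; the contextual version adds reader and article features, and the click rates here are simulated, not real data.

```python
# Thompson sampling on a simulated Bernoulli bandit: keep a Beta posterior over
# each option's success rate, sample from the posteriors, show the winner.
import numpy as np

rng = np.random.default_rng(0)
true_rates = [0.04, 0.05, 0.07]          # unknown click rates for options A, B, C
successes = np.ones(len(true_rates))     # Beta(1, 1) priors
failures = np.ones(len(true_rates))

for _ in range(10_000):
    sampled = rng.beta(successes, failures)   # a plausible rate for each arm
    arm = int(np.argmax(sampled))             # play the best sample
    reward = rng.random() < true_rates[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("plays per arm:", (successes + failures - 2).astype(int))
print("estimated rates:", (successes / (successes + failures)).round(3))
```

Options that keep winning get sampled higher and therefore shown more, which is exactly the "never take another meeting" behavior: traffic drifts toward whichever variant is winning on the KPI.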

There are many ways to do bandits. In the interest of time I won't tell you much about this except that our favorite is the oldest. The oldest is a technique from 1933, which is super fun: a bunch of people are beating their chests about their latest PAC bounds and their algorithms, and then the thing that wins is from 1933. By the way, here's a paper by some researchers at Yahoo back in 2011 where they benchmarked their fancy algo against this 1933 algo, and the 1933 algo wins, which is pretty cool. If some of you are academics and your papers aren't getting cited, don't worry. This is the citation count for this paper, which was published in 1933. Nobody paid any attention to it whatsoever until 2011. Suddenly it's like "Bam! That's the bidness!" You just need to wait 80 years and then your paper will be heavily cited. Yeah, and now you can get this thing as a service. I won't tell you about the fancy math that we did to extend it. Instead I'll just say a little bit that tries to echo things that Paco and Pete said about what we've learned about getting things done. The main thing that we've learned about getting things done, and this is really stolen from the military, is that if you want to change culture you need 'people,' 'ideas,' and 'things' in that order. That is a quote from US Air Force pilot John Boyd. We really...we weren't able to get anything done until we had buy-in from the right people, until they were willing to let us experiment and try things out. All of us get really excited about various 'things'.

I put this in before Pete's talk this afternoon, but Pete made reference to Monica Rogati's hierarchy of needs. You know, when you show up and you're the first data scientist you can do some sort of provocations by doing some fancy AI and ML, but if you don't have your data infrastructure correct then it is really difficult to actually impact process. In the beginning, five years ago, we were sort of doing provocations and building our own Flask apps, but actually integrating into other people's process required the company to really level up all its data infra, which it has.

In the interest of time I won't talk about this much except, to get back to my earlier snark about Facebook, that in academia we tend to conflate all of these jobs (data governance, data analyst, data scientist, data engineer), but The New York Times has separate titles and separate groups. We all try to play well with each other, including data governance, which is something Paco mentioned. We have a very good data governance group. A part of how we've done that is to hire people from a wide variety of intellectual trainings. People in the group at present come from physics, finance, applied math, double E, and cog sci, which has been really useful with the crowdsourcing among other things, and we're still hiring. We look forward to hiring people from an even more diverse pool of applicants. And with that, I have two seconds, which leaves me time for questions.

Thank you very much.

[Editorial note: this transcript has been edited for readability.]

Ann Spencer is the former Head of Content for Domino, where she provided a high degree of value, density, and analytical rigor that sparked respectful, candid public discourse from multiple perspectives, discourse anchored in the intention of helping accelerate data science work. Previously, she was the data editor at O’Reilly, focusing on data science and data engineering.
