*This is a Data Science Popup session by Hristo Spassimirov Paskov, Founder & CEO of ThinkFast.*

# Summary

Machine learning has revolutionized the technological landscape and its success has inspired the collection of vast amounts of data aimed at answering ever deeper questions and solving increasingly harder problems.

Continuing this success critically relies on the existence of machine learning paradigms that can perform sophisticated analyses at the data scales required by modern data sets and that reduce development cycle times by improving ease of use.

The evolution of machine learning paradigms shows a marked trend toward better addressing these desiderata and a convergence toward paradigms that blend “smooth” modeling techniques classically attributed to statistics with “combinatorial” elements traditionally studied in computer science.

These modern learning paradigms pose a new set of challenges that, when properly addressed, open an unexpected wealth of possibilities. Hristo discussed how ThinkFast is solving these challenges with fundamental advances in optimization that promote the interpretation of machine learning as a more classical database technology.

These advances allow us to scale a variety of techniques to unprecedented data scales using commodity hardware. They also provide surprising insights into how modern techniques learn about data, including a characterization of the limits of what they can learn, and ultimately allow us to devise new, more powerful techniques that do not suffer from these limitations.

# Slides

# Video Transcript

Thank you for having me today. I recently started a company based on my PhD work on massive-scale learning. I suppose this talk might be a little contrary to using the same methods. It’s all about looking forward and to the scales of data that we have.

What I want to start with is actually something that bioinformaticians love to tease computer scientists with, which is the deluge of genomics data that we’re getting today from advances in whole genome sequencing.

This is really a remarkable transformation that, in the medical community, they’re hoping to make part of the standard of care as a Precision Medicine Initiative for everybody. I think this is actually a real opportunity to tackle some of the major challenges and diseases that we see in medicine and have a whole new perspective on how to attack them. Before we get there, there is, really, a problem.

If you speak to a lot of the researchers and professors involved with these efforts, they’ll admit to you that, right now, they don’t have a good solution for how to store this data, let alone have the tools to analyze it in deep and meaningful ways. I think, coming up, we are going to see a massive crunch in terms of what we can do, in terms of the scale of what we want to do.

This isn’t just really getting into bioinformatics. I have a bit of a sample bias, I suppose, but I go around to lots of companies to discuss their massive data problems. One of the consistent themes I hear is that, “Gee, I have all this data, I wish I could deploy my sophisticated analysis to it, but it just doesn’t scale. I can’t access all of my data together.”

I think this is actually holding back quite a bit of progress. So in some sense, machine learning and statistics has been very successful in this modern era of getting people to go out and collect data, and try to solve really hard problems with it. But I think, right now, at this point, we are lacking in tools that can really scale to the promises we’ve made. To get at this a little further, I’d like to look into my data analysis state diagram.

The first part of this—and this is taken from my own experiences with modeling data—it says, given a bunch of data, hopefully, I get to ask for the specific data that I want, but usually, it’s just making do with what I’m given. I’ll decide what pieces of the data are relevant for the task at hand, and maybe consider which data sets I want to try to link up together in an effort to achieve my goals. So far, so good.

I’ll start simple. I’ll use very, very simple models that are scalable. I’ll use these to explore the issues with my data, and hopefully, the successes that I might have with it. But as the refinement cycle continues—for instance, I think this is really evidenced by some of the competitions out there—you start to hone in on what it is that are the nuances of your data. And you start to want more.

You start to want to model it better and better and more accurately. You start to want to pull together different kinds of data sources. And this is all fine and good, provided there’s a package out there that can actually achieve this goal. But I think a lot of these large-scale projects inevitably end in this quagmire, where now, they want to apply a new kind of learning methodology. Nothing exists that scales appropriately, and now we’re stuck.

You might turn to the optimization community, but you’re looking at, potentially, several years of optimization work to figure out how to scale this to what you want. That’s, generally, in academia, a fine thing if you want to do a PhD, but in industry, no one really has that time to devote. A lot of projects get stuck with subpar analyses, or simply not being done at all, because it’s not possible with what exists today.

I’d like to point out that it’s not just about training models. It’s not just about, as I think a lot of computer science machine-learning students are led to believe, what’s your training and testing error? There’s a lot more to what we want out of these models. And these demand quite a bit in terms of geometric understanding of the model itself.

Computationally, this is far more demanding. What we’re looking for are regularization paths, things that tell us how our models evolve as a function of certain regularization parameters that we might tune, and risk assessments that give us confidence bounds around the coefficients learned by the models and their predictions.

There’s also the issue of interpretability. A lot of times, in a lot of applications, it’s not necessarily the testing area that’s the most important thing, but what kind of information the model itself has picked up on. There are, really, two extremes of this. I have a number of people that I work with in finance who, if they’re going to bet a million dollars on something, they would like to understand why they should do this, and run it through their own internal knowledge.

The other side of this is in bioinformatics. One of our recent publications with Sloan Kettering resulted in a 35% improvement over the state-of-the-art for personalized medicine prediction for cancer. But what really got the biologists stoked on that paper was partly their improvement in accuracy. But what was even more was actually when we looked inside what the model had learned. We were able to interpret what was going on. And what we saw was that this model actually was linking chemotherapy drugs based on their mechanism of action.

This is a side-channel verification that really got people believing. And now, this is used in Sloan Kettering’s research groups. So at the end of my PhD, I read the elements of statistical learning to see what had changed internally inside of myself, just as I had read that book before I started my PhD. And I got it signed by my advisors, too, so that was cool. But one of the things that came out, one of the threads, was, actually, there’s a really cool history and progression there in machine learning. And I’m not going to cover all of the different paradigms that have existed.

You’ll see that I neatly left out neural networks and deep learnings, because I don’t want to get into that debate. But it all really starts with the simple models that we’ve had for many decades. Now, these are extremely simple things that we understand very well statistically and, as a result, we can scale very well too. But there is an issue. They’re very sensitive to things like featurization.

If you don’t have enough data, or if your features are poor, good luck. There are still some fields that, after decades of research, have not quite hit on that good feature set or have stalled. So I would argue that, as a result of this, we witnessed the bifurcation in the methodologies that came out of machine learning. And in retrospect, looking at them now, there’s actually a very nice complimentary division to them.

A lot of these methods, the kernel methods, were based on fundamental ideas of smoothness and continuity, versus tree methods, which are actually combinatorial objects in their own right. What I mean by this is, if you look at the conditions under which something like the kernel trick can exist, it really is quite limiting.

Now, it’s great, in that you can project implicitly to high-dimensional spaces, reason about your data on those, and actually get non-linear modeling in the original space that you live in. But this implicit nature is also its Achilles heel. You can’t, for instance, apply a Lasso regularizer with a kernel SVM. It’s intractable, and that intractability also makes it quite difficult to look at the model’s coefficients.

Counter to this is trees. Now, these are learning, essentially, conjunctions of rules that are then combined together. And they’re a very combinatorial object, in that the criteria that we’re trying to optimize is done via heuristics. And it, itself, has, generally, lots of pretty good answers.

This is what originally led to the instabilities that you see with trees. People have developed methods like bagging, boosting, ensembles, all of this stuff, to help deal with the high variance of, potentially, many solutions in this combinatorial space.

The next big shift that, I think, happened in the mid ’90s with the invention of the Lasso seems to really combine the best of both worlds in this regard. And I think this link only became apparent after the stats community managed to actually realize that, look, there is a link between what these ensemble methods are trying to do and the Lasso. They’re actually going towards the same criterion with tree features. And in doing so, this, what I would call the structure of regularization paradigm, was born.

I think this is actually one of the fastest evolving aspects of machine learning today. And it is truly empowering in what it can do. And what I find to be quite amazing about it is that the advances we’ve had have allowed us to both scale these machine-learning models to larger scales.

They come back by quite a few of the statistical guarantees that we’ve come to know from simple models. And they are actually quite easy to use, because they—provided you know which terms are relevant—because they’re asking us to model data’s qualities. And what I mean by this, structured regularization paradigm really are—is there a laser pointer on this?—is really machine learning that can be expressed, generally, as a convex optimization problem of the form of an objective that decomposes into two fundamental terms.

The first is a loss. It encodes what your goals are—simple things like regression and classification to more sophisticated things, like feature embeddings, Word2vec, that has become all the rage lately. This term generally carries with it the classical, continuous, smooth part that we’re very accustomed to. But what has changed is the realization that, actually, the regularization we employ, the prior beliefs about the data, about the qualities that our data possesses, can now be expressed in much better terms.

This started with a Lasso, something like the L1 regularizer, which basically says, I don’t think a lot of my features are necessary. And what this suddenly allowed us to do was throw obscene amounts of features at the machine learner and have it figure out which ones it actually wants to use.

We can now model in other, very powerful ways—for instance, taking into account that our data has spatial or temporal structure of some kind, or even that we should pull together different kinds of data sources, whether we know how they are related or not. And all told, there’s actually a combinatorial explosion here of what we can do. And I think we’re still, to a large part, just investigating this landscape.

Now, to convince you that there is actually a combinatorial element here—and I think, unfortunately, a lot of what we see in optimization today that really holds—these are the solvers that are performing the learning—do not address very well. Most of our techniques, things like gradient descent, and all kinds of other methods, are really based on optimization for continuous methods.

The optimization community has had a hard time dealing with sophisticated continuous functions, plus sophisticated combinatorial functions. And I would argue that this has really been something that’s been holding us back. But we need to address this combinatorial nature in this new learning paradigm.

To give you a glimpse as to what lies underneath the hood of a Lasso problem, or a generalized Lasso problem, is this graph, which actually shows a regularization plot. As we vary the regularization parameter for a generalized Lasso, we plot how its coefficients evolve, too.

What you’ll notice is that all of them actually converge to integer values. When you look at the vector of coefficients that you get, they’re all just going to be integers between 0 and 4, in this case. That’s pretty wild, because this kind of structure really is combinatorial. What this is showing us is that, sometimes, for certain feature matrices and labels, you might solve a Lasso, but in doing so, you might solve an integer program. That’s something that we need to address a little better.

My work has largely been in focusing, how do we solve these things at these massive scales? And after spending years and years trying to figure this out, I’d like to give you something that we’ve converged on, that is really the basis for my company, of what I call the database perspective.

In the right level of abstraction and modularity, mathematically, actually shows us that we should look at these as if there were simple databases or machine-learning systems. And to guide this, I’m going to use, from our running example, the subdifferential equation for the Lasso.

If you’re not familiar with this, don’t worry. I’m not going to test you on it. I am, however, going to go through the individual terms and show how they correspond to our database.

The first part of this, of course, is our data. This starts with our labels and our feature representation. And this comprises the lowest layer of our database. It’s from these feature matrices, x, and the labels, y, that you see highlighted. And in this example, for concreteness, pretend that we’re trying to do some kind of natural language-modeling task involving data from Yelp and Amazon reviews and Wikipedia, perhaps, for context.

The next layer that sits on top of this are the operations that we need to access our data. These generally include things involving multiplying our feature matrix or its transpose, and various non-linear operators.

The final layer on top is the actual query language, the machine-learning objective that we specify and try to solve. I would argue that this is really its own language, now, because of the combinatorial explosion that we’ve seen, in that earlier slide, of all the different choices we can have, for the loss that we might employ, really any kind of combination of regularizers that we might employ, and then, of course, the data sets that we want to hook together.

For instance, in this example, we may want to have—let’s say it’s a sentiment analysis problem—we may want to have some kind of ordinal regression for our Amazon and Yelp reviews, but we may also want to include Wikipedia—say, predicting its link structure—for context.

The way we’re going to fuse these models together is via, in this case, a nuclear norm regularizer, so that we can share a common language model. We don’t have to learn separate models for English for each of these databases. And you can see how, all of a sudden, even though I could model these tasks together, I can choose to model them—sorry, if I can model these tasks separately, I can model them together.

Now, there’s, really, this combinatorial explosion of possibilities. And the query language is what allows us to investigate these different properties and which ones we ought to use.

The final result of this is the query, or the machine-learning model itself. And this is actually something we need to add to our lowest layer of how we store this efficiently.

The key point about this perspective is that, time and again, in trying to figure out exactly how to solve these things at scale, every time we’ve come up with an abstraction, every time we’ve come up with some little trick that gives us mathematical modularity, it always brings us back to this database perspective.

I think this is a very powerful perspective to take, going forward, for figuring out how to solve these things. I would have loved to have given you a long mathematical lecture on some of the cool tricks that we’ve found, but my wife told me, don’t do this, everyone’s going to fall asleep, it’s 5 o’clock. Instead, I’m just going to show you some of the results that we’ve been able to obtain.

Our perspective in designing these algorithms has always been that we want to make faster algorithms. Now, classically, you think of memory and processing trade-off when you design various algorithms, but with machine learning, and, in particular, via this database perspective, actually adds another dimension. And this is super cool.

It adds mathematical structure to our axes that we can play with. I believe that, as a machine-learning community, and as the optimization side of it, we’re not really addressing the full mathematical structure that exists there. There’s more to the picture. Generally, we’re fairly happy with just taking a gradient, taking a linear approximation, and letting that be.

But when you play along this axis, you can get extremely substantial savings, both in memory and processing time, by pushing the complexity into the math realm. But this does something, actually. You no longer can use GPUs to scale up your code.

This is because the operations that you’re doing in the code, itself, now involves branching. It involves quite sophisticated operations, which pay huge dividends computationally, but now require quite a bit of intelligence. And this is where, really, in my mind, the future of machine learning is actually going to go back to CPUs, and x86 architectures, and such, to address the scales that we need.

This is where our partnership with Intel has been invaluable, too. They’ve given us a lot of really cool hardware to play with, both CPUs and Xeon Phis. This is where we’ve been able to derive some of our big results from.

I’ll just give you a couple of samples of what we’ve been able to do. This is one of the big efficiency gains that we’ve been able to do with our feature storage layer. This is for simple and ground models, where you just count the occurrence of engrams for text. And I hope you can see this all right. But this is comparing the size of the feature representation as the maximum engram length increases.

For the first graph on the left, we’re looking at the database of all Amazon movie reviews—this is several million reviews—and treating them at the character level. So here, long engrams are, actually, particularly useful, because you want to encompass multiple words. And what we found with our representations is that, actually, we can get savings up to 50 times by being careful about the mathematical structure of these feature representations and leveraging it very carefully.

The savings get even more extreme when you move into the bioinformatics realm. There, engram models are pretty much never used, because they’re considered computationally intractable, up until now.

We, in this realm, can get, essentially, quadratic savings over memory. And in this particular data set with the 1,000 genomes project, we are able to get savings of over 10,000 times, in terms of our memory requirements.

Computationally, this pays huge dividends, too, because our operations are actually quite simple, in that they’re highly parallelizable, but they have branching instructions inside of them—but parallelizable nonetheless. They’re also fairly streaming through the data, which means that you could, if you wanted to, if you didn’t have a lot of money, or maybe you just had a ridiculously-sized data set and couldn’t pull it all into RAM, just run this legitimately off of the hard drive, and just have our database accessing your hard drive, using it in place of RAM. Maybe you have really fast SSDs, in which case, you’re actually not going to see too much of a difference. The other extreme, though, is, because of the savings in memory of this and the parallelism, you could also pop this into a Xeon Phi with a large memory system in it, and really fly through your data.

We’ve actually performance-tuned some of the stuff to allow us to fly through billions of features, over millions of training examples, on the order of seconds with a Xeon Phi, which is pretty spectacular and unintuitive that you could do this.

The other kinds of savings that we can get come from looking at optimizing the machine-learning query language that sits on top. Here, I’d like to—so this is a little bit abstract, perhaps, but this is leveraging the fact that, generally, when you’re experimenting with data, you’re actually solving the model multiple times. So we really should store that.

We’ve actually figured out ways to store models in trivial amounts of space, relative to the amount of data you have. So it’s not a problem. We can basically store every single model that you come up with efficiently. And you should really use these when you try to solve for a new model.

For instance, in this particular example, I’m looking at the regularization path between a ridge regression and a Lasso problem. In between there is where I’m solving an elastic net, where I’m trading off between my L1 and L2 regularizers.

Each of the red dots represents a model that I’ve actually solved for. Now, if I query, if I want to know what’s my testing are likely to be like for the model represented by that red arrow, you actually could probably give me a reasonable guess using the models that you’ve already trained on. And you can give me some kind of bound, at least. So then, I can determine, should I really try to solve this, or no?

What’s kind of wild, though, is, actually, if we look into the mathematical structure of these problems, we can reason about problems that are, on the surface, not similar at all to what we’ve solved. For instance, maybe in this case, you are curious about, should I use some kind of epsilon-insensitive regression loss to help me deal with noise in my data?

The solution to this problem actually came about from discussing a problem with my wife, who is a bioinformatician. And she had this large data imputation problem on her hands. And so there was an existing package for regression out there with a regression loss that she could use, and life was good. But her data was ordinal. So she asked me, well, do you think it’s worthwhile if I switch to ordinal data? I mean, now—she also has a degree in computational math, so she had the chops for it, but this would have taken her several months of work just to optimize this.

I looked into this problem, and it turned out that we could reason about whether it was useful or not for her to go down this endeavor without ever solving it, based on the experiments she had already conducted with the regression loss. And the answer turned out to be, no.

Now, we never have false negatives, it turns out, with this scheme, so I didn’t divert her from what would have been a useful path. That’s provable.

The other side of query optimization is the solving itself. And this is where, I think, a large aspect of the combinatorial structure that exists in these problems needs to be leveraged. What I’m showing you here are the running times of our internal solver, relative to a conjugate gradient method, which essentially varies between blue and green.

You can think of gradient descent to somewhat of a quasi-Newton method on the other extreme. And this is the number of times we need to touch the data. Now, the fewer times we can access the data, the substantially faster it will be.

What you see here is that, by reasoning about the geometry, essentially a runtime analysis of our query language, very carefully reasoning about the geometry of where we are, where we want to go, and what the landscape looks like around us, we can actually decrease the number of iterations substantially.

In some respects, gradient descent is like walking around with your eyes closed trying to get somewhere. And this is, at least, opening them a little bit to see what’s around you in the immediate realm, and using that information.

With this, I’d like to discuss some of the exciting bioinformatics results that we’ve actually achieved with this database perspective on the technology that we’ve built. We’re, like I said, just getting started, so there’s quite a bit coming.

But I find it very encouraging that, in the last few months, we’ve been able to pump out so many results. The first is our publication with Memorial Sloan Kettering. They had been struggling with this problem of figuring out, this personalized medicine, how to improve it for their chemotherapy drugs, given various cancer markers, for about a year.

They had a lot of really good data, but they had just been struggling to get something that worked. Now, at the time that I spoke with them, I was in Germany. My wife was at a conference, I was with her.

My mind was much more on drinking beer, and eating pretzels, and taking in the landscape. But after a week of working with them, we got results that beat the state-of-the-art by 35% and, like I said, is the system that they use now.

Another problem that we were inspired by was this great paper that came out for what is known as metagenomic binning. It’s essentially, if someone were to take a sample of gut bacteria, and got their DNA all scrambled up, they’d like to infer the proportion of different kinds of bacteria in your stomach.

There was a really cool paper that actually investigated that use of engram models for this and broke barriers for performance. But in order to do that, they used an approximate solver, the Vowpal Wabbit, which threw away any kind of model interpretability.

And they used the massive cluster that made this absolutely ridiculous to try to do. We looked at this problem, realized it actually fell within what we were doing, and managed to get this to work on the order—we can solve a better model, an exact model. It takes about an hour running on my wife’s laptop, which is pretty fast.

This actually resulted in a completely new method for assessing the quality of DNA. You sequence some DNA—is this really from the organism you think it is? And this was a previously unsolved problem that is now being used at Stanford. And we have several other examples, but what I’d like to jump to is something that I find very exciting.

Hopefully, we’ll submit this paper to NIPS. So hopefully, I’ll see you there and be able to discuss the exact results from it. But this was a response to a recent publication on using deep learning to understand, essentially, words, using deep learning at the character level on text to try to understand text reviews.

The question was, can we get comparable performance to the prior of knowing something about the English language that there is this word structure? So their model required—it was something like, I believe, five days per epoch to train on several GPUs, and it took 30 epochs for it to converge. You can do the math about how long that took.

I really hope they didn’t have any bugs in their code, because restarting that would be awful. And what they found was, of course they got great performance. But we tackled this problem with our feature representation schemes that I had shown you before, again, on the character level of all Amazon reviews.

With the help of Intel, we were actually able to get this to scanning through the data on the order of seconds to minutes through all of it. And this is billions to tens of billions of features that we’re scanning over, tens of millions of examples. And we were able to solve entire regularization paths of models—not just one, but about 50 of them—in order to do our parameter tuning on the order of days. This was pretty spectacular.

We, of course, got top-notch performance as well. But then, we were able to do some really cool things because of how fast we could play around with this. One of them was moving into a multitask scenario, where we trained common language models that took into account the different product categories.

This was now hundreds of models, so we multiplied the total number of coefficients that we need to learn by several hundred. And it turned out, there was just enough mathematical structure there to not incur any kind of performance penalty.

That was pretty crazy. So we were able to get results with that, too. And going further, actually, what we realize is, you can characterize, really, the entire mathematical structure of things like the Lasso and a lot of these estimators from this structured regularization paradigm to tell us, what are the fundamental limitations of machine learning for learning word structure?

Suppose I didn’t know it, what is the best I could hope to do? It turns out that, with existing methodologies, there are, probably, pretty severe restrictions on what you could hope to find out. But in understanding those restrictions, we were actually able to come out with a new class of regularizers that, I hope, will eventually become mainstream in their usage for when people work with characters that, probably, do a lot better in understanding the structure. And DNA is one of the things we’re going to try to understand with this.