Classify all the Things (with Multiple Labels)

July 23, 2018

Derrick Higgins of American Family Insurance presented a talk, “Classify all the Things (with multiple labels): The most common type of modeling task no one talks about” at Rev. Higgins covers multilabel classification, a few methods used for multilabel prediction, and existing toolkits. This blog post provides highlights, the video, and a transcript of the talk.

Session Summary

In his talk, “Classify all the Things (with multiple labels): The most common type of modeling task no one talks about”, Derrick Higgins advocates that data scientists consider “adding the appropriate tools to their toolkit for multilabel tasks”. He introduces the distinction between multilabel and multiclass classification, presents a few methods used for multilabel prediction, and provides pointers to helpful toolkits. Key highlights from the session include:

  • exploring multilabel datasets in the public domain including BibTeX, delicious, birds, medical, and scene
  • delving into multilabel versus extreme multilabel versus collaborative filtering
  • strategies data scientists can pursue to model multilabel tasks, including label powerset, RAKEL, Conditional Bernoulli Mixtures as well as structured prediction
  • coverage of toolkits such as sk-multilearn, pyramid, svm-struct, pyStruct, and more.

For more insights from this talk, watch the video or read through the transcript.


Video Transcript of the Presentation

Derrick Higgins:

All right, thanks, everybody, for coming out today. I am Derrick Higgins, the head of the data science and analytics lab at American Family. We are a special projects team there, we’ve been working on a lot of different things. This talk is inspired by a couple of projects that came our way recently, where all of the tasks seem to fall into a particular framing.

If you look at the kind of texts that aspiring data scientists are exposed to when they start in the field: look at blogs, you look at introductory machine learning textbooks and so on, you kind of get the perspective that there are basically two things that data scientists are being asked to do today. There are classification tasks and there are regression tasks.

This is from one of my favorite introductory machine learning textbooks, by Christopher Bishop: “In classification problems, the task is to assign new inputs to one of a number of discrete classes or categories. However, there are many other pattern recognition tasks, which we shall refer to as regression problems, in which the outputs represent the values of continuous variables.” So he’s not actually saying there’s nothing else in the world besides these two types of problems, but it’s kind of suggestive.

This is not actually that crazy, because I’m sure everybody here in the room knows, there are just a lot of tasks that fit into one frame or the other. If a data scientist has a good set of tools that she can use for classification tasks, multiclass classification tasks, for regression tasks, maybe some other stuff that’s unsupervised, clustering and dimensionality reduction, there’s just a lot that she can do. There’s a lot of tasks out there that she’ll be well equipped to take on. As her career progresses or as she takes on more and more responsibility, she’s going to need to broaden her tool set a little bit and add some more heavy-duty things for tasks like segmentation, machine translation. But really, just with that core set of tools for multiclass classification and for regression, she’s going to be pretty well equipped for those tasks.

Today, I want to talk about a slightly different but related set of tasks, in particular multilabel classification, and suggest that maybe a lot of data scientists out there might consider adding the appropriate tools to their toolkit for multilabel tasks, given how frequently they come up. Okay, so, multiclass versus multilabel classification, probably a lot of folks are already familiar with the distinction or what’s meant by the two. Multiclass classification is by far the more common task framing; we’re trying to take an input vector of Xs, so M input features, and map each of them to the most likely class of a set of N possible labels. The output space is guaranteed to be, the output labels are guaranteed to be mutually exclusive, and at least one, exactly one, has to be assigned to every input vector.

The multilabel framing is slightly different in that we have, again, a set of N output labels, but they’re not mutually exclusive and we’re not guaranteed that any of them are actually going to be appropriate for a given input instance. An example task of this type would be one that involves looking at information, say, images related to issues that occur in somebody’s home and trying to identify attributes of those instances. One instance may be an issue that is in the interior of the home, represents a hazard, and is electrical. Another may be a plumbing issue in the interior, so we have multiple attributes for these different types of instances.
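
To make the two framings concrete, here is a minimal sketch of how the targets differ. The label names are hypothetical, loosely based on the home-issue example above, and the helper `to_indicator` is introduced only for illustration.

```python
# Hypothetical label names, loosely based on the home-issue example in the talk.
LABELS = ["interior", "exterior", "hazard", "electrical", "plumbing"]


def to_indicator(tags, labels=LABELS):
    """Encode a set of tags as a 0/1 indicator vector over `labels`."""
    return [1 if name in tags else 0 for name in labels]


# Multiclass: exactly one class per instance, so a single index is enough.
multiclass_targets = [3, 4]  # "electrical", "plumbing"

# Multilabel: any subset of the labels (including none) per instance.
multilabel_targets = [
    to_indicator({"interior", "hazard", "electrical"}),
    to_indicator({"interior", "plumbing"}),
    to_indicator(set()),  # no label applies to this instance
]

for row in multilabel_targets:
    print(row)
```

Each multilabel target is a row of zeroes and ones, so a whole dataset becomes an instance-by-label matrix rather than a single column of class ids.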

A couple of interesting things to note or keep in mind about these types of tasks: one, these labels that we’re trying to predict can sometimes be very related to one another, either positively or negatively. So some of the attributes that we look at in an image to determine whether it’s electrical may be very similar to the attributes that we need to look at to determine if it’s a hazard or not.

The other thing about this sort of task is that there may be constraints that are hard or soft that, in the output space, they determine how coherent the labeling is. It could be that certain types of issues are both interior and exterior issues, so maybe there’s something to do with the windows, which are sort of at the border between interior of the home and exterior of the home. But it’s not going to be common. That’s going to be, by far, dispreferred to pass through a labeling that assigns both interior and exterior labels. Even more of a hard constraint, I guess, would be if you have some sort of a taxonomic relationship between your labels. We may have even more specific tags that say there’s an issue having to do with wiring, or there’s an issue having to do with junction boxes, and it would be confusing for downstream users if those specific tags were applied and not the more general electric tags, the electrical tag as well.

Once you start looking for these kinds of multilabel tasks, it seems like they kind of crop up all over. Here are some examples of multilabel datasets that are out there in the public domain. You can go to various repositories and download them. A couple of these, BibTeX and Delicious, have to do with semantic tags that users have assigned, in the case of BibTeX, to bibliographic entries; in the case of Delicious, to web pages. The Birds dataset is actually a set of audio snippets where the task is to identify the species of birds that are present in the environment, based on the vocalizations that you can hear in the audio clips. Of course, there might be multiple birds that are present. Medical is a dataset of free text clinicians’ notes, and the task there is to identify diagnosis codes that should be assigned to a patient based on the notes that the doctor or clinician took.

And then finally, the Scene dataset is a dataset of images that are classified by scene types, so they could be a beach scene, there could be a city scene, and of course there could be, well, maybe not city and beach so much. I’m from Chicago, so that’s okay. Maybe city and sunset are a reasonable combination. These are pretty diverse in terms of the number of labels, ranging from six labels to up to a thousand labels, almost, for Delicious, and then the number of rows as well. Under a thousand up to, like, 16,000 for Delicious. That kind of spans the space of what I really want to talk about in terms of multilabel modeling.

In the multilabel classification task type I’m talking about, we have these predictive features, this matrix of features X where we’ve got N instances and L features to use for prediction associated with each of those, and then we have these M labels where M is sort of a manageable number. It’s, like, less than a thousand tags or labels that we want to assign to each of our input instances.

There is another type of task that people talk about sometimes called extreme multilabel classification, and that is what you might expect; we have kind of a less manageable number of labels that you want to predict for each of your input instances. It might be 100,000 labels, it might be a million tags, and these things can be, for instance, songs from a playlist a user has chosen to play, or items from an inventory that a user has bought. So semantically, in terms of what you’re trying to achieve, often it’s very similar to collaborative filtering tasks, with the main difference being that, in this extreme multilabel classification task, you have these additional predictive features where, traditionally, in collaborative filtering, you don’t have the features that you’re using to predict as well, you have just this set of products or songs or labels, whatever, and then a set of users or instances that are associated with them.

Okay, so, at a high level, for multilabel tasks, there are three general strategies I’m going to say you can pursue to try and model these tasks. The first one is really tempting, it’s just sort of put your head in the sand, pretend there’s no problem, deal with it using the tools you have and forget you ever heard the word “multilabel.” This is maybe not as crazy as it might sound, because there are a lot of tasks out there, including some of the ones that my team has been working on, where it is not strictly a multiclass classification task, but it’s pretty close. It’s almost always the case, in fact, that your train vectors and your test vectors that you’re trying to do predictions for are one-hot vectors of labels. It’s almost always the case that at least one label is applied and it’s almost always the case that it’s a unique label.

So you can imagine if we kind of focus our attention just on the sub-part of this home issue classification task that has to do with the sub-system of the home that is affected, then electrical issues are going to be pretty common, plumbing issues are going to be pretty common, issues to do with a specific appliance might be common. And occasionally you’ll get something where there’s an electrical issue to do with the water heater and some associated plumbing stuff related to that, but it’s going to be far less frequent than any of these tags occurring independently. It might be possible to sort of massage your data in some way and treat the task as if it was just multiclass classification, and I think that is important, at least, to include as a reasonable baseline when you’re doing this kind of modeling.

If we were to do this graphically, what we’re doing here is some sort of sub-sampling, throwing away training data that has multiple classes associated with it or somehow transforming the data to discard labels that we think are less important. And then once we’ve done that, we can just train our favorite multiclass classifier and then do one-hot encoding of the outputs, and you get a matrix that you can feed into whatever your downstream multilabel evaluations are and compare with whatever fancier methods you might consider.
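
The data handling for this baseline can be sketched roughly as follows. This is only one way to do the sub-sampling (dropping every row that does not have exactly one label); the function names are hypothetical and the multiclass classifier itself is left out.

```python
def reduce_to_multiclass(X, Y):
    """Keep only rows with exactly one active label; return (X_kept, class_ids)."""
    X_kept, class_ids = [], []
    for x, y in zip(X, Y):
        if sum(y) == 1:
            X_kept.append(x)
            class_ids.append(y.index(1))
    return X_kept, class_ids


def one_hot(class_ids, n_labels):
    """Turn predicted class ids back into a 0/1 label matrix."""
    return [[1 if j == c else 0 for j in range(n_labels)] for c in class_ids]


# Toy data: four instances, three labels; the third row carries two labels.
X = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.7, 0.1]]
Y = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]

X_single, classes = reduce_to_multiclass(X, Y)
print(classes)             # class ids for the rows that were kept
print(one_hot([0, 2], 3))  # multiclass predictions as a multilabel matrix
```

The one-hot matrix has the same shape as a true multilabel prediction matrix, so it can be scored with the same multilabel metrics as the fancier methods.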

That was strategy one. Strategy two, then, is, okay, bite the bullet, realize we actually have a multilabel problem that we need to deal with in some way, but, you know, I’ve got this set of tools; is there some way that I can transform the task or find a way that I can apply the tools that I have, apply what I’ve got to just solve the problem that I have? There are a variety of problem transformation approaches to this general set of tasks, and the first of these, which is called binary relevance, is probably the first thing that you all thought of when I put up the multilabel task to begin with.

Namely, that you’re just going to build a bunch of binary classifiers for all of the labels in your output space, which basically amounts to, just assuming the conditional independence of all of the labels that you’re dealing with, and just, I guess, in passing I’ll note that I’m going to be using, bold X, as the vector of input features, and bold Y as the feature, the vector of output labels. It’s pretty conventional. So again, graphically, this is what it looks like. We’ve got a bunch of input instances, we’re going to pass them to a bunch of independent classifiers that are associated with each of our output classifiers. Sorry, output labels: one separate classifier for interior, one classifier for exterior, one classifier for hazard and so on.

Great, we have a way of doing the task that allows for multiple labels to be assigned to any given input instance, but we’re not really handling the dependency between the labels at all. There’s no way of kind of penalizing combinations that we think are disfavorable or should be disfavored or are unlikely, such as interior and exterior at the same time.
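
As a rough illustration of binary relevance, the sketch below trains one independent classifier per label column. The toy `OneNearestNeighbor` class is an assumption made to keep the example dependency-free; it stands in for whatever binary classifier you prefer.

```python
class OneNearestNeighbor:
    """Toy 1-nearest-neighbour classifier; a stand-in for any base classifier."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def nearest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
            return self.y[dists.index(min(dists))]

        return [nearest(x) for x in X]


class BinaryRelevance:
    """One independent binary classifier per label; no label dependencies."""

    def __init__(self, make_classifier):
        self.make_classifier = make_classifier

    def fit(self, X, Y):
        n_labels = len(Y[0])
        # Each model only ever sees its own label column.
        self.models = [
            self.make_classifier().fit(X, [row[j] for row in Y])
            for j in range(n_labels)
        ]
        return self

    def predict(self, X):
        columns = [model.predict(X) for model in self.models]
        return [list(row) for row in zip(*columns)]


# Hypothetical labels: interior, exterior, hazard.
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y_train = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]

br = BinaryRelevance(OneNearestNeighbor).fit(X_train, Y_train)
pred = br.predict([[0.1, 0.9], [0.9, 0.2]])
print(pred)
```

Nothing in this structure stops the interior and exterior models from both firing on the same instance, which is the limitation described above.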

So that’s that one extreme of our set of options for transforming the task, where we just assume that all the labels are completely independent. This label powerset method, then, is at the extreme opposite end of the scale, where we’re going to say, actually, any combination of labels is entirely unique, distinct and shares no characteristics with any other combination of labels that we could assign to a given input instance. It’s called label powerset because we’re basically taking the powerset of these two to the N possible combinations of labels as the set of classes that we can consider assigning in a multiclass classifier. We have some classifier that’s going to do scoring in the output space, and then we’re going to normalize over the entire powerset of label combinations to get a probability for that.

Now, I should say, in reality, just a small subset of the powerset of label combinations are actually going to be instantiated in our training data, so we don’t typically have to deal with two to the N, but still, we have to deal with a pretty large number of label combinations. So this is what it looks like graphically: approximately, we’re going to just build one big multiclass classifier, and it’s going to have a ton of output classes. It’s going to have hazard assigned on its own as one possible output class, it’s going to have electrical as another output class. And then unrelated to either of these, it’s going to have a class like exterior-slash-hazard-slash-electrical, and then it’s going to have yet another appliance-slash-hazard output class, and just more than I can depict on this slide. All of these different combinations of labels that could be applied.
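
The label powerset transformation itself is small. A minimal sketch, with hypothetical function names, that maps each label combination seen in training to its own class id:

```python
def fit_powerset(Y):
    """Map each distinct label combination seen in training to a class id."""
    combos = sorted({tuple(row) for row in Y})
    combo_to_class = {combo: i for i, combo in enumerate(combos)}
    class_ids = [combo_to_class[tuple(row)] for row in Y]
    return combos, class_ids


def decode_powerset(class_ids, combos):
    """Turn predicted class ids back into 0/1 label rows."""
    return [list(combos[c]) for c in class_ids]


# Three labels allow 2**3 = 8 combinations, but only three occur here.
Y = [[1, 0, 0], [1, 0, 1], [0, 1, 0], [1, 0, 1]]
combos, class_ids = fit_powerset(Y)

print(combos)     # the instantiated label combinations
print(class_ids)  # targets for an ordinary multiclass classifier
```

Any multiclass classifier can then be trained on `class_ids`, and its predictions decoded back to label rows with `decode_powerset`.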

This is kind of a brute-force extreme approach; obviously there are some problems with it. For example, the lack of information-sharing between classes that would seem to be related, like interior-slash-hazard and interior-hazard-electrical. And also sparse data issues, where the support for each of these classes is likely to be very small, or at least some of these classes are likely to have very low support in the training data. But if there are really kind of unique, distinct characteristics that some of these combinations of labels have, like interior hazards legitimately having different evidence behind them than exterior hazards or hazards that don’t have either of those tags, it could be effective.

We mentioned some of the drawbacks of label powerset, and there’s this random K-labelsets method, also called RAKEL, which is an attempt to address some of the drawbacks of label powerset while still leveraging the fundamental method. I’m going to go a little bit out of order here, but the idea is that you’re going to build label powerset classifiers, but only for sets of K labels at a time, so K may be three, K may be four. You’re going to build a label powerset classifier that’s only responsible for, say, four labels, and then you’re going to build a whole ensemble of those. This is saying that, in our set of label powerset classifiers, each is responsible for K labels, and there are J of them in total.

Each label powerset classifier in the ensemble is going to assign a score to a given label, so if we’re trying to figure out the score for label Y-sub-i, which could be electrical, then we’re going to count that classifier’s score for all the label subsets that include that label, Y-sub-i, so we’re going to count the score for electrical, count the score for hazard and electrical, count the score for interior electrical, we’re going to add all of those up, and then we’re going to normalize by the score of all the label subsets that that label powerset classifier is responsible for. That’s each classifier in the ensemble of label powerset classifiers, and then we’re going to aggregate across all those label powerset classifiers to get an overall score for each label at the RAKEL level. We do that by basically averaging over all of the label powerset classifiers which have that label in its output space, because not every classifier in the ensemble is going to have the opportunity to even assign a label like electrical, because it may not be among the K that it’s responsible for.

That’s probably much less clear than a picture; we’re just going to have, instead of one big multiclass classifier, we’re going to have a number of big multiclass classifiers where the output space of each is the powerset of a subset of the total set of labels. This is a case where K equals three, this multiclass classifier is going to be responsible for hazard, electrical and interior. This one’s going to be responsible for plumbing, appliance and interior, and this one’s going to be responsible for hazard, plumbing and exterior. They’re each going to assign a score to the labels in their set, and then those are going to be aggregated using the method that I mentioned on the prior slide.
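
The aggregation step can be sketched as below, assuming each label powerset classifier in the ensemble has already produced a score for every combination of its own labels. The scores and label subsets are made up, and K is two here only to keep the example short.

```python
def label_scores_from_powerset(labels, combo_scores):
    """Per-label score from one label powerset classifier.

    labels: tuple of label names this classifier is responsible for.
    combo_scores: dict mapping a 0/1 tuple (aligned with `labels`) to a score.
    """
    total = sum(combo_scores.values())
    return {
        # Sum the scores of all combinations that include this label, then normalize.
        name: sum(s for combo, s in combo_scores.items() if combo[i] == 1) / total
        for i, name in enumerate(labels)
    }


def rakel_scores(ensemble_outputs):
    """Average each label's score over the classifiers that cover it."""
    sums, counts = {}, {}
    for labels, combo_scores in ensemble_outputs:
        per_label = label_scores_from_powerset(labels, combo_scores)
        for name, score in per_label.items():
            sums[name] = sums.get(name, 0.0) + score
            counts[name] = counts.get(name, 0) + 1
    return {name: sums[name] / counts[name] for name in sums}


# Two hypothetical ensemble members, each covering K = 2 labels.
ensemble_outputs = [
    (("hazard", "electrical"),
     {(0, 0): 1.0, (1, 0): 1.0, (0, 1): 2.0, (1, 1): 4.0}),
    (("electrical", "plumbing"),
     {(0, 0): 1.0, (1, 0): 3.0, (0, 1): 0.0, (1, 1): 0.0}),
]

scores = rakel_scores(ensemble_outputs)
print(scores)
```

Here "electrical" is covered by both members, so its score is an average of two estimates, while "hazard" and "plumbing" each rely on a single member.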

So yeah, RAKEL attempts to address some of the problems with label powerset, but it comes at the cost of introducing a couple of hyper-parameters. Now we have this hyper-parameter J, which is the number of ensembles we’re going to use. We also have hyper-parameter K, which is the number of labels that’s in each label powerset classifier. We might have to search over those.

Okay, so somewhat simpler to describe is classifier chains, and this is what it sounds like: you’re going to build a bunch of classifiers, but you’re just going to apply them in serial. The output of the first classifier will be a single label, and then that predicted label will be an input to the next classifier in the chain, and you’ll go like that through the entire sequence. The architecture here is pretty simple: one classifier is first going to predict this interior label, the predicted value of the interior label will be input to a classifier for the next label, exterior, and so on down the line so we predict all our labels. So that’s a relatively simple way of introducing dependencies among the different labels and allowing for some conditioning of each on the next.

But again, there are drawbacks as well. One of those is that, of course, errors in each of these classifiers are going to propagate, and so the longer this chain is, the bigger problem it’s going to be. The other problem is that there’s really no rationale for why we decided to predict interior first; we could’ve predicted interior last and conditioned that on electrical and so on, and the random order in which you choose to chain these classifiers may not be optimal.
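
A classifier chain can be sketched in a few lines. The version below is an illustration under simplifying assumptions: a toy 1-nearest-neighbour base classifier, and training each link on the true values of the earlier labels, which is one common choice rather than the only one.

```python
class OneNearestNeighbor:
    """Toy 1-nearest-neighbour classifier; a stand-in for any base classifier."""

    def fit(self, X, y):
        self.X, self.y = list(X), list(y)
        return self

    def predict(self, X):
        def nearest(x):
            dists = [sum((a - b) ** 2 for a, b in zip(x, xi)) for xi in self.X]
            return self.y[dists.index(min(dists))]

        return [nearest(x) for x in X]


class ClassifierChain:
    """Predict labels in a fixed order, feeding each prediction to the next link."""

    def __init__(self, make_classifier, order):
        self.make_classifier = make_classifier
        self.order = list(order)  # label indices, in chain order

    def fit(self, X, Y):
        self.models = []
        augmented = [list(x) for x in X]
        for j in self.order:
            target = [row[j] for row in Y]
            self.models.append(self.make_classifier().fit(augmented, target))
            # Later links see the true value of this label as an extra feature.
            augmented = [a + [t] for a, t in zip(augmented, target)]
        return self

    def predict(self, X):
        augmented = [list(x) for x in X]
        out = [[0] * len(self.order) for _ in X]
        for j, model in zip(self.order, self.models):
            preds = model.predict(augmented)
            for i, p in enumerate(preds):
                out[i][j] = p
            # At prediction time the extra feature is the predicted label,
            # which is how an early mistake propagates down the chain.
            augmented = [a + [p] for a, p in zip(augmented, preds)]
        return out


# Hypothetical labels: interior (index 0), exterior (index 1).
X_train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
Y_train = [[1, 0], [1, 0], [0, 1], [0, 1]]

chain = ClassifierChain(OneNearestNeighbor, order=[0, 1]).fit(X_train, Y_train)
chain_pred = chain.predict([[0.1, 0.2], [0.9, 0.8]])
print(chain_pred)
```

Changing `order` to `[1, 0]` gives a different model, which is the ordering question raised above.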

All right, so the last method I’m going to talk about that involves problem transformation is something called Conditional Bernoulli Mixtures. These are mixture models; the basic idea is that you’re going to do a soft assignment of your input instances to a cluster or to mixture components, and then based on that cluster, you’re going to use a specialized classifier to assign the labels. In particular, it’s going to be a binary relevance classifier, which means that the label assignment within each mixture component is going to be independent. So the individual classifiers in your ensemble at the second level are going to treat the labels as independent, but because of the mixture model structure, because this is conditioned on the mixture component membership for each instance, the labels are going to be dependent in the model as a whole.

There are two models, or two types of models that need to be trained in a Conditional Bernoulli Mixture model: there’s this top-level model that says, okay, I’ve got these capital-K clusters or capital-K components in my mixture model, which one does this instance belong to? And then there is this lower-level model, which is just our standard binary relevance model. So if we’re in cluster three or if we’re in mixture component three, mostly, what is the probability of each label, based on the characteristics of the instance and based on what we know about this cluster?
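
The way the mixture structure produces dependent labels can be seen in a small numeric sketch. For a single fixed instance, it assumes the top-level model has already produced mixture weights and each component's binary relevance model has produced per-label probabilities; all numbers are invented for illustration.

```python
from itertools import product


def cbm_joint(mixture_weights, component_label_probs, y):
    """p(y | x) = sum_k pi_k(x) * prod_l Bernoulli(y_l; p_kl(x))."""
    total = 0.0
    for pi_k, probs in zip(mixture_weights, component_label_probs):
        term = pi_k
        for y_l, p in zip(y, probs):
            # Within a component, labels are treated as independent.
            term *= p if y_l == 1 else (1.0 - p)
        total += term
    return total


# Hypothetical values for one instance x, with labels (interior, exterior).
mixture_weights = [0.5, 0.5]   # top-level model: weight of each component
component_label_probs = [
    [0.9, 0.1],                # component 1: mostly interior
    [0.1, 0.9],                # component 2: mostly exterior
]

joint = {
    y: cbm_joint(mixture_weights, component_label_probs, y)
    for y in product([0, 1], repeat=2)
}
p_interior = sum(p for y, p in joint.items() if y[0] == 1)
p_exterior = sum(p for y, p in joint.items() if y[1] == 1)

print(joint[(1, 1)])            # joint probability of both labels
print(p_interior * p_exterior)  # what full independence would predict
```

The joint probability of interior and exterior together comes out well below the product of the two marginals, so the model as a whole penalizes that combination even though each component treats the labels independently.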

So here’s how that looks. We’ve got this sort of hierarchical structure to a model, where we’ve got this multiclass classifier that we’re training. Again, it could kind of be whatever your favorite classifier is, subject to some constraints I’m going to talk about related to training. And then at the lowest level, you’ve got these binary relevance models, which are relatively simple, just independent binary models for each of the labels. So two issues with Conditional Bernoulli Mixtures, one has to do with training: it’s not simply a model that we can train end-to-end using gradient descent or something like that because, you’ll recall a couple slides ago, we have this Z. This Z here is the latent class distribution or latent mixture component distribution for a given instance, and that we can’t observe. So we have to estimate it iteratively.

Instead, we have to do Expectation Maximization, where we come up with some estimate of Z and then, based on the current estimate of Z, we’ll train each of our classifiers, and then we’ll update our estimate of Z, so the training procedure’s a little bit more complicated and also more time-consuming. And then inference can be complicated as well. In this model, what we want to do is not actually just find the expectation from the model for each of the labels independently. Instead, we want to find the optimal label assignment across the entire label space, which may be different from considering each of the labels independently.

For the details on that, I’m just going to refer you to the paper. I think the citation may have fallen off the bottom of the slide here. Happy to provide the slides and citations afterwards. But there’s some complexity there in doing dynamic programming or sampling to get the inference.

That’s a very quick survey of different methods you can use related to transforming the problem into something we can approach with standard tools for multiclass classification or binary classification. The third and final general strategy that we have for dealing with multilabel classification tasks is to really pull out the heavy machinery and decide we’re going to treat this as a structured prediction task. You may be familiar with structured prediction in other contexts; it’s kind of a standard set of tools and methodology for predicting structured objects, rather than predicting something that’s like a number or a class.

This comes up a lot in computer vision or natural language processing, for instance. Computer vision’s typical structured prediction task might be image segmentation, where you’ve got a bunch of pixel-label pairings that you want to predict and they’re all strongly related to one another. Or in natural language processing, where you’ve got this dependency structure that you want to assign to a sentence, or a tag structure that may associate a given tag with each word. There’s strong interdependencies between ways that part of speech tags are associated with each word.

In these types of contexts, you don’t really have much of a choice. These are just structured objects, and you need to find some way of fitting these into a standard classification framework, and structured prediction’s a way of doing that. But the same methodologies can be applied even when you don’t have quite so much structure in your output space as in the multilabel case where we have just a bunch of labels that may be inter-correlated or anti-correlated.

In a multilabel classification task, we’re trying to take a set of inputs, Xs, and map them into this space. Map them into a set of Ys, a set of labels, and the fundamental idea of structured prediction is we’re going to transform that task from mapping inputs into outputs into one in which we map pairings of inputs and candidate outputs into some score. So instead of generating predictions, we’re going to be critiquing predictions that somebody has already given us, somehow.

The challenge there, of course, is that we don’t anymore have a very straightforward way of doing inference. In sort of a more standard classification paradigm, we can just feed information forward to our network or apply our classifier and get the optimal Y that is predicted. In a structured prediction framework, we actually have to somehow compute this argmax, find the Y that maximizes this scoring function.

This is, I promise, the last diagram of this type that I’m going to show. Basically, the idea is you’re going to now pair up your candidate labelings with your input instances and then pass this to a scoring function. That’s the final of structured prediction. Maybe the most popular way of doing structured prediction is a framework called conditional random fields. This is really popular in natural language processing, especially, where you’ll see linear chain conditional random fields, where the features have to do with the label assignments from adjacent words or something like that. But basically, it’s just a log-linear model. It’s a globally normalized log-linear model that’s going to apply to an instance and the entire candidate labeling that we’ve assigned to it. Log-linear model–this is just the normalizing constant that we typically don’t have to compute.

The thing that is different about a conditional random field is that we have two different types of features that are associated with parameters, these lambdas in the model. There are these feature functions F, that have to do with how strongly our predictive features are associated with our output labels, and then there are these feature functions G that have to do with how strongly pairs of output labels are related to one another. Basically, whether the labels are correlated with one another.

There is some complexity to making the training efficient, especially because you have to do inference in the process of training, but once you solve these mathematical problems, they can be trained using convex optimization. Again, because we’re doing approximate inference to compute this argmax, there are a couple of approaches there that could be used. You can consider only the supported combinations, so when you’re considering all the combinations that could be candidates for putting into this scoring function, you could only consider the ones that actually showed up in your training set, which is going to be much smaller than the set of possibilities — the powerset. Or you can do binary pruning, where you basically just build, in parallel, a binary relevance classifier, and then you only consider labels that get a score above some threshold when computing the different combinations that could be scored.
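
The “supported combinations” idea can be illustrated with a toy log-linear scorer. This is a simplified sketch, not a trained conditional random field: the weights are invented, features only fire when a label is on, and the normalizing constant is skipped because it does not affect the argmax.

```python
def score(x, y, unary_weights, pair_weights):
    """Unnormalized log-linear score for an (input, candidate labeling) pair.

    unary_weights[j] ties the input features to label j (the F-type features);
    pair_weights[(j, k)] scores labels j and k being on together (the G-type features).
    """
    s = 0.0
    for j, on in enumerate(y):
        if on:
            s += sum(w * xi for w, xi in zip(unary_weights[j], x))
    for (j, k), w in pair_weights.items():
        if y[j] and y[k]:
            s += w
    return s


def predict_supported(x, supported, unary_weights, pair_weights):
    """Argmax of the score over label combinations seen in training."""
    return max(supported, key=lambda y: score(x, y, unary_weights, pair_weights))


# Hypothetical labels: interior, exterior, electrical.
unary_weights = [[1.0, 0.0], [0.8, 0.0], [0.0, 1.0]]
pair_weights = {(0, 1): -2.0}  # interior and exterior together is penalized

# Only combinations that appeared in the (imaginary) training set are scored.
supported = [(1, 0, 0), (0, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 0)]

x = [1.0, 0.5]
best = predict_supported(x, supported, unary_weights, pair_weights)
print(best)
```

With three labels the full powerset has eight candidates, so the saving here is trivial, but the same loop scales with the number of observed combinations rather than with two to the N.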

Very similar to conditional random fields are structural SVMs, very similar in that they use these feature functions that operate between feature label pairs and pairs of labels that come from a factor graph. Again, there’s some complexity in how to actually optimize these things. It’s an approximation to the standard support vector machine training problem where you iteratively select a subset of the constraints that you would be trying to satisfy, and the trade-off there is some epsilon difference in the accuracy of the convergence. So I’m not going to go into that in great detail, but it’s very, very similar to the conditional random fields setting.

And then the last structured prediction method that I’m going to talk about is deep value networks. As with anything in data science, yes, there is a bleeding-edge deep neural network approach to doing structured prediction; actually, a couple of them. Deep value networks is the one that came out in a, I think, ICML paper last year, so I’ll talk about that one.

All these structured prediction methods basically are critiquing functions: you take your input X, your candidate output Y, and compute some score. For deep value networks, the score specifically is a task-specific loss function, so if you’re doing image segmentation, a standard evaluation metric there is IOU, intersection over union, comparing sets of pixels. If you’re doing natural language processing, it might be some measure of constituency overlap for dependency structures. But for multilabel classification, one of the metrics we care about is instance F1. The deep value network is going to take these pairs of feature vectors and candidate labelings and see if it can predict what the instance F1 value will be for those labelings.

So if we’re going to do this, you need a couple things. First thing is you need training data; you need to have paired feature-prediction-loss function triples. Not pairs, triples. There are a few ways to generate these: you can randomly generate these candidate labelings from your dataset, or you can select according to some model internal criteria, based on how likely those labelings are or based on how informative that labeling will be for the model. Actually, in the paper they use a sort of hybrid between these, sampling the candidate labelings according to all three of these criteria together.

And then once you’ve done that, you just train the network to try and predict this loss function based on the sample of synthetic training data that you’ve put together. It’s kind of interesting, at the inference stage as well where we, again, we have to compute the argmax for this scoring function for deep value networks, because it’s a neural network, we can actually use backpropagation to do this, because your network is going to take a vector of input features that’s paired with a candidate labeling, and then the output’s going to be the loss function.

What you’ll do is, at inference time, we’ll input the feature vector together with a dummy labeling, which is just all zeroes. You’ll feed it forward through the network to get the estimated loss value, and then you’ll backpropagate it in order to minimize that loss. But you won’t change any of the weights in the networks, those will be frozen. You’ll only change the Y values in the input space. Just do that until convergence to get the best Y, or the best Y vector, which is the label that you want to assign.
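
The inference loop can be sketched without a real network. In the toy below, `value` is a hand-written differentiable function that stands in for a trained value network (an assumption made purely for illustration); its parameters `W` stay frozen and only the candidate labeling `y` is updated by gradient steps.

```python
def value(x, y, W):
    """Stand-in for a trained value network: higher means a better labeling."""
    target = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return -sum((yj - tj) ** 2 for yj, tj in zip(y, target))


def value_grad_y(x, y, W):
    """Gradient of `value` with respect to y only; W is treated as frozen."""
    target = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    return [-2.0 * (yj - tj) for yj, tj in zip(y, target)]


def infer_labels(x, W, n_labels, step=0.1, n_steps=200):
    """Gradient-based inference over the labeling, starting from all zeroes."""
    y = [0.0] * n_labels  # the dummy labeling
    for _ in range(n_steps):
        grad = value_grad_y(x, y, W)
        # Move y toward a higher predicted value and keep it inside [0, 1].
        y = [min(1.0, max(0.0, yj + step * g)) for yj, g in zip(y, grad)]
    return y


W = [[1.0, 0.0], [0.0, 1.0], [0.2, 0.2]]  # frozen "network" parameters
x = [0.9, 0.1]

y_relaxed = infer_labels(x, W, n_labels=3)
y_hat = [1 if v >= 0.5 else 0 for v in y_relaxed]
print(y_hat)
```

With a real network the gradient would come from backpropagation rather than a closed-form expression, and the final thresholding step is one simple way to turn the relaxed values back into a 0/1 labeling.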

I can see you’re all very excited now to learn about the new methodologies you’re going to use for multilabel tasks. You want to go home and do this, you realize you have three multilabel datasets sitting on the shelf and, where’s the toolkit I can use for all this stuff? The bad news is there’s not really a toolkit that you can use for all of these methods. There is a thing called sk-multilearn, which is great; it sounds like it should be sklearn for multilabel modeling. It does have some good functionality in there, but really, it’s much more around the methods for problem transformation, as opposed to the structured prediction stuff. The structured prediction stuff is a little bit more distributed across different toolkits.

The pyramid toolkit has some good stuff for structured prediction, although not structured SVMs, you have to get those from another aisle in the supermarket. And the deep value networks are just in a git project that you can download. You kind of have to mix and match a little bit to do all this different stuff.

I want to close with just a brief evaluation, but before I do that, a couple quick notes about evaluation, how to do it in the multilabel context. This is typically what we have as input to our evaluations: we’ve got a set of known gold-standard labels for each of our N instances, and then whatever classifier structure we use, we get predicted labels for each of those instances, and then we want to compute our evaluation metrics based on these two matrices. But there are two ways to do it, for at least many of the metrics.

One is instance-based evaluation, whether you’re computing precision or recall or f-measure or whatever, you’re going instance by instance and computing that score for each instance and then averaging or aggregating across all the rows. The other, unsurprisingly, is label-based aggregation, where you’re computing precision, recall, f-measure for each label, each column independently, and then aggregating across those. Just be aware those both exist, they both get reported in published papers, they’re both reported by different toolkits, and they’re not at all comparable. So make sure you’re computing the right thing and comparing the right way across different toolkits, which you, unfortunately, will have to use.
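To see how far apart the two aggregations can land, here is a small sketch that computes F1 both ways on the same pair of matrices. The toy matrices, the helper name `f1_along`, and the convention that F1 is 0 when a row or column has no gold and no predicted labels are assumptions for illustration.

```python
import numpy as np

def f1_along(Y_true, Y_pred, axis):
    """F1 per row (axis=1, instance-based) or per column (axis=0, label-based)."""
    tp = np.sum(Y_true * Y_pred, axis=axis)
    denom = np.sum(Y_true, axis=axis) + np.sum(Y_pred, axis=axis)
    # Assumed convention: F1 is 0 when nothing is gold or predicted
    safe = np.where(denom == 0, 1, denom)
    return np.where(denom == 0, 0.0, 2.0 * tp / safe)

# Rows are instances, columns are labels (hypothetical toy data)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

instance_f1 = f1_along(Y_true, Y_pred, axis=1).mean()  # 13/18, about 0.722
label_f1 = f1_along(Y_true, Y_pred, axis=0).mean()     # 5/9, about 0.556
```

Same predictions, same gold labels, and the two numbers differ by more than 0.16, which is why they cannot be compared across papers or toolkits.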

Also, there are multiple evaluation metrics that you might care about. I’ve been talking about F1, in particular instance F1, where F1 is the harmonic mean of precision and recall. Another important and more stringent evaluation criterion is subset accuracy, which is how often you assign exactly the correct subset of the total set of labels to each instance. That’s actually the measure that many of these methods are optimizing for. Overlap score is very similar to IOU in image segmentation: it’s the intersection between the predicted and gold-standard labels divided by the union of the two, allowing for partial credit, much like F1. And then Hamming loss is just the total number of zero-one misses across all of the individual instance-label combinations in the test set.
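These three metrics are short enough to write out directly. The sketch below uses hypothetical toy matrices. It reports Hamming loss both as the raw count of misses described above and as the fraction of instance-label cells that many toolkits report, and it scores an instance with no gold and no predicted labels as a perfect overlap; both choices are assumptions of the sketch.

```python
import numpy as np

# Rows are instances, columns are labels (hypothetical toy data)
Y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
Y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 0, 1]])

# Subset accuracy: fraction of instances whose whole label set is exactly right
subset_accuracy = np.mean(np.all(Y_true == Y_pred, axis=1))

# Overlap score: per-instance intersection over union, then averaged
inter = np.sum(Y_true & Y_pred, axis=1)
union = np.sum(Y_true | Y_pred, axis=1)
safe_union = np.where(union == 0, 1, union)
overlap = np.mean(np.where(union == 0, 1.0, inter / safe_union))

# Hamming loss: zero-one misses over all instance-label cells
hamming_misses = int(np.sum(Y_true != Y_pred))
hamming_loss = hamming_misses / Y_true.size
```

On this data only one of three instances is exactly right (subset accuracy 1/3), while the overlap score gives partial credit (11/18) and Hamming loss counts 3 misses out of 9 cells.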

Okay, so, here are the evaluations, and this is far from comprehensive; this is just kind of what I could do in the course of a couple weeks, using methods from different toolkits. There’s one evaluation measure, which is instance F1, and two different datasets, BibTeX being a little bit bigger, Medical being a little bit smaller along both dimensions. First place to start, maybe, is binary relevance, because it’s kind of the first thing that most of us would try, and it does okay. But then compare that with just putting our heads in the sand and treating it as a multiclass problem, observing the constraint that no more than one label, in fact, exactly one label has to be assigned to every instance. Maybe it doesn’t do quite as well for BibTeX, but it actually does better for Medical.

It may be useful to think about why that would be, that we’re bringing in this external knowledge, this external constraint, that you can’t have no label assigned to any given instance. That can help to kind of counteract shrinkage issues that can happen with these binary relevance models, where they’re reluctant to go out on a limb and predict a rare class.
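One hypothetical way to see the effect of that constraint is to apply a weaker version of it to binary relevance outputs: threshold each label independently, then force the single most likely label onto any instance that would otherwise get none. This is not the multiclass setup evaluated in the talk, and the function name, threshold, and probabilities are assumptions for illustration.

```python
import numpy as np

def predict_with_min_one_label(probs, threshold=0.5):
    """Threshold per-label probabilities, then guarantee at least one label per row."""
    preds = (probs >= threshold).astype(int)
    # Rows where every per-label classifier stayed below the threshold
    rows = np.flatnonzero(preds.sum(axis=1) == 0)
    # Assign the single most probable label to each of those rows
    preds[rows, np.argmax(probs[rows], axis=1)] = 1
    return preds

# Hypothetical per-label probabilities from independent binary classifiers
probs = np.array([[0.70, 0.20, 0.60],
                  [0.30, 0.40, 0.10],
                  [0.10, 0.05, 0.20]])
preds = predict_with_min_one_label(probs)
```

The second and third instances would have received no labels from plain thresholding; with the constraint, each gets its top-scoring label, even a rare one the classifiers were reluctant to predict.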

The problem transformation approaches: label powerset does really badly on BibTeX, but does a little bit better on the Medical dataset, maybe related to the fact that there are many more labels; there are about three times as many labels in the BibTeX dataset as in Medical. RAKEL still doesn’t do as well as binary relevance here, but for Medical, does a little bit better. Classifier chain gives us a little bit of a bump for Medical, but it is basically the same as binary relevance. And then the Conditional Bernoulli Mixtures, among all the problem transformation approaches, performed the best. Possibly because it’s the most complex model.

Then looking at the structured prediction methods, conditional random fields, I mean, I guess there’s a reason they’re so popular. They do very well across both datasets. Structured SVMs I think I probably didn’t give a fair shake here because I didn’t let the algorithm run to convergence; I ran out of patience after a couple days. And then deep value networks; this is a really interesting case, because relative to the baseline here of binary relevance, deep value networks do great on BibTeX, kind of competitive with CRF, but really, really awful on Medical.

I’m not sure exactly how to allocate the blame here, but I will say BibTeX is a dataset that was reported on in the associated paper for deep value networks, so I was able to replicate the results that were reported there. But I was unable to get anything working very well at all for Medical. Could be that I didn’t try the right hyper-parameters or something, or it could be something more fundamental to the internals of the model here where, again, these different ways of synthesizing training data for the neural network just don’t work well for this Medical task.

Maybe that resonates with some of you who have tried bleeding edge neural network techniques that were reported a year ago at ICML and had difficulty putting them to practical use. Anyway, that’s all I have, so thanks for your attention.

This transcript has been edited for readability.