Model Evaluation

June 14, 2018

This Domino Data Science Field Note provides some highlights of Alice Zheng’s report, “Evaluating Machine Learning Models”, including evaluation metrics for supervised learning models and offline evaluation mechanisms. The full in-depth report, which also covers offline vs. online evaluation mechanisms, hyperparameter tuning, and potential A/B testing pitfalls, is available for download. A distilled slide deck that complements the report is also available.

Why Model Evaluation Matters

Data scientists make models. Oftentimes, we’ll hear data scientists discuss how they are responsible for building a model as a product, or for making a slew of models that build on each other and impact business strategy. One aspect of machine learning model development that is both fundamental and challenging is evaluating a model’s performance. Unlike statistical models, which assume that the distribution of data will remain the same, the distribution of data in machine learning models may drift over time. Evaluating the model and detecting distribution drift enables people to identify when the machine learning model needs retraining. In Alice Zheng’s “Evaluating Machine Learning Models” report, Zheng advocates for considering model evaluation at the start of any project, as it will help answer questions like “how can I measure success for this project?” and avoid “working on ill-formulated projects where good measurement is vague or infeasible”.

Evaluation Metrics for Supervised Learning Models

Zheng indicates that “there are multiple stages in developing a machine learning model… and it follows that there are multiple places where one needs to evaluate the model”. Zheng advocates for considering model evaluation during the prototyping stage, or when “we try out different models to find the best one (model selection)”. Zheng also points out that “evaluation metrics are tied to machine learning tasks” and that “there are different metrics for the tasks”. The report covers evaluation metrics for several supervised learning tasks, including classification, regression, and ranking. Zheng also mentions two packages to consider: R’s metrics package and scikit-learn’s model evaluation module.


Regarding classification, Zheng notes that the most popular metrics for measuring classification performance include accuracy, confusion matrix, log-loss, and AUC (area under the curve). Accuracy “measures how often the classifier makes the correct predictions”, as it is “the ratio between the number of correct predictions and the total number of predictions (the number of data points in the test set)”, while a confusion matrix “(or confusion table) shows a more detailed breakdown of correct and incorrect classifications for each class.” Zheng notes that a confusion matrix is useful for understanding the distinction between classes when “the cost of misclassification might differ for the two classes, or one might have a lot more test data of one class than the other.” For example, the consequences of a false positive and a false negative in a cancer diagnosis are different.
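To make the difference between accuracy and a confusion matrix concrete, here is a minimal sketch in plain Python. It is not taken from Zheng’s report: the function names `accuracy` and `confusion_counts` and the small imbalanced test set are assumptions made up for illustration.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Ratio of correct predictions to the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count each (true label, predicted label) pair."""
    return Counter(zip(y_true, y_pred))


# Hypothetical, imbalanced test set: 1 = positive class, 0 = negative class.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy(y_true, y_pred)
cm = confusion_counts(y_true, y_pred)
true_pos, false_neg = cm[(1, 1)], cm[(1, 0)]
false_pos, true_neg = cm[(0, 1)], cm[(0, 0)]

print(f"accuracy = {acc:.2f}")
print(f"TP={true_pos} FN={false_neg} FP={false_pos} TN={true_neg}")
```

In this made-up example the overall accuracy looks respectable, yet the breakdown shows that only one of the two positive cases was found, which is the kind of per-class detail a single accuracy number hides.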

As for log-loss (logarithmic loss), Zheng notes that “if the raw output of the classifier is a numeric probability instead of a class label of 0 or 1, then log-loss can be used. The probability can be understood as a gauge of confidence…” and that log-loss “is a ‘soft’ measurement of accuracy that incorporates this idea of probabilistic confidence.” As for AUC, Zheng describes it as “one way to summarize the ROC curve into a single number, so that it can be compared easily and automatically.” The ROC curve itself is a whole curve and “provides nuanced details of the classifier”. For even more explanations on AUC and ROC, Zheng recommends this tutorial.
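Both metrics can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the report’s code: the helper names, the `eps` clipping value, and the four-point dataset are assumptions. The AUC here is computed through its well-known pairwise interpretation (the fraction of positive/negative pairs in which the positive example gets the higher score, with ties counted as half) instead of by tracing the ROC curve.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood of the true 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [1, 0, 1, 0]
p_pred = [0.9, 0.2, 0.4, 0.6]

ll = log_loss(y_true, p_pred)
area = auc(y_true, p_pred)
print(f"log-loss = {ll:.4f}, AUC = {area:.2f}")
```

Note how log-loss penalizes confident mistakes: a prediction of 0.01 for a true positive costs far more than a prediction of 0.4 for the same example.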


Zheng notes that “one of the primary ranking metrics, precision-recall, is also popular for classification tasks”. Although precision and recall are two separate metrics, they are commonly used together. Zheng indicates that “mathematically, precision and recall can be defined as the following:

  • precision = # happy correct answers/# total items returned by ranker
  • recall = # happy correct answers/ # total relevant items.”
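The two definitions above can be sketched as follows. The function name `precision_recall_at_k`, the cutoff `k`, and the toy item lists are assumptions for illustration; here “items returned by ranker” is taken to mean the top `k` items of the ranked list.

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """Precision and recall over the top-k items returned by a ranker."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall


# Hypothetical ranker output (best first) and the truly relevant items.
ranked = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "f"}

precision, recall = precision_recall_at_k(ranked, relevant, k=4)
print(f"precision@4 = {precision:.2f}, recall@4 = {recall:.2f}")
```

Two of the four returned items are relevant, and two of the three relevant items were returned, so the two numbers answer different questions about the same ranked list.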

Zheng adds that “in an underlying implementation, the classifier may assign a numeric score to each item instead of a categorical class label, and the ranker may simply order the items by the raw score.” Zheng also notes that personalized recommendation is another task that can be framed either as a ranking problem or as a regression problem: “the recommender might act either as a ranker or a score predictor. In the first case, the output is a ranked list of items for each user. In the case of score prediction, the recommender needs to return a predicted score for each user-item pair—this is an example of a regression model”.


With regression, Zheng indicates in the report that “in a regression task, the model learns to predict numeric scores.” As noted earlier, personalized recommendation is an example, in which we “try to predict a user’s rating for an item”. Zheng also notes that one of “the most commonly used metrics for regression tasks is RMSE (root-mean-square-error)”, which is also known as RMSD (root-mean-square-deviation). Yet, Zheng cautions that while RMSE is commonly used, it has some challenges. RMSE is particularly “sensitive to large outliers. If the regressor performs really badly on a single data point, the average error could be very big”; in other words, “the mean is not robust (to large outliers)”. Zheng notes that there will always be “outliers” with real data and “the model will probably not perform very well on them. So it’s important to look at robust estimators of performance that aren’t affected by large outliers.” Zheng mentions that looking at the median absolute percentage is useful because it “gives us a relative measure of the typical error.”
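The contrast between RMSE and a median-based measure is easy to see on a small made-up example. The sketch below is an illustration under stated assumptions, not code from the report: the function names and the data are hypothetical, and the median absolute percentage is computed as the median of |(actual - predicted) / actual|, which requires nonzero actual values.

```python
import math
import statistics


def rmse(y_true, y_pred):
    """Root-mean-square error: square root of the mean squared error."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def median_abs_pct_error(y_true, y_pred):
    """Median of |(actual - predicted) / actual|; actual values must be nonzero."""
    return statistics.median(abs((t - p) / t) for t, p in zip(y_true, y_pred))


# Hypothetical targets and two sets of predictions that differ in one point.
y_true = [10.0, 20.0, 30.0, 40.0, 50.0]
y_good = [11.0, 19.0, 33.0, 38.0, 55.0]
y_outlier = [11.0, 19.0, 33.0, 38.0, 150.0]  # one very bad prediction

rmse_good, rmse_outlier = rmse(y_true, y_good), rmse(y_true, y_outlier)
mape_good = median_abs_pct_error(y_true, y_good)
mape_outlier = median_abs_pct_error(y_true, y_outlier)

print(f"RMSE: {rmse_good:.2f} -> {rmse_outlier:.2f}")
print(f"median abs. pct. error: {mape_good:.2f} -> {mape_outlier:.2f}")
```

A single bad prediction inflates the RMSE by more than an order of magnitude, while the median absolute percentage stays the same in this example.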

Offline Evaluation Mechanisms

Zheng advocates in the report that

“the model must be evaluated on a dataset that’s statistically independent from the one it was trained on. Why? Because its performance on the training set is an overly optimistic estimate of its true performance on new data. The process of training the model has already adapted to the training data. A more fair evaluation would measure the model’s performance on data that it hasn’t yet seen. In statistical terms, this gives an estimate of the generalization error, which measures how well the model generalizes to new data.”

Zheng also indicates that researchers can use hold-out validation as a way to generate the new data. In hold-out validation, “assuming that all data points are i.i.d. (independently and identically distributed), we simply randomly hold out part of the data for validation. We train the model on the larger portion of the data and evaluate validation metrics on the smaller hold-out set.” Zheng also points out that resampling techniques such as bootstrapping or cross-validation may be used when a mechanism for generating additional datasets is needed. Bootstrapping “generates multiple datasets by sampling from a single, original dataset. Each of the “new” datasets can be used to estimate a quantity of interest. Since there are multiple datasets and therefore multiple estimates, one can also calculate things like the variance or a confidence interval for the estimate.” Cross-validation, Zheng notes, is “useful when the training dataset is so small that one can’t afford to hold out part of the data just for validation purposes.” While there are many variants of cross-validation, one of the most commonly used is k-fold cross-validation, which

“divides the training dataset into k folds… each of the k folds takes turns being the hold-out validation set; a model is trained on the rest of the k-1 folds and measured on the held-out fold. The overall performance is taken to be the average of the performance on all k folds. Repeat this procedure for all of the hyperparameter settings that need to be evaluated, then pick the hyperparameters that resulted in the highest k-fold average.”
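The k-fold procedure quoted above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the report’s or scikit-learn’s implementation: `k_fold_indices`, `cross_validate`, the toy mean-predicting “model”, and the negative mean squared error score (chosen so that higher is better) are all hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, score, seed=0):
    """Average the validation score over k folds, each held out once."""
    scores = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(
            score(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    return sum(scores) / k


def fit_mean(xs, ys):
    """Toy model: always predict the mean of the training targets."""
    return sum(ys) / len(ys)


def neg_mse(model, xs, ys):
    """Negative mean squared error, so that a higher score is better."""
    return -sum((y - model) ** 2 for y in ys) / len(ys)


# Hypothetical dataset of 20 points.
xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]

folds = k_fold_indices(len(xs), k=5)
cv_score = cross_validate(xs, ys, 5, fit_mean, neg_mse)
print(f"5-fold average score = {cv_score:.2f}")
```

To compare hyperparameter settings, one would call `cross_validate` once per setting with the same folds and keep the setting with the highest average.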

Zheng also points out that the scikit-learn cross-validation module may be useful.
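For completeness, hold-out validation and bootstrapping, both described earlier in this section, can be sketched the same way. The names `hold_out_split` and `bootstrap_estimates`, the 20% validation fraction, the 1,000 resamples, and the per-example error values are assumptions for illustration; the interval shown is a rough percentile interval, one of several ways to form a bootstrap confidence interval.

```python
import random
import statistics


def hold_out_split(data, validation_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the data; returns (train, validation)."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def bootstrap_estimates(data, statistic, n_resamples=1000, seed=0):
    """Apply `statistic` to datasets resampled from `data` with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [
        statistic([rng.choice(data) for _ in range(n)])
        for _ in range(n_resamples)
    ]


# Hypothetical per-example errors of some model.
errors = [0.5, 1.2, 0.3, 2.1, 0.9, 1.1, 0.7, 1.8, 0.4, 1.0]

train, validation = hold_out_split(errors, validation_fraction=0.2)

estimates = sorted(bootstrap_estimates(errors, statistics.mean))
lower, upper = estimates[25], estimates[974]  # rough 95% percentile interval
variance = statistics.variance(estimates)

print(f"train size = {len(train)}, validation size = {len(validation)}")
print(f"mean error interval ~ [{lower:.2f}, {upper:.2f}]")
```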


As data scientists spend so much time on making models, considering evaluation metrics early on may help them accelerate their work and set up their projects for success. Yet, evaluating machine learning models is a known challenge. This Domino Data Science Field Note provides a few insights excerpted from Zheng’s report. The full in-depth report is available for download. If you are interested in additional information on model development or model management, then download Domino’s paper.

Domino Data Science Field Notes provide highlights of data science research, trends, techniques, and more, that help data scientists and data science leaders accelerate their work or careers. If you are interested in your data science work being covered in this blog series, please send us an email at writeforus(at)dominodatalab(dot)com.