A Unified View of Performance Metrics: Translating Threshold Choice into Expected Classification Loss
Cèsar Ferri

Many performance metrics have been introduced in the literature for the evaluation of classification performance, each of them with different origins and areas of application. These metrics include accuracy, unweighted accuracy, the area under the ROC curve or the ROC convex hull, the mean absolute error, and the Brier score or mean squared error (with its decomposition into refinement and calibration). One way of understanding the relations among these metrics is by means of variable operating conditions (in the form of misclassification costs and/or class distributions). Thus, a metric may correspond to some expected loss over a range of operating conditions. One dimension of this analysis is the distribution assumed over that range of operating conditions, which has led to important connections in the area of proper scoring rules. We demonstrate in this paper that there is an equally important dimension which has so far received much less attention in the analysis of performance metrics. This dimension is given by the decision rule, which is typically implemented as a threshold choice method when using scoring models. In this paper, we explore many old and new threshold choice methods: fixed, score-uniform, score-driven, rate-driven and optimal, among others. By calculating the expected loss obtained with these threshold choice methods for a uniform range of operating conditions, we give clear interpretations of the 0-1 loss, the absolute error, the Brier score, the AUC and the refinement loss, respectively. Our analysis provides a comprehensive view of performance metrics as well as a systematic approach to loss minimisation, which can be summarised as follows: given a model, apply the threshold choice method that corresponds to the available information about the operating condition, and compare the resulting expected losses. In order to assist in this procedure we also derive several connections between the aforementioned performance metrics, and we highlight the role of calibration in choosing the threshold choice method.
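As a concrete illustration of the procedure summarised above (given a model, pick the threshold choice method that matches what is known about the operating condition, and compare expected losses), the sketch below estimates the expected loss of three threshold choice methods over a uniform range of operating conditions. The parameterisation is an assumption made here for illustration, not the paper's exact notation: the operating condition is a cost proportion c in [0, 1], false positives cost 2c, false negatives cost 2(1 - c), and an example is predicted positive when its score reaches the threshold. Under this convention the score-driven expected loss coincides with the Brier score, and the paper relates the rate-driven expected loss linearly to AUC. The function names and toy data are hypothetical.

```python
import numpy as np

def expected_loss(scores, labels, threshold_for, n_grid=2001):
    """Grid estimate of expected loss over cost proportions c ~ U[0, 1].

    threshold_for(c, scores) is a threshold choice method: it maps the
    operating condition c to a decision threshold.  At condition c, false
    positives cost 2c and false negatives cost 2(1 - c), and an example is
    predicted positive when its score is >= the threshold (an assumed
    parameterisation chosen for illustration).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos, neg = scores[labels == 1], scores[labels == 0]
    pi1, pi0 = len(pos) / len(scores), len(neg) / len(scores)
    losses = []
    for c in np.linspace(0.0, 1.0, n_grid):
        t = threshold_for(c, scores)
        fnr = np.mean(pos < t)    # positives predicted negative
        fpr = np.mean(neg >= t)   # negatives predicted positive
        losses.append(2 * (c * pi0 * fpr + (1 - c) * pi1 * fnr))
    return float(np.mean(losses))

# Three of the threshold choice methods discussed in the paper (sketched):
fixed_half   = lambda c, s: 0.5                      # fixed: ignore the operating condition
score_driven = lambda c, s: c                        # score-driven: threshold equals the cost proportion
rate_driven  = lambda c, s: np.quantile(s, 1.0 - c)  # rate-driven: predict a fraction c as positive

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    y = rng.integers(0, 2, size=5000)                                # toy labels
    s = np.clip(0.2 + 0.5 * y + 0.3 * rng.standard_normal(5000), 0.0, 1.0)  # toy probability estimates

    print("fixed(0.5)   :", round(expected_loss(s, y, fixed_half), 4))
    print("score-driven :", round(expected_loss(s, y, score_driven), 4),
          "(Brier score:", round(float(np.mean((s - y) ** 2)), 4), ")")
    print("rate-driven  :", round(expected_loss(s, y, rate_driven), 4))
```

Running the script, the score-driven line should agree with the reported Brier score up to the grid resolution; comparing the three printed losses for a given model is the kind of side-by-side evaluation of threshold choice methods that the abstract advocates.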
