The Prediction Advantage: A Universally Meaningful Performance Measure for Classification and Regression

We introduce the Prediction Advantage (PA), a novel performance measure for prediction functions under any loss function, in both classification and regression. The PA is defined as the performance advantage relative to the Bayesian risk of a predictor restricted to knowing only the marginal distribution of the labels. We derive the PA for well-known loss functions, including the 0/1, cross-entropy, absolute, and squared losses. In the latter case, the PA coincides with the well-known R-squared measure widely used in statistics. Using the PA ensures a meaningful quantification of prediction performance, which is not guaranteed, for example, when dealing with noisy, imbalanced classification problems. We argue that among several known alternative performance measures, the PA is the only quantity that remains meaningful at all noise and imbalance levels.
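To make the definition concrete, here is a minimal worked sketch in illustrative notation (the symbols R_ℓ, f, and f_0 are shorthand introduced here, not taken from the paper): the PA compares the risk of a predictor f with the risk of the best predictor f_0 that sees only the label distribution, and under squared loss this comparison reproduces R-squared.

% Sketch of the Prediction Advantage in illustrative notation:
% R_ell(f) denotes the risk of predictor f under loss ell, and f_0 is the
% best "label-only" predictor, i.e., the Bayes predictor given only the
% marginal distribution of Y (a constant prediction).
\[
  \mathrm{PA}_{\ell}(f) \;=\; 1 \;-\; \frac{R_{\ell}(f)}{R_{\ell}(f_0)},
  \qquad
  f_0 \;\in\; \operatorname*{arg\,min}_{c}\; \mathbb{E}\big[\ell(c, Y)\big].
\]
% Under squared loss, f_0 = E[Y], so R_ell(f_0) = Var(Y) and the PA reduces
% to the familiar R-squared; under 0/1 loss, f_0 predicts the majority class,
% so R_ell(f_0) = 1 - max_y P(Y = y).
\[
  \mathrm{PA}_{\mathrm{sq}}(f)
  \;=\; 1 \;-\; \frac{\mathbb{E}\big[(f(X)-Y)^2\big]}{\mathrm{Var}(Y)}
  \;=\; R^2.
\]

Read this way, PA = 0 means the predictor does no better than guessing from the label distribution alone, and PA = 1 means zero risk, which is what makes the measure comparable across noise and imbalance levels.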
