Assessing the Performance of Classification Methods

Summary A large number of measures have been developed for evaluating the performance of classification rules. Some of these have been devised to meet the practical requirements of specific applications, but many others—which here we call "classification accuracy" criteria—represent different ways of balancing the different kinds of misclassification that may be made. This paper reviews classification accuracy criteria. However, the literature is now so large and diverse that a comprehensive list, covering all the measures and their variants, would probably be impossible. Instead, this paper embeds such measures in a general framework, spanning the possibilities, and draws attention to relationships between them. Important points to note are, firstly, that different performance measures, by definition, measure different aspects of performance; secondly, that one should therefore carefully choose a measure to match the objectives of one's study; and, thirdly, that empirical comparisons between instruments measuring different aspects are of limited value.
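
To make the first point concrete, the following minimal sketch (in Python, with hypothetical confusion-matrix counts) computes several standard measures for two notional classifiers. Raw accuracy and Youden's J index rank the two classifiers in opposite orders, because the measures weight the two kinds of misclassification differently. The counts and classifier labels are illustrative assumptions, not data from the paper.

```python
# Minimal sketch: different performance measures can rank the same
# pair of classifiers differently. All counts below are hypothetical.

def measures(tp, fp, tn, fn):
    """Return common confusion-matrix-based performance measures."""
    sens = tp / (tp + fn)                   # sensitivity (true positive rate)
    spec = tn / (tn + fp)                   # specificity (true negative rate)
    acc = (tp + tn) / (tp + fp + tn + fn)   # raw classification accuracy
    youden = sens + spec - 1                # Youden's J index
    return {"accuracy": acc, "sensitivity": sens,
            "specificity": spec, "youden_J": youden}

# Hypothetical test set: 50 positives, 950 negatives.
clf_a = measures(tp=5, fp=10, tn=940, fn=45)    # rarely predicts positive
clf_b = measures(tp=40, fp=100, tn=850, fn=10)  # predicts positive freely

print(clf_a)  # accuracy 0.945, Youden's J ~0.09
print(clf_b)  # accuracy 0.890, Youden's J ~0.69
# Accuracy favours classifier A; Youden's J favours classifier B.
```

Which ranking is "right" depends on the relative seriousness of the two error types, which is precisely the choice-of-measure question the paper addresses.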
