An Analysis of Performance Measures for Binary Classifiers

Given two binary classifiers and a set of test data, it should be straightforward to determine which of the two is superior. Recent work, however, has called into question many of the methods heretofore accepted as standard for this task. In this paper, we analyze seven ways of determining whether one classifier is better than another on the same test data. Five of these are long established and two are relative newcomers. We review and extend work showing that one of these methods is clearly inappropriate, and then conduct an empirical analysis on a large number of datasets to evaluate the real-world implications of our theoretical analysis. Both our empirical and theoretical results converge strongly on one of the newer methods.
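To make the comparison problem concrete, the sketch below evaluates two classifiers on the same test data under several of the standard measures discussed here (accuracy, area under the ROC curve, and Cohen's kappa); such measures can disagree about which classifier is "better", which is precisely the difficulty this paper examines. This is a minimal illustration, not the paper's methodology: it assumes scikit-learn, and the synthetic dataset and choice of models are placeholders.

```python
# Illustrative only: two classifiers, one test set, three standard measures.
# Assumes scikit-learn; dataset and models are hypothetical placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, roc_auc_score, cohen_kappa_score

# A synthetic binary classification problem with a held-out test set.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, clf in [("logistic", LogisticRegression(max_iter=1000)),
                  ("tree", DecisionTreeClassifier(random_state=0))]:
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)               # hard labels for accuracy/kappa
    y_score = clf.predict_proba(X_test)[:, 1]  # scores for threshold-free AUC
    print(name,
          "accuracy=%.3f" % accuracy_score(y_test, y_pred),
          "AUC=%.3f" % roc_auc_score(y_test, y_score),
          "kappa=%.3f" % cohen_kappa_score(y_test, y_pred))
```

Note that accuracy and kappa depend on a fixed decision threshold while AUC does not, so the measures can rank the same pair of classifiers differently on the same test data.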
