A Strategy on Selecting Performance Metrics for Classifier Evaluation

The evaluation of classifier performance plays a critical role in the construction and selection of classification models. Although many performance metrics have been proposed in the machine learning community, no general guidelines are available to help practitioners decide which metric to use when evaluating a classifier. In this paper, we attempt to provide practitioners with a strategy for selecting performance metrics for classifier evaluation. First, we investigate seven widely used performance metrics: classification accuracy, F-measure, kappa statistic, root mean square error, mean absolute error, the area under the receiver operating characteristic curve, and the area under the precision-recall curve. Second, we use Pearson linear correlation and Spearman rank correlation to analyze the relationships among these seven metrics. Experimental results show that these commonly used metrics can be divided into three groups: metrics within a group are highly correlated with one another, but only weakly correlated with metrics from other groups.
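To make the strategy concrete, the following is a minimal Python/scikit-learn sketch, not the authors' original implementation: the choice of classifiers, the synthetic dataset, and the use of predicted probabilities when computing RMSE and MAE are illustrative assumptions. It scores a handful of classifiers with the seven metrics and then reports the Pearson and Spearman correlation of each metric pair across those classifiers.

```python
# Minimal sketch of the evaluation strategy (illustrative, not the paper's code):
# score several classifiers with the seven metrics, then measure how strongly
# each pair of metrics agrees using Pearson and Spearman correlation.
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, f1_score, cohen_kappa_score,
                             mean_squared_error, mean_absolute_error,
                             roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=2000, n_features=20, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(random_state=0),
}

metric_names = ["ACC", "F1", "Kappa", "RMSE", "MAE", "AUC-ROC", "AUC-PR"]
scores = []  # one row of seven metric values per classifier
for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    y_pred = clf.predict(X_te)
    y_prob = clf.predict_proba(X_te)[:, 1]  # scores for the positive class
    scores.append([
        accuracy_score(y_te, y_pred),
        f1_score(y_te, y_pred),
        cohen_kappa_score(y_te, y_pred),
        np.sqrt(mean_squared_error(y_te, y_prob)),  # RMSE on predicted probabilities
        mean_absolute_error(y_te, y_prob),          # MAE on predicted probabilities
        roc_auc_score(y_te, y_prob),
        average_precision_score(y_te, y_prob),      # area under the precision-recall curve
    ])

scores = np.array(scores)

# Pairwise correlation between metrics across the evaluated classifiers.
for i in range(len(metric_names)):
    for j in range(i + 1, len(metric_names)):
        r, _ = pearsonr(scores[:, i], scores[:, j])
        rho, _ = spearmanr(scores[:, i], scores[:, j])
        print(f"{metric_names[i]:7s} vs {metric_names[j]:7s}  "
              f"Pearson={r:+.2f}  Spearman={rho:+.2f}")
```

In the paper's setting the correlations would be computed over many classifier-dataset combinations rather than the four classifiers on one synthetic dataset used here; grouping metric pairs by high correlation is what yields the three groups reported in the abstract.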
