A systematic analysis of performance measures for classification tasks

This paper presents a systematic analysis of twenty-four performance measures used across the complete spectrum of Machine Learning classification tasks: binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of the data. The analysis then concentrates on the types of changes to a confusion matrix that leave a measure unchanged and therefore preserve a classifier's evaluation (measure invariance). The result is a taxonomy of measure invariance with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications in which the invariance properties of measures lead to a more reliable evaluation of classifiers. Case studies from text classification supplement the discussion.
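The notion of measure invariance can be illustrated with a minimal sketch for the binary case. The example below assumes the standard definitions of precision, recall, and accuracy over a confusion matrix (tp, fp, fn, tn). It also assumes one illustrative change, adding true negatives, which is not necessarily a transformation the paper itself uses. The helper names `is_invariant` and `add_true_negatives` are hypothetical. Checking a single matrix does not prove invariance; it only shows how a change can leave one measure fixed while moving another.

```python
from fractions import Fraction

# Standard binary measures over a confusion matrix (tp, fp, fn, tn).
# Fractions are used so that equality checks are exact.
def precision(tp, fp, fn, tn):
    return Fraction(tp, tp + fp)

def recall(tp, fp, fn, tn):
    return Fraction(tp, tp + fn)

def accuracy(tp, fp, fn, tn):
    return Fraction(tp + tn, tp + fp + fn + tn)

# Hypothetical transformation: more negatives are correctly rejected,
# e.g. the negative class grows and the classifier handles it well.
def add_true_negatives(tp, fp, fn, tn, extra=50):
    return tp, fp, fn, tn + extra

# Hypothetical helper: does the measure keep its value on this matrix
# after the transformation is applied?
def is_invariant(measure, matrix, transform):
    return measure(*matrix) == measure(*transform(*matrix))

base = (40, 10, 20, 30)  # illustrative counts: tp, fp, fn, tn
results = {
    m.__name__: is_invariant(m, base, add_true_negatives)
    for m in (precision, recall, accuracy)
}
print(results)
```

On this matrix, precision and recall keep their values because neither formula involves true negatives, while accuracy changes. That is why a measure's invariance properties matter when the label distribution of the data shifts.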
