Paper information - A systematic analysis of performance measures for classification tasks

A systematic analysis of performance measures for classification tasks

This paper presents a systematic analysis of twenty-four performance measures used across the complete spectrum of Machine Learning classification tasks, i.e., binary, multi-class, multi-labelled, and hierarchical. For each classification task, the study relates a set of changes in a confusion matrix to specific characteristics of the data. The analysis then concentrates on the types of changes to a confusion matrix that do not change a measure and therefore preserve a classifier's evaluation (measure invariance). The result is a measure invariance taxonomy with respect to all relevant label distribution changes in a classification problem. This formal analysis is supported by examples of applications where the invariance properties of measures lead to a more reliable evaluation of classifiers. Text classification supplements the discussion with several case studies.
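To make the idea of measure invariance concrete, the following minimal Python sketch checks whether a few standard binary measures change when only the true-negative count of a confusion matrix changes. It is an illustration under stated assumptions, not the paper's formal taxonomy: the class `ConfusionMatrix`, the helper `is_invariant`, and all counts are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix (hypothetical helper for this sketch)."""
    tp: int  # true positives
    fn: int  # false negatives
    fp: int  # false positives
    tn: int  # true negatives


def accuracy(cm):
    return (cm.tp + cm.tn) / (cm.tp + cm.fn + cm.fp + cm.tn)


def precision(cm):
    return cm.tp / (cm.tp + cm.fp)


def recall(cm):
    return cm.tp / (cm.tp + cm.fn)


def f1(cm):
    p, r = precision(cm), recall(cm)
    return 2 * p * r / (p + r)


def is_invariant(measure, before, after, tol=1e-12):
    """True if the measure has the same value on both matrices."""
    return abs(measure(before) - measure(after)) <= tol


# Made-up counts: the second matrix differs only in the number of
# true negatives, e.g. many more easy negative examples were added.
base = ConfusionMatrix(tp=40, fn=10, fp=20, tn=30)
more_tn = replace(base, tn=300)

for measure in (accuracy, precision, recall, f1):
    print(measure.__name__, is_invariant(measure, base, more_tn))
```

In this sketch, precision, recall, and F1 do not use the true-negative count at all, so they are unaffected by the change, whereas accuracy shifts with it. Which behaviour is preferable depends on the application, which is the kind of question the invariance analysis is meant to inform.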

Guy Lapalme | Marina Sokolova
