The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation

Background: To evaluate binary classifications and their confusion matrices, researchers can employ several statistical rates, according to the goal of the experiment they are investigating. Although this is a crucial issue in machine learning, no widespread consensus has yet been reached on a single measure of choice. Accuracy and F1 score computed on confusion matrices have been (and still are) among the most popular metrics in binary classification tasks. However, these measures can show dangerously overoptimistic, inflated results, especially on imbalanced datasets.

Results: The Matthews correlation coefficient (MCC), instead, is a more reliable statistical rate that produces a high score only if the prediction obtained good results in all four confusion matrix categories (true positives, false negatives, true negatives, and false positives), proportionally both to the size of positive elements and the size of negative elements in the dataset.

Conclusions: In this article, we show how MCC produces a more informative and truthful score in evaluating binary classifications than accuracy and F1 score, by first explaining its mathematical properties, and then demonstrating the advantages of MCC in six synthetic use cases and in a real genomics scenario. We believe that the Matthews correlation coefficient should be preferred to accuracy and F1 score in evaluating binary classification tasks by all scientific communities.
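
As a rough illustration of the claim above, the following Python sketch computes accuracy, F1 score, and MCC from the four confusion matrix counts using their standard definitions. The function names and the example counts (a positively imbalanced dataset of 91 positives and 9 negatives, with a classifier that labels almost everything as positive) are assumptions chosen for illustration, and returning 0 when a denominator is zero is one common convention rather than the only possible one.

```python
import math


def accuracy(tp, fn, tn, fp):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fn + tn + fp)


def f1_score(tp, fn, tn, fp):
    """Harmonic mean of precision and recall; note that tn is never used."""
    denom = 2 * tp + fp + fn
    # Assumption: return 0.0 when the score is undefined (no positives at all).
    return 2 * tp / denom if denom else 0.0


def mcc(tp, fn, tn, fp):
    """Matthews correlation coefficient, ranging from -1 to +1."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: return 0.0 when any row or column of the matrix is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical imbalanced case: 91 positives, 9 negatives,
# and a classifier that predicts "positive" for nearly every sample.
counts = dict(tp=90, fn=1, tn=0, fp=9)

print(f"accuracy = {accuracy(**counts):.2f}")
print(f"F1 score = {f1_score(**counts):.2f}")
print(f"MCC      = {mcc(**counts):.2f}")
```

With these counts, accuracy and F1 score both come out at 0.90 or above, while MCC is slightly below zero, because none of the nine negative samples was recognized correctly.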
