Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation

Commonly used evaluation measures, including Recall, Precision, F-Factor and Rand Accuracy, are biased and should not be used without a clear understanding of those biases and a corresponding identification of the chance or base-case level of the statistic. Under any of these commonly used measures, a system that performs worse in the objective sense of Informedness can appear to perform better. We discuss several concepts and measures that reflect the probability that a prediction is informed versus chance, including Informedness, and introduce Markedness as a dual measure of the probability that a prediction is marked versus chance. Finally, we demonstrate elegant connections between the concepts of Informedness, Markedness, Correlation and Significance, as well as their intuitive relationships with Recall and Precision, and outline the extension from the dichotomous case to the general multi-class case.
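To make the claim about bias concrete, the following is a minimal sketch for the dichotomous case only. The abstract gives no formulas, so the definitions used here are assumptions taken from common usage: Informedness as Recall + Inverse Recall - 1, Markedness as Precision + Inverse Precision - 1, and Correlation as the signed geometric mean of the two (the Matthews correlation). The function name `dichotomous_measures` and both contingency tables are hypothetical, chosen purely for illustration.

```python
import math


def dichotomous_measures(tp, fn, fp, tn):
    """Return common and chance-corrected measures for a 2x2 contingency table.

    Assumed definitions (not stated in the abstract):
      informedness = recall + inverse recall - 1
      markedness   = precision + inverse precision - 1
      correlation  = signed geometric mean of informedness and markedness
    """
    real_pos, real_neg = tp + fn, fp + tn
    pred_pos, pred_neg = tp + fp, fn + tn
    if 0 in (real_pos, real_neg, pred_pos, pred_neg):
        raise ValueError("every row and column total must be non-zero")

    recall = tp / real_pos
    inverse_recall = tn / real_neg
    precision = tp / pred_pos
    inverse_precision = tn / pred_neg

    informedness = recall + inverse_recall - 1
    markedness = precision + inverse_precision - 1
    # Both quantities share the sign of tp*tn - fp*fn, so the product is >= 0.
    correlation = math.copysign(math.sqrt(abs(informedness * markedness)),
                                informedness)
    return {
        "recall": recall,
        "precision": precision,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / (real_pos + real_neg),
        "informedness": informedness,
        "markedness": markedness,
        "correlation": correlation,
    }


# Hypothetical data: 90 real positives and 10 real negatives.
# System A labels almost everything positive; system B discriminates more.
system_a = dichotomous_measures(tp=89, fn=1, fp=9, tn=1)
system_b = dichotomous_measures(tp=70, fn=20, fp=2, tn=8)

for name, result in (("A", system_a), ("B", system_b)):
    summary = ", ".join(f"{key}={value:.3f}" for key, value in result.items())
    print(f"System {name}: {summary}")
```

On these made-up counts, system A scores higher on Recall, F1 and accuracy, while system B scores higher on Informedness. This is the kind of reversal the abstract warns about when the class distribution is skewed.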
