Is p-value < 0.05 enough? A study on the statistical evaluation of classifiers

Abstract  Statistical significance analysis, based on hypothesis tests, is a common approach for comparing classifiers. However, many studies oversimplify this analysis by merely checking whether p-value < 0.05, ignoring important concepts such as the effect size and the statistical power of the test. The problem is serious enough that the American Statistical Association has taken a strong stand on the subject, noting that although the p-value is a useful statistical measure, it is widely misused and misinterpreted. This work highlights problems caused by the misuse of hypothesis tests and shows how the effect size and the power of the test can provide important information for better decision-making. To investigate these issues, we perform empirical studies with different classifiers and 50 datasets, using the Student's t-test and the Wilcoxon test to compare classifiers. The results show that an isolated p-value analysis can lead to wrong conclusions, and that evaluating the effect size and the power of the test contributes to more principled decision-making.
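The kind of analysis the abstract describes can be illustrated with a short sketch. The snippet below is not the authors' code: the per-dataset accuracy arrays are synthetic placeholders, and only standard scipy/statsmodels routines are used. It compares two hypothetical classifiers across 50 datasets with a paired Student's t-test and a Wilcoxon signed-rank test, then reports Cohen's d (for paired differences) as the effect size and the post-hoc power of the t-test at alpha = 0.05.

# Minimal sketch (not the authors' code): p-value, effect size, and power
# for a paired comparison of two classifiers over 50 datasets.
# The accuracy values below are synthetic placeholders.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestPower

rng = np.random.default_rng(0)
acc_a = rng.uniform(0.80, 0.95, size=50)          # classifier A, one accuracy per dataset
acc_b = acc_a + rng.normal(0.005, 0.02, size=50)  # classifier B, slightly shifted

diff = acc_b - acc_a

# Paired Student's t-test and Wilcoxon signed-rank test on the same data
t_stat, t_p = stats.ttest_rel(acc_b, acc_a)
w_stat, w_p = stats.wilcoxon(acc_b, acc_a)

# Effect size: Cohen's d for paired samples (mean difference / SD of differences)
cohen_d = diff.mean() / diff.std(ddof=1)

# Post-hoc power of the paired t-test at the usual 0.05 significance level
power = TTestPower().power(effect_size=cohen_d, nobs=len(diff), alpha=0.05)

print(f"t-test p = {t_p:.4f}, Wilcoxon p = {w_p:.4f}")
print(f"Cohen's d = {cohen_d:.3f}, power = {power:.3f}")

Reporting all three quantities together is the point: a p-value below 0.05 paired with a negligible Cohen's d, or with low power, supports a much weaker conclusion than the bare significance check suggests.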
