Time for a change: a tutorial for comparing multiple classifiers through Bayesian analysis

The machine learning community adopted null hypothesis significance testing (NHST) to ensure the statistical validity of its results. Many scientific fields, however, have since recognized the shortcomings of frequentist reasoning, and in the most radical cases have even banned its use in publications. We should do the same: just as we have embraced the Bayesian paradigm in developing new machine learning methods, we should also use it to analyze our own results. We argue for abandoning NHST by exposing its fallacies and, more importantly, by offering sounder and more useful alternatives to it.
