Statistical Comparisons of Classifiers over Multiple Data Sets

While methods for comparing two learning algorithms on a single data set have been scrutinized for quite some time, the issue of statistical tests for comparing multiple algorithms over multiple data sets, which is even more essential to typical machine learning studies, has been all but ignored. This article reviews the current practice and then theoretically and empirically examines several suitable tests. Based on that, we recommend a set of simple, yet safe and robust non-parametric tests for statistical comparisons of classifiers: the Wilcoxon signed-ranks test for comparing two classifiers, and the Friedman test with the corresponding post-hoc tests for comparing multiple classifiers over multiple data sets. Results of the latter can also be neatly presented with the newly introduced critical difference (CD) diagrams.
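The recommended procedure can be sketched with standard statistical routines. The following is a minimal Python sketch, assuming SciPy is available and using purely illustrative accuracy scores (not results from the paper): the Wilcoxon signed-ranks test for two classifiers, the Friedman test for three classifiers evaluated on the same data sets, and the Nemenyi critical difference over average ranks.

```python
# Minimal sketch of the recommended tests using SciPy.
# All accuracy values below are hypothetical, for illustration only.
import numpy as np
from scipy import stats

# Hypothetical accuracies of classifiers A, B, C on the same ten data sets.
acc_a = np.array([0.81, 0.76, 0.92, 0.88, 0.70, 0.83, 0.79, 0.90, 0.65, 0.84])
acc_b = np.array([0.79, 0.74, 0.93, 0.85, 0.68, 0.80, 0.81, 0.88, 0.62, 0.82])
acc_c = np.array([0.75, 0.71, 0.89, 0.84, 0.66, 0.78, 0.77, 0.85, 0.60, 0.79])

# Two classifiers: Wilcoxon signed-ranks test on the paired per-data-set scores.
w_stat, w_p = stats.wilcoxon(acc_a, acc_b)
print(f"Wilcoxon: statistic={w_stat:.2f}, p={w_p:.3f}")

# More than two classifiers: Friedman test on the per-data-set measurements.
f_stat, f_p = stats.friedmanchisquare(acc_a, acc_b, acc_c)
print(f"Friedman: chi2={f_stat:.2f}, p={f_p:.3f}")

# If the Friedman test rejects the null hypothesis, a post-hoc test
# (e.g. Nemenyi) compares average ranks; two classifiers differ when their
# average ranks differ by at least CD = q_alpha * sqrt(k*(k+1) / (6*N))
# for k classifiers and N data sets.
k, n = 3, len(acc_a)
q_alpha = 2.343  # critical value for k = 3 at alpha = 0.05 (Studentized range / sqrt(2))
cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
print(f"Nemenyi critical difference in average ranks: {cd:.3f}")
```

Classifiers whose average ranks differ by less than the critical difference would be connected in a CD diagram, indicating that the data do not support a significant difference between them.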
