Recent Trends in the Use of Statistical Tests for Comparing Swarm and Evolutionary Computing Algorithms: Practical Guidelines and a Critical Review

A key aspect of the design of evolutionary and swarm intelligence algorithms is the study of their performance, and statistical comparisons are a crucial part of this study, since they allow reliable conclusions to be drawn. In this paper, we gather and examine the approaches taken from different perspectives in order to summarise the assumptions made by these statistical tests, the conclusions they support, and the steps required to perform them correctly. We survey current trends in the statistical analyses proposed for comparing computational intelligence algorithms and describe the statistical background of these tests. We illustrate the use of the most common tests in the context of the Competition on Single-Objective Real-Parameter Optimisation of the IEEE Congress on Evolutionary Computation (CEC) 2017, describe the main advantages and drawbacks of each kind of test, and put forward some recommendations concerning their use.
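
As a minimal illustration of the kind of nonparametric comparison discussed above (a sketch, not the paper's own code), the following Python snippet applies the Friedman omnibus test followed by pairwise Wilcoxon signed-rank tests with a Holm correction. The algorithm names are taken from CEC 2017 competition entrants, but the error matrix is randomly generated purely for the example.

```python
# Minimal sketch (hypothetical data): comparing three optimisers over a set of
# benchmark functions with the Friedman test, then post-hoc pairwise Wilcoxon
# signed-rank tests with Holm's step-down correction.
import itertools

import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)

# Rows: benchmark functions; columns: algorithms. Lower error is better.
# These scores are invented purely for illustration.
algorithms = ["jSO", "LSHADE-SPACMA", "EBOwithCMAR"]
errors = rng.lognormal(mean=0.0, sigma=1.0, size=(30, 3))

# Omnibus test: do the algorithms differ over the benchmark suite?
stat, p_value = friedmanchisquare(*errors.T)
print(f"Friedman test: statistic={stat:.3f}, p={p_value:.4f}")

# Post-hoc pairwise comparisons. Samples are paired because the same
# benchmark functions are used for every algorithm.
pairs = list(itertools.combinations(range(len(algorithms)), 2))
raw_p = [wilcoxon(errors[:, i], errors[:, j]).pvalue for i, j in pairs]

# Holm's correction: sort p-values ascending, scale the k-th smallest by
# (m - k), and enforce monotonicity of the adjusted values.
m = len(raw_p)
adjusted = {}
running_max = 0.0
for rank, idx in enumerate(np.argsort(raw_p)):
    p_adj = min(1.0, (m - rank) * raw_p[idx])
    running_max = max(running_max, p_adj)
    adjusted[pairs[idx]] = running_max

for (i, j), p_adj in adjusted.items():
    print(f"{algorithms[i]} vs {algorithms[j]}: Holm-adjusted p={p_adj:.4f}")
```

With real experimental data, the error matrix would hold the recorded final errors of each algorithm on each benchmark function; the rest of the procedure is unchanged.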
