Sample size calculations for the experimental comparison of multiple algorithms on multiple problem instances

This work presents a statistically principled method for estimating the required number of instances in the experimental comparison of multiple algorithms on a given problem class of interest. The approach generalises earlier results by allowing researchers to design experiments based on the desired best-, worst-, mean- or median-case statistical power to detect differences between algorithms larger than a given threshold. Holm's step-down procedure is used to keep the overall significance level controlled at the desired value, without resulting in overly conservative experiments. The paper also presents an approach for sampling each algorithm on each instance, based on optimal sample size ratios that minimise the total number of runs required subject to a desired accuracy in the estimation of paired differences. A case study comparing 21 variants of a custom-tailored Simulated Annealing algorithm on a class of scheduling problems illustrates the application of the proposed methods for sample size calculations in the experimental comparison of algorithms.
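
The two calculations described above (the number of instances needed for a desired power under Holm's step-down correction, and the ratio of runs per algorithm on each instance that minimises total sampling effort for a target accuracy) can be sketched in a few lines. The Python sketch below is a minimal illustration under simplifying assumptions (normally distributed paired differences and a z-approximation instead of the exact t-based iteration), not the authors' implementation; the function names and the example parameter values are hypothetical.

```python
# Minimal sketch: approximate design calculations for comparing K pairs of
# algorithms with Holm's step-down correction, assuming normally distributed
# paired differences and a z-approximation (an exact method would iterate
# with t quantiles). Function names and example values are illustrative.
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution

def n_instances(d, alpha, power):
    """Approximate number of instances for a paired test to detect a
    standardized mean difference of at least d (two-sided) with the given
    significance level and power."""
    z_a = Z.inv_cdf(1 - alpha / 2)
    z_b = Z.inv_cdf(power)
    return math.ceil(((z_a + z_b) / d) ** 2)

def holm_design_sizes(d, alpha, power, K):
    """Sample sizes implied by the K Holm-adjusted levels alpha/K, ...,
    alpha/1. The largest value corresponds to the worst-case power
    requirement, the smallest to the best case."""
    return [n_instances(d, alpha / (K - i), power) for i in range(K)]

def optimal_run_ratio(s1, s2):
    """Ratio n1/n2 of runs per instance that minimises n1 + n2 for a target
    standard error of the estimated simple difference of means, given the
    within-instance standard deviations s1 and s2 (classical optimal
    allocation: n1/n2 = s1/s2)."""
    return s1 / s2

if __name__ == "__main__":
    # Detect standardized differences d >= 0.5 with 80% power at a
    # familywise significance level of 0.05 across K = 20 comparisons.
    ns = holm_design_sizes(d=0.5, alpha=0.05, power=0.8, K=20)
    print("worst-case n:", max(ns), "| best-case n:", min(ns),
          "| mean-case n:", round(sum(ns) / len(ns)))
    print("optimal n1/n2 for s1=2.0, s2=1.0:", optimal_run_ratio(2.0, 1.0))
```

Roughly speaking, designing for the mean- or median-case power amounts to choosing the number of instances so that the average (or median) of the per-comparison powers across the K Holm-adjusted levels reaches the target, rather than requiring it of the most stringent level alone.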
