Comparing the results of replications in software engineering

Context: It has been argued that software engineering replications are useful for verifying the results of previous experiments. However, there is as yet no agreement on how to check whether results hold across replications. Moreover, some authors suggest that replications that do not verify the results of previous experiments can be used to identify the contextual variables causing the discrepancies.
Objective: To study how to assess the (dis)similarity of the results of software engineering replications when they are compared to verify the results of previous experiments, and to understand how to identify whether contextual variables are influencing results.
Method: We run simulations to learn how different ways of comparing replication results behave when verifying the results of previous experiments. We illustrate how to deal with context-induced changes by analyzing three groups of replications from our own research on test-driven development and testing techniques.
Results: The direct comparison of p-values and effect sizes does not appear to be suitable for verifying the results of previous experiments or for examining the variables that may affect results in software engineering. Analytical methods such as meta-analysis should be used instead to assess the similarity of software engineering replication results and to identify discrepancies.
Conclusion: The result achieved in a baseline experiment should no longer be regarded as a finding that needs to be reproduced, but as a small piece of evidence within a larger picture that only emerges after assembling many small pieces to complete the puzzle.
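To make the contrast concrete, the following is a minimal sketch, not the paper's actual simulation code: the sample sizes, the true effect, and the simulate_replication helper are illustrative assumptions. It simulates several small replications of a two-group experiment that share a single true effect, then compares the naive "do the p-values agree?" check against an inverse-variance (fixed-effect) meta-analytic pooling of the standardized effect sizes.

```python
# Sketch: why p-value agreement is a poor replication check compared
# to meta-analytic pooling. Setup and sample sizes are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def simulate_replication(n_per_group=15, true_effect=0.5):
    """One two-group experiment; returns (p-value, Cohen's d, variance of d)."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
    d = (treatment.mean() - control.mean()) / pooled_sd
    # Large-sample approximation of the variance of Cohen's d
    var_d = 2 / n_per_group + d**2 / (4 * n_per_group)
    return p, d, var_d

k = 5  # number of replications
results = [simulate_replication() for _ in range(k)]
p_values = np.array([r[0] for r in results])
effects = np.array([r[1] for r in results])
variances = np.array([r[2] for r in results])

# Naive "verification": count which replications are significant at 0.05.
print("significant replications:", int((p_values < 0.05).sum()), "of", k)

# Fixed-effect meta-analysis: inverse-variance weighted pooled effect.
weights = 1.0 / variances
pooled = (weights * effects).sum() / weights.sum()
pooled_se = np.sqrt(1.0 / weights.sum())
print(f"pooled d = {pooled:.2f} "
      f"(95% CI {pooled - 1.96 * pooled_se:.2f} to {pooled + 1.96 * pooled_se:.2f})")
```

With 15 subjects per group and a medium true effect, per-replication power is low (roughly 0.25), so the significance vote flips from run to run even though every replication estimates the same underlying effect; the pooled estimate, by contrast, stabilizes as replications accumulate, which is the behavior the abstract's Results section points to.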
