A Procedure and Guidelines for Analyzing Groups of Software Engineering Replications

Context: Researchers from different groups and institutions are collaborating to build groups of experiments by means of replication (i.e., conducting groups of replications). Disparate aggregation techniques are being applied to analyze groups of replications. Applying unsuitable aggregation techniques may undermine the potential of groups of replications to provide in-depth insights from experimental results.

Objectives: Provide an analysis procedure, with a set of embedded guidelines, for aggregating software engineering (SE) replication results.

Method: We compare the characteristics of groups of replications in SE with those in mature experimental disciplines such as medicine and pharmacology. In view of these differences, the limitations affecting the joint data analysis of groups of SE replications, and the guidelines that mature experimental disciplines provide for analyzing groups of replications, we build an analysis procedure with a set of embedded guidelines tailored specifically to groups of SE replications. We apply the proposed procedure to a representative group of SE replications to illustrate its use.

Results: All the information contained in the raw data should be leveraged when aggregating replication results. The proposed analysis procedure encourages the use of stratified individual participant data and aggregated data in tandem to analyze groups of SE replications.

Conclusion: The aggregation techniques used to analyze groups of replications should be justified in research articles; this increases the reliability and transparency of joint results. The proposed guidelines should ease this endeavor.
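
The Results statement recommends analyzing the raw (participant-level) data and the per-replication summaries side by side. The sketch below illustrates, under assumptions that are not taken from the paper, what such a two-pronged analysis could look like in Python: a one-stage model stratified by replication fitted to simulated individual participant data, and a two-stage DerSimonian-Laird random-effects pooling of per-replication mean differences. The simulated data, the column names (`outcome`, `treatment`, `replication`), and the chosen effect measure are illustrative assumptions only, not the authors' exact procedure.

```python
# Minimal sketch (illustrative assumptions, not the paper's procedure) of analyzing
# a group of replications two ways: stratified individual participant data (IPD)
# and aggregated data (AD) with random-effects pooling.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulated raw data from three replications of a two-treatment experiment
# (hypothetical true effect of 0.5, with replication-specific baseline shifts).
frames = []
for i, shift in enumerate([0.0, 0.3, -0.2]):
    n = 24  # participants per replication
    treatment = rng.integers(0, 2, size=n)
    outcome = 0.5 * treatment + shift + rng.normal(0, 1, size=n)
    frames.append(pd.DataFrame({"replication": f"R{i + 1}",
                                "treatment": treatment,
                                "outcome": outcome}))
ipd = pd.concat(frames, ignore_index=True)

# One-stage, stratified IPD analysis: replication enters as a factor, so the
# treatment effect is estimated within replications rather than across them.
ipd_model = smf.ols("outcome ~ treatment + C(replication)", data=ipd).fit()
print(f"IPD estimate: {ipd_model.params['treatment']:.3f} "
      f"(SE {ipd_model.bse['treatment']:.3f})")

# Two-stage AD analysis: per-replication mean differences pooled with a
# DerSimonian-Laird random-effects model, implemented directly with numpy.
effects, variances = [], []
for _, g in ipd.groupby("replication"):
    t, c = g[g.treatment == 1].outcome, g[g.treatment == 0].outcome
    effects.append(t.mean() - c.mean())
    variances.append(t.var(ddof=1) / len(t) + c.var(ddof=1) / len(c))
effects, variances = np.array(effects), np.array(variances)

w = 1.0 / variances                                  # fixed-effect weights
q = np.sum(w * (effects - np.sum(w * effects) / w.sum()) ** 2)
tau2 = max(0.0, (q - (len(effects) - 1)) / (w.sum() - np.sum(w ** 2) / w.sum()))
w_re = 1.0 / (variances + tau2)                      # random-effects weights
pooled = np.sum(w_re * effects) / w_re.sum()
pooled_se = np.sqrt(1.0 / w_re.sum())
print(f"AD pooled estimate: {pooled:.3f} (SE {pooled_se:.3f}, tau^2 {tau2:.3f})")
```

Stratifying the IPD model by replication (or, alternatively, fitting a mixed model with a random intercept per replication) keeps treatment comparisons within replications, while the two-stage pooling makes between-replication heterogeneity explicit; the abstract's point is that both views should be used in tandem rather than choosing one.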
