The role and value of replication in empirical software engineering results

Abstract

Context: Concerns have been raised from many quarters regarding the reliability of empirical research findings, including those in software engineering. Replication has been proposed as an important means of increasing confidence.

Objective: We aim to better understand the value of replication studies, the level of confirmation between replication and original studies, what confirmation means in a statistical sense, and what factors modify this relationship.

Method: We perform a systematic review to identify relevant replication experimental studies in the areas of (i) software project effort prediction and (ii) pair programming. Where sufficient details are provided, we compute prediction intervals.

Results: Our review locates 28 unique articles that describe replications of 35 original studies addressing 75 research questions. Of these, 10 are external, 15 internal and 3 internal-same-article replications. The odds ratio of internal replications to external replications (those conducted by independent researchers) obtaining a ‘confirmatory’ result is 8.64. We also find that incomplete reporting hampers our ability to extract estimates of effect sizes. Where we are able to compute replication prediction intervals, these are surprisingly wide.

Conclusion: We show that there is substantial evidence that current approaches to empirical replications are highly problematic. There is a consensus that replications are important, but better reporting of both original and replicated studies is needed. Given the low power and incomplete reporting of many original studies, it can be unclear to what extent a replication is confirmatory and to what extent it yields additional knowledge to the software engineering community. We recommend that attention be switched from replication research to meta-analysis.
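The abstract names two statistics, an odds ratio and a replication prediction interval. As a rough illustration only, they might be computed as in the sketch below. The 2x2 counts, sample sizes and effect size are hypothetical and are not taken from the paper. The interval uses one common formulation for a standardized mean difference (original estimate plus or minus z times the square root of the summed sampling variances), which is an assumption about the approach rather than the authors' exact procedure.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a, b: internal replications that did / did not confirm (hypothetical counts)
    c, d: external replications that did / did not confirm (hypothetical counts)
    """
    return (a * d) / (b * c)


def d_variance(d, n1, n2):
    """Large-sample approximation to the sampling variance of Cohen's d."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def replication_prediction_interval(d_orig, n_orig, n_rep, z=1.96):
    """Approximate 95% prediction interval for a replication's effect size.

    Assumes both studies estimate the same true effect, so the difference
    between the two estimates has variance var_orig + var_rep.
    n_orig and n_rep are (group 1 size, group 2 size) tuples.
    """
    var_orig = d_variance(d_orig, *n_orig)
    var_rep = d_variance(d_orig, *n_rep)
    half_width = z * math.sqrt(var_orig + var_rep)
    return d_orig - half_width, d_orig + half_width


if __name__ == "__main__":
    # Hypothetical table: 9 of 10 internal and 5 of 10 external replications confirm.
    print(odds_ratio(9, 1, 5, 5))
    # Hypothetical original study: d = 0.5 with 20 participants per group,
    # replicated with the same group sizes.
    print(replication_prediction_interval(0.5, (20, 20), (20, 20)))
```

With group sizes this small the interval spans zero, which illustrates why a replication that finds no effect need not contradict a low-powered original study.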
