Repeatability of systematic literature reviews

Background: One of the anticipated benefits of systematic literature reviews (SLRs) is that they can be conducted in an auditable way to produce repeatable results.

Aim: This study aims to identify the conditions under which SLRs in software engineering are likely to be stable with respect to the primary studies selected. The condition we investigate in this report is novice researchers undertaking searches with a common goal.

Method: We undertook a participant-observer multi-case study of the repeatability of systematic literature reviews. The "cases" were the early stages, involving identification of relevant literature, of two SLRs of unit testing methods, performed independently by two novice researchers. The SLRs were restricted to the ACM and IEEE digital libraries for the years 1986-2005 so that their results could be compared with a published expert literature review of unit testing papers.

Results: The two SLRs selected very different sets of papers, with only six of 32 papers in common, and both differed substantially from the published secondary study, each finding only three of its 21 papers. Of the 29 additional papers found by the novice researchers, only 10 were judged relevant. Those 10 relevant papers would have affected the results of the published study by adding three new categories to its framework and populating three otherwise empty cells.

Conclusions: For novice researchers, having broadly the same research question does not guarantee repeatability with respect to the primary studies selected. Systematic reviewers must report their search process fully, or their reviews will not be repeatable. Missing papers can have a significant impact on the stability of a secondary study's results.
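
The overlap figures reported above amount to a simple set comparison between the papers each reviewer selected. The sketch below (Python) shows one way such overlap, and a Jaccard-style agreement score, might be computed; the paper identifiers are placeholders for illustration only, not the study's actual lists.

```python
# Minimal sketch of comparing two reviewers' selected primary studies.
# The identifiers below are hypothetical placeholders, not the papers
# selected in the study described above.

def overlap_report(set_a: set[str], set_b: set[str]) -> None:
    """Print how many papers two reviewers selected in common."""
    common = set_a & set_b          # papers both reviewers selected
    distinct = set_a | set_b        # all distinct papers selected by either
    jaccard = len(common) / len(distinct) if distinct else 0.0
    print(f"Reviewer A selected {len(set_a)} papers; Reviewer B selected {len(set_b)}.")
    print(f"In common: {len(common)} of {len(distinct)} distinct papers "
          f"(Jaccard agreement {jaccard:.2f}).")

if __name__ == "__main__":
    reviewer_a = {"paper-01", "paper-02", "paper-03", "paper-04"}
    reviewer_b = {"paper-03", "paper-04", "paper-05"}
    overlap_report(reviewer_a, reviewer_b)
```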
