Can we evaluate the quality of software engineering experiments?

Context: We wanted to assess whether the quality of published human-centric software engineering experiments was improving, which required a reliable means of evaluating the quality of such experiments. Aims: The aims of the study were to confirm the usability of a quality evaluation checklist, to determine how many reviewers are needed per paper reporting an experiment, and to specify an appropriate process for evaluating quality. Method: Eight reviewers applied a quality checklist comprising nine questions to four papers describing human-centric software engineering experiments. The study was conducted in two parts: the first based on individual assessments and the second on collaborative evaluations. Results: Inter-rater reliability was poor for individual assessments but much better for joint evaluations. Four reviewers working in two pairs with discussion were more reliable than eight reviewers with no discussion. The sum of the nine criteria was more reliable than individual questions or a simple overall assessment. Conclusions: If quality evaluation is critical, more than two reviewers are required and a round of discussion is necessary. We advise using quality criteria and basing the final assessment on the sum of the aggregated criteria. Our results are limited by the small number of papers used and by the relatively extensive expertise of the reviewers. In addition, the results of the second part of the study could have been affected by the removal of the time restriction on the review as well as by the consultation process.
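
The core comparison in the study is between agreement on individual checklist questions and agreement on the summed checklist score. The following is a minimal, illustrative Python sketch of how such a comparison could be set up for two reviewers; the 0/1 score matrices, the use of Cohen's kappa per question, and the mean-absolute-difference summary of the totals are assumptions made for illustration and do not reproduce the statistics reported in the paper.

```python
# Illustrative sketch (not the paper's actual analysis): comparing per-question
# agreement with agreement on the summed checklist score for two reviewers.
# The score matrices below are made-up placeholders (4 papers x 9 criteria, 0/1).

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical ratings."""
    assert len(a) == len(b)
    n = len(a)
    categories = set(a) | set(b)
    observed = sum(x == y for x, y in zip(a, b)) / n
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if expected == 1.0:          # all ratings identical by chance model: kappa undefined
        return 1.0 if observed == 1.0 else 0.0
    return (observed - expected) / (1 - expected)

# Hypothetical yes/no (1/0) answers to the nine checklist questions for four papers.
reviewer_1 = [
    [1, 1, 0, 1, 1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 1, 0, 1, 0, 0],
]
reviewer_2 = [
    [1, 1, 1, 1, 1, 0, 1, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0, 1, 0, 0],
]

# Agreement per individual question, pooled over the four papers.
for q in range(9):
    a = [paper[q] for paper in reviewer_1]
    b = [paper[q] for paper in reviewer_2]
    print(f"question {q + 1}: kappa = {cohen_kappa(a, b):.2f}")

# Agreement on the summed score (one total per paper), summarised here as the
# mean absolute difference between the two reviewers' totals.
totals_1 = [sum(paper) for paper in reviewer_1]
totals_2 = [sum(paper) for paper in reviewer_2]
mad = sum(abs(x - y) for x, y in zip(totals_1, totals_2)) / len(totals_1)
print(f"summed scores: reviewer 1 {totals_1}, reviewer 2 {totals_2}, "
      f"mean absolute difference = {mad:.2f}")
```

With real data the same layout extends to more reviewers and to the intraclass-correlation or concordance-style statistics more commonly reported for agreement studies of this kind.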
