Three empirical studies on the agreement of reviewers about the quality of software engineering experiments

Context: During systematic literature reviews it is necessary to assess the quality of empirical papers. Current guidelines suggest that two researchers should independently apply a quality checklist and that any disagreements must be resolved. However, there is little empirical evidence concerning the effectiveness of these guidelines.

Aims: This paper investigates three techniques that can be used to improve the reliability (i.e. the consensus among reviewers) of quality assessments: the number of reviewers, the use of a set of evaluation criteria, and consultation among reviewers. We undertook a series of studies to investigate these factors.

Method: Two studies involved four research papers and eight reviewers using a quality checklist with nine questions. The first study was based on individual assessments; the second on joint assessments with a period of inter-rater discussion. A third, more formal randomised block experiment involved 48 reviewers assessing two of the papers used previously, in teams of one, two, and three persons, to assess the impact of discussion in teams of different sizes, using the evaluations of the "teams" of one person as a control.

Results: In the first two studies, inter-rater reliability was poor for individual assessments but better for joint evaluations. However, the results of the third study contradicted those of Study 2: inter-rater reliability was poor for all groups, and worse for teams of two or three than for individuals.

Conclusions: When performing quality assessments for systematic literature reviews, we recommend using three independent reviewers and adopting the median assessment. A quality checklist seems useful, but it is difficult to ensure that the checklist is both appropriate and understood by reviewers. Furthermore, future experiments should ensure participants are given more time to understand the quality checklist and to evaluate the research papers.
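The recommended procedure — three independent reviewers, with the median taken as the consensus assessment — can be sketched as follows. This is a minimal illustration only: the 1/0.5/0 scoring scale and the reviewer scores are hypothetical, not taken from the studies themselves.

```python
from statistics import median

# Hypothetical quality-checklist scores (1 = yes, 0.5 = partial, 0 = no)
# from three independent reviewers over a nine-question checklist.
reviewer_scores = [
    [1, 1, 0.5, 0, 1, 0.5, 1, 0, 1],    # reviewer A
    [1, 0.5, 0.5, 0, 1, 1, 1, 0.5, 1],  # reviewer B
    [0.5, 1, 0, 0, 1, 0.5, 1, 0, 0.5],  # reviewer C
]

# Median assessment per checklist question, then an overall quality score.
median_per_question = [median(q) for q in zip(*reviewer_scores)]
overall_quality = sum(median_per_question)

print(median_per_question)  # per-question consensus
print(overall_quality)      # paper-level quality score
```

With an odd number of reviewers the median needs no tie-breaking rule, which is one practical advantage of three reviewers over two.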
