Using blind analysis for software engineering experiments

Context: In recent years there has been growing concern about conflicting experimental results in empirical software engineering, paralleled by a growing awareness of how researcher bias can distort results. Objective: To explore the practicalities of blind analysis of experimental results as a means of reducing bias. Method: We apply blind analysis to a real software engineering experiment that compares three feature weighting approaches against a naïve benchmark (the sample mean) on the Finnish software effort data set, and we use this experiment to explore blind analysis as a method for reducing researcher bias. Results: Our experience shows that blinding can be a relatively straightforward procedure. We also highlight several statistical analysis decisions that ought not to be guided by the hunt for statistical significance, and show that results can be inverted merely by a seemingly inconsequential statistical choice (the degree of trimming). Conclusion: Whilst there are minor challenges and some limits to the degree of blinding possible, blind analysis is a practical and easy-to-implement method that supports more objective analysis of experimental results. We therefore argue that blind analysis should become the norm for analysing software engineering experiments.
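As a rough sketch of how such blinding might be set up in practice (the function, data, and treatment names below are hypothetical illustrations, not the study's actual tooling), a third party can replace the true treatment labels with neutral codes before the analyst ever sees the data, revealing the key only once the analysis is fixed:

    import random

    def blind_labels(rows, treatments, seed=None):
        # Map each true treatment name to a neutral code ("A", "B", ...) in a
        # shuffled order, so the analyst cannot guess which code is which.
        rng = random.Random(seed)
        shuffled = list(treatments)
        rng.shuffle(shuffled)
        key = {name: chr(ord("A") + i) for i, name in enumerate(shuffled)}
        blinded = [dict(row, treatment=key[row["treatment"]]) for row in rows]
        return blinded, key  # the key stays sealed until the analysis is frozen

    # Hypothetical usage: four conditions echoing the study's design, toy errors.
    rows = [
        {"treatment": "weighting_1", "error": 0.42},
        {"treatment": "weighting_2", "error": 0.38},
        {"treatment": "weighting_3", "error": 0.51},
        {"treatment": "mean_benchmark", "error": 0.47},
    ]
    blinded_rows, sealed_key = blind_labels(
        rows, ["weighting_1", "weighting_2", "weighting_3", "mean_benchmark"], seed=1)

The analyst then works only with blinded_rows; sealed_key is held by a colleague and opened only after all analysis decisions (choice of tests, trimming levels, outlier handling) have been committed.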

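The sensitivity to trimming reported in the Results can be seen with a toy example (the error values below are invented purely for illustration and are unrelated to the Finnish data set): with no trimming one method appears better, while a 20% trimmed mean inverts the ranking.

    from scipy.stats import trim_mean

    # Invented prediction-error samples: A is mostly accurate with one gross
    # outlier, B is uniformly mediocre. Which looks "better" depends on trimming.
    errors_a = [0.10, 0.12, 0.15, 0.18, 0.20, 2.50]
    errors_b = [0.25, 0.26, 0.27, 0.28, 0.29, 0.30]

    for prop in (0.0, 0.2):  # proportion trimmed from each tail
        ta = trim_mean(errors_a, prop)
        tb = trim_mean(errors_b, prop)
        print(f"trim={prop:.1f}: A={ta:.3f}, B={tb:.3f} -> "
              f"{'A' if ta < tb else 'B'} looks better")

    # With no trimming B wins (A's outlier dominates the mean);
    # with 20% trimming the outlier is cut and A wins.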