Effect size analysis

When we seek insight from collected data, we are usually forced to limit our measurements to a sample of all individuals that could hypothetically be observed. Nevertheless, as researchers, we want to draw conclusions that hold beyond the restricted subset we are currently analyzing. Statistical significance testing is a fundamental data-analysis technique that lets us infer conclusions about the entire population from such a sample. However, the outcome of these tests depends on several factors, such as sample size. Software engineering experiments often address similar research questions but vary with respect to those factors, for example, in the number of subjects or the measurements taken. Hence, statistical significance alone is insufficient for interpreting findings across studies. This paper describes how significance testing can be extended by an analysis of the magnitude of an observed effect, i.e., its effect size, which allows the results of different studies to be abstracted and compared.
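The sketch below is a minimal illustration of this idea, not the paper's own procedure: it pairs a two-sample t-test (significance) with Cohen's d (effect size) on two hypothetical groups of measurements. The data, group names, and choice of Cohen's d as the effect-size measure are assumptions made for the example.

```python
# Minimal sketch (hypothetical, not from the paper): report an effect size
# alongside the p-value of a significance test.
import math
from statistics import mean, stdev
from scipy import stats

def cohens_d(a, b):
    """Cohen's d: the standardized difference between two sample means."""
    n1, n2 = len(a), len(b)
    s1, s2 = stdev(a), stdev(b)  # sample standard deviations
    pooled = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
    return (mean(a) - mean(b)) / pooled

# Hypothetical measurements, e.g., defect counts under two treatments.
group_a = [12, 15, 14, 10, 13, 16, 11]
group_b = [9, 11, 10, 8, 12, 10, 9]

t_stat, p_value = stats.ttest_ind(group_a, group_b)  # significance: is there an effect?
d = cohens_d(group_a, group_b)                       # magnitude: how large is it?

print(f"p = {p_value:.4f}, d = {d:.2f}")
```

Under Cohen's widely used conventions, |d| near 0.2, 0.5, and 0.8 is read as a small, medium, or large effect, respectively. Unlike a p-value, this magnitude does not shrink or grow merely because the sample size changes, which is what makes it suitable for comparing studies that vary in size.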
