Statistical significance testing--a panacea for software technology experiments?

Empirical software engineering has a long history of using statistical significance testing, and in many ways the technique has become the backbone of the field. What is less obvious is how much consideration has been given to its adoption. Statistical significance testing was originally devised for testing hypotheses in a very different domain, so the question must be asked: does it transfer to empirical software engineering research? This paper attempts to answer that question. It finds that the transfer is far from straightforward, causing several problems in deployment. These problems arise principally in formulating hypotheses, in calculating probability (p) values and choosing the associated cut-off value, and in constructing the sample and establishing its distribution. The paper therefore concludes that the field should explore other avenues of analysis, in an attempt to establish which analysis approaches are preferable under which conditions when conducting empirical software engineering studies.
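One of the difficulties the abstract names, the p-value and its fixed cut-off, can be illustrated with a minimal sketch: under a conventional two-sample z-test, a practically negligible effect will eventually cross any fixed significance threshold once the sample is large enough. The function name `two_sample_p`, the effect size of 0.05 standard deviations, and the per-group sizes below are illustrative assumptions, not values taken from the paper.

```python
import math

def two_sample_p(effect, sigma, n):
    """Two-sided p-value for a two-sample z-test with known variance.

    effect: true difference in group means (illustrative assumption);
    sigma: common standard deviation; n: observations per group.
    """
    z = effect / math.sqrt(2 * sigma ** 2 / n)
    # Standard normal CDF via the error function.
    phi = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - phi)

# A tiny effect (0.05 SD) is "non-significant" at small n but
# comfortably beats the 0.05 cut-off at large n.
for n in (100, 1_000, 10_000):
    print(n, round(two_sample_p(0.05, 1.0, n), 4))
```

The point is not that large samples are bad, but that a fixed cut-off conflates statistical detectability with practical importance, which is one reason the paper argues the cut-off convention transfers poorly into software engineering experiments.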
