Crossover Designs in Software Engineering Experiments: Benefits and Perils

In experiments with a crossover design, subjects apply more than one treatment. Crossover designs are widespread in software engineering (SE) experimentation: they require fewer subjects and control the variability among subjects. However, some researchers disapprove of crossover designs. The main criticisms are the carryover threat and the troublesome analysis that crossover designs require. Carryover is the persistence of the effect of one treatment when another treatment is applied later; it may invalidate the results of an experiment. Additionally, crossover designs are often improperly designed and/or analysed, limiting the validity of the results. In this paper, we aim to make SE researchers aware of the perils of crossover experiments and to provide good practices for risk avoidance. We study how another discipline (medicine) runs crossover experiments. We review the SE literature, discuss which good practices tend not to be adhered to, and give advice on how they should be applied in SE experiments. We illustrate the concepts discussed by analysing a crossover experiment that we ran. We conclude that crossover experiments can yield valid results, provided they are properly designed and analysed, and that, if correctly addressed, carryover is no worse than other validity threats.

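A proper crossover analysis must account for the fact that each subject contributes a measurement under every treatment, so observations are not independent. The sketch below is an illustration only, not the paper's actual analysis: it simulates a 2x2 (AB/BA) crossover and fits a linear mixed model with subject as a random effect and treatment, period, and sequence as fixed effects. All variable names (score, treatment, period, sequence, subject) are hypothetical. A significant sequence effect is the usual warning sign of carryover.

```python
# Minimal sketch of analysing a 2x2 (AB/BA) crossover experiment with a
# linear mixed model. Hypothetical data and variable names; this is not
# the analysis reported in the paper.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)

# Simulate 20 subjects: half assigned to sequence AB, half to BA.
n = 20
subject = np.repeat(np.arange(n), 2)                      # each subject measured twice
sequence = np.repeat(["AB"] * (n // 2) + ["BA"] * (n // 2), 2)
period = np.tile([1, 2], n)
# Sequence AB receives treatment A in period 1; sequence BA receives A in period 2.
treatment = np.where((sequence == "AB") == (period == 1), "A", "B")

subject_effect = np.repeat(rng.normal(0, 2, n), 2)        # between-subject variability
treatment_effect = np.where(treatment == "A", 0.0, 1.5)   # assumed true effect of B over A
score = (10 + treatment_effect + 0.5 * (period - 1)      # small period (learning) effect
         + subject_effect + rng.normal(0, 1, 2 * n))      # residual noise

data = pd.DataFrame({"subject": subject, "sequence": sequence,
                     "period": period, "treatment": treatment, "score": score})

# Subject is a random effect; treatment, period and sequence are fixed effects.
# A significant sequence coefficient would suggest carryover between periods.
model = smf.mixedlm("score ~ treatment + C(period) + sequence",
                    data, groups=data["subject"])
result = model.fit()
print(result.summary())
```

The design choice worth noting is that the within-subject correlation is handled by the random subject intercept rather than by averaging periods, which preserves the efficiency advantage that motivates crossover designs in the first place.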