Guidelines for Coverage-Based Comparisons of Non-Adequate Test Suites

A fundamental question in software testing research is how to compare test suites, often as a means for comparing test-generation techniques that produce those test suites. Researchers frequently compare test suites by measuring their coverage. A coverage criterion C provides a set of test requirements and measures how many requirements a given suite satisfies. A suite that satisfies 100% of the feasible requirements is called C-adequate. Previous rigorous evaluations of coverage criteria mostly focused on such adequate test suites: given two criteria C and C′, are C-adequate suites on average more effective than C′-adequate suites? However, in many realistic cases, producing adequate suites is impractical or even impossible. This article presents the first extensive study that evaluates coverage criteria for the common case of non-adequate test suites: given two criteria C and C′, which one is better to use to compare test suites? Namely, if suites T1, T2,…,Tn have coverage values c1, c2,…,cn for C and c1′, c2′,…,cn′ for C′, is it better to compare suites based on c1, c2,…,cn or based on c1′, c2′,…,cn′? We evaluate a large set of plausible criteria, including basic criteria such as statement and branch coverage, as well as stronger criteria used in recent studies, including criteria based on program paths, equivalence classes of covered statements, and predicate states. The criteria are evaluated on a set of Java and C programs with both manually written and automatically generated test suites. The evaluation uses three correlation measures. Based on these experiments, two criteria perform best: branch coverage and an intraprocedural acyclic path coverage. We provide guidelines for testing researchers aiming to evaluate test suites using coverage criteria as well as for other researchers evaluating coverage criteria for research use.
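The comparison described above can be illustrated with a minimal sketch. It assumes that each suite has an independently measured effectiveness score (for example, a mutation score) and that a criterion is better when its coverage values rank the suites more like the effectiveness scores do. Kendall's τ-b is used here only as one plausible rank correlation; the abstract does not name the three measures used in the study. The function name `kendall_tau_b` and all data values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both variables: contributes to neither count
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # both variables order this pair the same way
        else:
            discordant += 1  # the variables order this pair in opposite ways
    # Pairs not tied in x times pairs not tied in y.
    denom = sqrt((concordant + discordant + tied_y_only)
                 * (concordant + discordant + tied_x_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical data for five suites T1..T5 (not taken from the study).
coverage_c = [0.40, 0.55, 0.60, 0.75, 0.90]        # c1..c5 under criterion C
coverage_c_prime = [0.70, 0.50, 0.65, 0.60, 0.80]  # c1'..c5' under criterion C'
effectiveness = [0.30, 0.45, 0.50, 0.70, 0.85]     # assumed: e.g. mutation score

tau_c = kendall_tau_b(coverage_c, effectiveness)
tau_c_prime = kendall_tau_b(coverage_c_prime, effectiveness)
print(f"tau(C)  = {tau_c:.2f}")
print(f"tau(C') = {tau_c_prime:.2f}")
```

In this toy data, C orders the suites exactly as the effectiveness score does, while C′ disagrees on several pairs, so under this sketch's assumptions C would be the preferable criterion for comparing these suites.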
