Coverage and Its Discontents

Everyone wants to know one thing about a test suite: will it detect enough bugs? Unfortunately, in most settings that matter, answering this question directly is impractical or impossible. Software engineers and researchers therefore tend to rely on various measures of code coverage (where mutation testing is considered a form of syntactic coverage). A long line of academic research efforts has attempted to determine whether relying on coverage as a substitute for fault detection is a reasonable solution to the problems of test suite evaluation. This essay argues that the profusion of coverage-related literature is in part a sign of an underlying uncertainty as to what exactly measuring coverage should achieve, and how we would know whether it can, in fact, achieve it. We propose some solutions and mitigations, but the primary focus of this essay is to clarify the current state of confusion regarding this key problem for effective software testing.
