Quantifying, Characterizing, and Mitigating Flakily Covered Program Elements

Code coverage measures the degree to which source code elements (e.g., statements, branches) are invoked during testing. Despite growing evidence that coverage is a problematic measurement, it is often used to decide where testing effort should be invested; for example, using coverage as a guide, tests should be written to invoke the uncovered program elements. At their core, coverage measurements assume that the invocation of a program element by any test is equally valuable. Yet in reality, some tests are more robust than others. As a concrete instance of this, we posit that program elements that are only covered by flaky tests, i.e., tests with non-deterministic behaviour, also warrant additional testing effort. In this paper, we set out to quantify, characterize, and mitigate “flakily covered” program elements (i.e., those elements that are only covered by flaky tests). To that end, we perform an empirical study of three large software systems from the OpenStack community. In terms of quantification, we find that systems are disproportionately impacted by flakily covered statements: 5% and 10% of the covered statements in Nova and Neutron, respectively, are flakily covered, while less than 1% of Cinder statements are. In terms of characterization, we find that incidences of flakily covered statements cannot be well explained solely by code characteristics, such as dispersion, ownership, and development activity. In terms of mitigation, we propose GreedyFlake, a test effort prioritization algorithm that maximizes return on investment when tackling flakily covered program elements. We find that GreedyFlake outperforms baseline approaches by at least eight percentage points of Area Under the Cost Effectiveness Curve.
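
The abstract does not describe GreedyFlake's inputs or cost model, but its stated goal (maximizing return on investment when addressing flakily covered statements) suggests a greedy, set-cover-style prioritization. The Python sketch below is only an illustration of that idea under assumed inputs; the function, parameter names, and per-test cost model are hypothetical and are not taken from the paper.

```python
# Hypothetical sketch of a greedy, set-cover-style test effort prioritization
# in the spirit of GreedyFlake. All names and the cost model are illustrative
# assumptions, not the paper's actual algorithm.

def greedy_prioritize(flaky_tests, flakily_covered, cost):
    """Order flaky tests so that each step maximizes the number of newly
    addressed flakily covered statements per unit of (assumed) effort.

    flaky_tests      -- iterable of test identifiers
    flakily_covered  -- dict: test id -> set of statements covered only by that flaky test
    cost             -- dict: test id -> estimated effort to stabilize or replace the test
    """
    remaining = set().union(*flakily_covered.values()) if flakily_covered else set()
    candidates = set(flaky_tests)
    order = []
    while remaining and candidates:
        # Pick the candidate with the best coverage-gain-to-cost ratio.
        best = max(
            candidates,
            key=lambda t: len(flakily_covered.get(t, set()) & remaining)
            / max(cost.get(t, 1), 1),
        )
        gained = flakily_covered.get(best, set()) & remaining
        if not gained:
            break  # no remaining candidate addresses any outstanding statement
        order.append(best)
        remaining -= gained
        candidates.remove(best)
    return order


if __name__ == "__main__":
    tests = ["test_a", "test_b", "test_c"]
    covered = {"test_a": {1, 2, 3}, "test_b": {3, 4}, "test_c": {5}}
    effort = {"test_a": 2, "test_b": 1, "test_c": 1}
    # e.g. ['test_b', 'test_a', 'test_c'] (ties may break in either order)
    print(greedy_prioritize(tests, covered, effort))
```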
