A Study on the Lifecycle of Flaky Tests

During regression testing, developers rely on the pass or fail outcomes of tests to check whether changes broke existing functionality. Flaky tests, which nondeterministically pass or fail on the same code, are therefore problematic because they provide misleading signals during regression testing. Although flaky tests are the focus of several existing studies, none of them study (1) the reoccurrence, runtimes, and time-before-fix of flaky tests, or (2) flaky tests in depth in proprietary projects. This paper fills this knowledge gap about flaky tests and investigates whether prior categorization work on flaky tests also applies to proprietary projects. Specifically, we study the lifecycle of flaky tests in six large-scale proprietary projects at Microsoft. We find, as in prior work, that asynchronous calls are the leading cause of flaky tests in these Microsoft projects. Therefore, we propose the first automated solution, called Flakiness and Time Balancer (FaTB), to reduce the frequency of flaky-test failures caused by asynchronous calls. Our evaluation of five such flaky tests shows that FaTB can reduce the running times of these tests by up to 78% without empirically affecting the frequency of their flaky-test failures. Lastly, our study finds several cases where developers claim they "fixed" a flaky test, but our empirical experiments show that their changes neither fix nor reduce these tests' frequency of flaky-test failures. Future studies should be more cautious when basing their results on changes that developers claim to be "fixes".
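To illustrate the asynchronous-call flakiness the abstract describes, the sketch below is a minimal, hypothetical example (it is not the paper's FaTB implementation, and all names are invented): a test that sleeps for a fixed interval before checking the result of an asynchronous task can flake whenever the task runs slower than expected, whereas polling with a timeout tolerates timing variance.

```python
import threading
import time

def start_async_task(result, delay=0.05):
    """Simulate an asynchronous call that writes its result after a delay."""
    def worker():
        time.sleep(delay)
        result["value"] = 42
    threading.Thread(target=worker).start()

def fixed_sleep_check(result, wait=0.01):
    """Flaky pattern: sleep a fixed time, then check.

    Fails nondeterministically whenever the async task takes longer
    than `wait`, even though the code under test is correct.
    """
    time.sleep(wait)
    return result.get("value") == 42

def polling_check(result, timeout=2.0, interval=0.01):
    """More robust pattern: poll for the result until a timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if result.get("value") == 42:
            return True
        time.sleep(interval)
    return False

if __name__ == "__main__":
    result = {}
    start_async_task(result)
    print(polling_check(result))  # waits just long enough for the task
```

Tools in the spirit of FaTB tune such waiting times to balance test runtime against the frequency of flaky failures; the polling variant here is simply one common manual mitigation.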
