An Empirical Study of Flaky Tests in Python

Tests that cause spurious failures without any code changes, i.e., flaky tests, hamper regression testing, increase maintenance costs, may shadow real bugs, and decrease trust in tests. While the prevalence and importance of flakiness are well established, prior research has focused on Java projects, raising the question of how well its findings generalize. To provide a better understanding of the role of flakiness in software development beyond Java, we empirically study the prevalence, causes, and degree of flakiness within software written in Python, currently one of the most popular programming languages. To this end, we sampled 22 352 open source projects from the popular PyPI package index and analyzed their 876 186 test cases for flakiness. Our investigation suggests that flakiness is as prevalent in Python as it is in Java. The reasons, however, are different: Order dependency is a much more dominant problem in Python, causing 59% of the 7 571 flaky tests in our dataset. Another 28% were caused by test infrastructure problems, which represent a previously undocumented cause of flakiness. The remaining 13% can mostly be attributed to the use of network and randomness APIs by the projects, which is indicative of the type of software commonly written in Python. Our data also suggests that finding flaky tests requires more reruns than are often performed in the literature: On average, 170 reruns would be required to be 95% confident that a passing test case is not flaky.
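The relationship between rerun count and confidence can be illustrated with a minimal sketch. It assumes a simplified model that is not taken from the study: a flaky test fails independently in each run with a fixed probability, so the chance of observing only passes in n runs is (1 - p)^n. The function name `reruns_needed` and its parameters are hypothetical, and the study's own averaging procedure may differ.

```python
import math


def reruns_needed(failure_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that a test failing with probability `failure_rate`
    per run shows at least one failure in n runs with probability
    >= `confidence`.

    Assumption (not from the study): runs are independent and the
    per-run failure probability is constant.
    """
    if not 0.0 < failure_rate < 1.0:
        raise ValueError("failure_rate must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Solve (1 - failure_rate) ** n <= 1 - confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - failure_rate))


if __name__ == "__main__":
    for rate in (0.5, 0.1, 0.0175):
        print(f"failure rate {rate}: {reruns_needed(rate)} reruns")
```

Under this simplified model, a rerun count on the order of 170 corresponds to a per-run failure probability of under 2%, which suggests why a handful of reruns can miss tests that fail only rarely.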
