Understanding Reproducibility and Characteristics of Flaky Tests Through Test Reruns in Java Projects

Flaky tests are tests that can non-deterministically pass and fail. They pose a major impediment to regression testing because they give an inconclusive assessment of whether recent code changes contain faults. Prior studies of flaky tests have proposed tools to detect them and have identified various sources of flakiness, e.g., order-dependent (OD) tests that deterministically fail for some orders of tests in a test suite but deterministically pass for other orders. Several of these studies have focused on OD tests. We focus on an important and under-explored source of flakiness: non-order-dependent tests that can non-deterministically pass and fail even for the same order of tests. Instead of using specialized tools that aim to detect flaky tests, we run tests using the build tool configured by the developers. Specifically, we perform our empirical evaluation on Java projects that rely on the Maven Surefire plugin to run tests. We re-execute each test suite 4,000 times, potentially in different test-class orders, and label a test as flaky if it has both pass and fail outcomes across these reruns. We obtain a dataset of 107 flaky tests and study various characteristics of these tests. We find that many tests previously called "non-order-dependent" actually do depend on the order and can fail at very different rates for different orders.
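The labeling criterion described above (flaky if both pass and fail outcomes occur across reruns) and the per-order failure-rate check can be sketched as follows. This is a minimal illustration, not the authors' actual tooling; the function names and the data layout (outcomes recorded as per-rerun pairs of test-class order and pass/fail result) are assumptions made for the example.

```python
from collections import defaultdict

def label_tests(rerun_results):
    """Classify each test from its rerun outcomes.

    rerun_results maps a test name to a list of (order_id, passed)
    pairs, one per rerun; order_id identifies the test-class order
    used in that rerun. A test is labeled flaky if it both passed
    and failed across the reruns, mirroring the paper's criterion.
    """
    labels = {}
    for test, outcomes in rerun_results.items():
        results = {passed for _, passed in outcomes}
        labels[test] = "flaky" if results == {True, False} else "stable"
    return labels

def failure_rates_by_order(outcomes):
    """Per-order failure rate for one test: used to check whether a
    supposedly non-order-dependent test in fact fails at very
    different rates under different test-class orders."""
    by_order = defaultdict(list)
    for order_id, passed in outcomes:
        by_order[order_id].append(passed)
    return {o: 1 - sum(runs) / len(runs) for o, runs in by_order.items()}
```

For example, a test that fails half the time under one order but never under another would be labeled flaky overall while showing order-dependent failure rates, which is exactly the phenomenon the study reports.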
