FlakeFlagger: Predicting Flakiness Without Rerunning Tests

When developers make changes to their code, they typically run regression tests to detect whether their recent changes (re)introduce any bugs. However, many tests are flaky: their outcomes can change non-deterministically, failing without apparent cause. Flaky tests are a significant nuisance in the development process, since they make it more difficult for developers to trust the outcome of their tests; hence, it is important to know which tests are flaky. The traditional approach to identifying flaky tests is to rerun them multiple times: if a test is observed both passing and failing on the same code, it is definitely flaky. We conducted a very large empirical study looking for flaky tests by rerunning the test suites of 24 projects 10,000 times each, and found that even with this many reruns, some previously identified flaky tests were still not detected. We propose FlakeFlagger, a novel approach that collects a set of features describing the behavior of each test, and then predicts which tests are likely to be flaky based on similar behavioral features. We found that FlakeFlagger correctly labeled as flaky at least as many tests as a state-of-the-art flaky test classifier, while reporting far fewer false positives. This lower false positive rate translates directly to saved time for researchers and developers who use the classification result to guide more expensive flaky test detection processes. Evaluated on our dataset of 23 projects with flaky tests, FlakeFlagger outperformed the prior approach (by F1 score) on 16 projects and tied on 4 projects. Our results indicate that this approach can be effective for identifying likely flaky tests prior to running time-consuming flaky test detectors.
