iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests

Regression testing is increasingly important with the wide use of continuous integration. A desirable requirement for regression testing is that a test failure reliably indicates a problem in the code under test rather than a false alarm from the test code or the testing infrastructure. However, some test failures are unreliable, stemming from flaky tests that can nondeterministically pass or fail for the same code under test. There are many types of flaky tests, with order-dependent tests being a prominent type. To help advance research on flaky tests, we present (1) a framework, iDFlakies, to detect and partially classify flaky tests; (2) a dataset of flaky tests in open-source projects; and (3) a study with our dataset. iDFlakies automates experimentation with our tool for Maven-based Java projects. Using iDFlakies, we build a dataset of 422 flaky tests, 50.5% of which are order-dependent and 49.5% of which are not. Our study of these flaky tests examines the prevalence of the two types of flaky tests, the probability that a test-suite run has at least one failure due to flaky tests, and how different test reorderings affect the number of detected flaky tests. We envision that our work can spur research to alleviate the problem of flaky tests.
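The key distinction in the dataset is between order-dependent and non-order-dependent flaky tests. As a minimal illustration (a hypothetical sketch, not taken from the paper or its dataset; the class name, test names, and shared state are invented for exposition), the JUnit 4 pair below shows how shared mutable state makes one test's outcome depend on whether another test ran before it, which is exactly the kind of dependence that rerunning a suite in different orders can expose:

```java
// Hypothetical sketch: a pair of JUnit 4 tests where one is order-dependent.
// The class, test names, and shared static state are illustrative only.
import static org.junit.Assert.assertEquals;

import java.util.HashMap;
import java.util.Map;

import org.junit.Test;

public class CacheTest {
    // Shared mutable state: one test pollutes it, the other implicitly assumes it is clean.
    static final Map<String, String> CACHE = new HashMap<>();

    @Test
    public void populateCache() {
        CACHE.put("key", "value");
        assertEquals(1, CACHE.size());
    }

    @Test
    public void cacheStartsEmpty() {
        // Passes when run first or in isolation; fails if populateCache already ran
        // in the same JVM, i.e., an order-dependent flaky test.
        assertEquals(0, CACHE.size());
    }
}
```

Depending on which test executes first, cacheStartsEmpty passes or fails even though the code under test is unchanged; comparing outcomes across different test orderings is the signal an order-based detector relies on to flag and classify such tests.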
