What is the Vocabulary of Flaky Tests?

Flaky tests are tests whose outcomes are non-deterministic. Despite the recent research activity on this topic, no effort has been made toward understanding the vocabulary of flaky tests. This work proposes to automatically classify tests as flaky or not based on their vocabulary. Static classification of flaky tests is important, for example, to detect the introduction of flaky tests and to search for flaky tests after they are introduced into regression test suites. We evaluated the performance of various machine learning algorithms on this problem. We constructed a data set of flaky and non-flaky tests by running each test case in a set of 64k tests 100 times (6.4 million test executions). We then applied machine learning techniques to the resulting data set to predict which tests are flaky from their source code. Based on features such as counts of stemmed tokens extracted from source-code identifiers, we achieved an F-measure of 0.95 for the identification of flaky tests. The best prediction performance was obtained with Random Forest and Support Vector Machines. Among the code identifiers most strongly associated with test flakiness, we found that job, action, and services commonly appear in flaky tests. Overall, our results provide initial yet strong evidence that static detection of flaky tests is effective.
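The abstract does not spell out the paper's exact pipeline, but the general idea of vocabulary-based classification can be sketched with scikit-learn: tokenize test source code into stemmed identifier fragments, build token-count feature vectors, and train a Random Forest on flaky/non-flaky labels. Everything below is a hypothetical illustration, including the toy tokenizer (a crude stand-in for a real stemmer such as Porter's), the four example test bodies, and the hyper-parameters; none of it is taken from the paper itself.

```python
# Minimal sketch of vocabulary-based flaky-test classification.
# All identifiers, training examples, and hyper-parameters below are
# hypothetical illustrations, not the paper's actual setup.
import re

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer


def tokenize(source):
    """Split identifiers on camelCase boundaries and crudely 'stem' each
    piece by lower-casing it and trimming a trailing 's' (a stand-in for
    a real stemmer such as Porter's)."""
    words = re.findall(r"[A-Za-z][a-z]*", source)
    return [w.lower().rstrip("s") for w in words]


# Tiny hypothetical corpus: test bodies labelled flaky (1) or stable (0).
tests = [
    "void testJobAction() { service.submitJob(); waitForJobs(); }",
    "void testAsyncServices() { executor.runActions(); sleep(100); }",
    "void testAddition() { assertEquals(4, add(2, 2)); }",
    "void testParsing() { assertEquals(3, parse(\"1+2\")); }",
]
labels = [1, 1, 0, 0]

# lowercase=False so camelCase boundaries survive until tokenize() runs;
# token_pattern=None silences the warning about the unused default pattern.
vectorizer = CountVectorizer(tokenizer=tokenize, token_pattern=None,
                             lowercase=False)
X = vectorizer.fit_transform(tests)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels)

# Classify an unseen test that mentions jobs and services.
new_test = "void testServiceJob() { service.scheduleJob(); }"
pred = clf.predict(vectorizer.transform([new_test]))[0]
print("flaky" if pred == 1 else "stable")
```

With a trained model of this shape, `clf.feature_importances_` can be inspected alongside `vectorizer.get_feature_names_out()` to see which vocabulary terms drive the predictions, which is analogous to how the paper surfaces identifiers such as job, action, and services as flakiness indicators.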
