What is the Vocabulary of Flaky Tests? An Extended Replication

Software systems have been continuously evolved and delivered with high quality due to the widespread adoption of automated tests. A recurring issue hurting this scenario is the presence of flaky tests: test cases that may pass or fail non-deterministically. A promising approach, though one still lacking empirical evidence, is to collect static data from automated tests and use it to predict their flakiness. In this paper, we conduct an empirical study to assess the use of code identifiers to predict test flakiness. To do so, we first replicate most parts of the previous study of Pinto et al. (MSR 2020). We extend this replication by using a different ML platform (Scikit-learn, in Python) and by adding different learning algorithms to the analyses. Then, we validate the performance of the trained models using datasets with other flaky tests and from different projects. We successfully replicated the results of Pinto et al. (2020), with minor differences when using Scikit-learn; the additional algorithms performed similarly to the ones used previously. Concerning the validation, we noticed that the recall of the trained models was lower on these datasets, and the size of the decrease varied across classifiers. This was observed in both intra-project and inter-project test flakiness prediction.
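As a rough illustration of what "using code identifiers" can mean in practice, the sketch below splits the identifiers of a test body into lowercase word tokens, counts them as bag-of-words features, and computes recall, the metric discussed above. It is a minimal sketch under stated assumptions and not the pipeline used in the study: the sample test, the abbreviated keyword list, and the function names `identifier_tokens` and `recall` are all hypothetical, and the classifier itself is omitted.

```python
import re
from collections import Counter

# Hypothetical, abbreviated keyword list used only for this illustration.
KEYWORDS = {"public", "void", "new", "int", "return", "throws", "class", "static"}

def identifier_tokens(test_source):
    """Split the identifiers found in test code into lowercase word tokens."""
    identifiers = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", test_source)
    tokens = []
    for ident in identifiers:
        if ident in KEYWORDS:
            continue
        # Split on underscores first, then on camelCase and digit boundaries.
        for part in ident.split("_"):
            words = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+", part)
            tokens.extend(w.lower() for w in words)
    return tokens

def recall(y_true, y_pred):
    """Recall = TP / (TP + FN), where True marks a flaky test."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical test body, invented for this example.
sample = (
    "public void testAsyncJobTimeout() throws Exception "
    "{ Thread.sleep(1000); executor_service.submit(job); }"
)
features = Counter(identifier_tokens(sample))  # bag-of-words token counts
```

Counts such as `features` would then serve as the input vector for a classifier; with Scikit-learn, a comparable representation could be built with its text vectorizers, although the exact preprocessing used in the study is not stated here.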

[1]  Jeffrey C. Carver.  Towards Reporting Guidelines for Experimental Replications: A Proposal , 2010 .

[2]  Tao Xie,et al.  iFixFlakies: a framework for automatically fixing order-dependent flaky tests , 2019, ESEC/SIGSOFT FSE.

[3]  Hitesh Sajnani,et al.  A Study on the Lifecycle of Flaky Tests , 2020, 2020 IEEE/ACM 42nd International Conference on Software Engineering (ICSE).

[4]  Gautam Srivastava,et al.  Root causing, detecting, and fixing flaky tests: State of the art and future roadmap , 2020, Softw. Pract. Exp..

[5]  Ronnie E. S. Santos,et al.  Replication of Empirical Studies in Software Engineering: An Update of a Systematic Mapping Study , 2015, 2015 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM).

[6]  Suman Nath,et al.  Root causing flaky tests in a large-scale industrial setting , 2019, ISSTA.

[7]  Gaël Varoquaux,et al.  Scikit-learn: Machine Learning in Python , 2011, J. Mach. Learn. Res..

[8]  Ian H. Witten,et al.  Weka: Practical machine learning tools and techniques with Java implementations , 1999 .

[9]  Steve Counsell,et al.  The role and value of replication in empirical software engineering results , 2018, Inf. Softw. Technol..

[10]  Darko Marinov,et al.  An empirical analysis of flaky tests , 2014, SIGSOFT FSE.

[11]  Fabio Palomba,et al.  Understanding flaky tests: the developer’s perspective , 2019, ESEC/SIGSOFT FSE.

[12]  Fabio Q. B. da Silva,et al.  Replication of empirical studies in software engineering research: a systematic mapping study , 2012, Empirical Software Engineering.

[13]  Nachiappan Nagappan,et al.  Empirically Detecting False Test Alarms Using Association Rules , 2015, 2015 IEEE/ACM 37th IEEE International Conference on Software Engineering.

[14]  Jeffrey C. Carver,et al.  The role of replications in Empirical Software Engineering , 2008, Empirical Software Engineering.

[15]  Darko Marinov,et al.  DeFlaker: Automatically Detecting Flaky Tests , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[16]  Barbara A. Kitchenham,et al.  The role of replications in empirical software engineering—a word of warning , 2008, Empirical Software Engineering.

[17]  Petra Kaufmann,et al.  Experimental And Quasi Experimental Designs For Research , 2016 .

[18]  Lucila Ohno-Machado,et al.  Logistic regression and artificial neural network classification models: a methodology review , 2002, J. Biomed. Informatics.

[19]  Virginijus Marcinkevičius,et al.  Comparison of Naive Bayes, Random Forest, Decision Tree, Support Vector Machines, and Logistic Regression Classifiers for Text Reviews Classification , 2017, Balt. J. Mod. Comput..

[20]  John Micco,et al.  Taming Google-Scale Continuous Testing , 2017, 2017 IEEE/ACM 39th International Conference on Software Engineering: Software Engineering in Practice Track (ICSE-SEIP).

[21]  Christoph Treude,et al.  What is the Vocabulary of Flaky Tests? , 2020, 2020 IEEE/ACM 17th International Conference on Mining Software Repositories (MSR).

[22]  Wing Lam,et al.  iDFlakies: A Framework for Detecting and Partially Classifying Flaky Tests , 2019, 2019 12th IEEE Conference on Software Testing, Validation and Verification (ICST).

[23]  Tariq M. King,et al.  Towards a Bayesian Network Model for Predicting Flaky Automated Tests , 2018, 2018 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C).