Discerning Legitimate Failures From False Alerts: A Study of Chromium's Continuous Integration

Flakiness is a major concern in software testing. Flaky tests pass and fail for the same version of a program, misleading developers who spend time and resources investigating test failures only to discover that they are false alerts. In practice, the de facto approach to addressing this concern is to rerun failing tests in the hope that they pass and thus reveal themselves as false alerts. Nonetheless, completely filtering out false alerts may require a disproportionate number of reruns and thus incurs significant costs in both computation and time. As an alternative to reruns, we propose Fair, a novel lightweight approach that classifies test failures into false alerts and legitimate failures. Fair relies on a classifier and a set of features extracted from the failures and test artefacts. To build and evaluate our machine-learning classifier, we use the continuous integration of the Chromium project. In particular, we collect the properties and artefacts of more than one million test failures from 2,000 builds. Our results show that Fair can accurately distinguish legitimate failures from false alerts, with an MCC of up to 95%. Moreover, by studying different test categories (GUI, integration, and unit tests), we show that Fair classifies failures accurately even when the number of failures is limited. Finally, we compare the cost of our approach to that of reruns and show that Fair could save up to 20 minutes of computation time per build.
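As a rough illustration of the kind of pipeline the abstract describes, the sketch below trains a binary classifier on failure-level features and scores it with the Matthews correlation coefficient (MCC), the paper's headline metric. The feature columns, the model choice (a random forest), and the synthetic data are illustrative assumptions, not the paper's actual features or implementation.

```python
# Minimal sketch of a Fair-style failure classifier (hypothetical features;
# the paper's exact feature set and model are not reproduced here).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in feature matrix: one row per test failure. In the paper, features
# come from failure properties and test artefacts; these columns are fake.
X = rng.random((1_000, 5))
# Label: 1 = legitimate failure, 0 = false alert (flaky failure).
y = (X[:, 0] + 0.2 * rng.standard_normal(1_000) > 0.5).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# MCC is robust to class imbalance, which matters here because false
# alerts can vastly outnumber legitimate failures in CI builds.
print(f"MCC: {matthews_corrcoef(y_test, clf.predict(X_test)):.2f}")
```

A classifier like this replaces reruns at prediction time: instead of re-executing a failing test several times, the build system scores the failure once and flags it as a likely false alert or a legitimate failure.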
