An empirical analysis of flaky tests

Regression testing is a crucial part of software development. It checks that software changes do not break existing functionality. An important assumption of regression testing is that test outcomes are deterministic: an unmodified test is expected to either always pass or always fail for the same code under test. Unfortunately, in practice, some tests, often called flaky tests, have non-deterministic outcomes. Such tests undermine regression testing because they make it difficult to rely on test results. We present the first extensive study of flaky tests. We study in detail a total of 201 commits that likely fix flaky tests in 51 open-source projects. We classify the most common root causes of flaky tests, identify approaches that could manifest flaky behavior, and describe common strategies that developers use to fix flaky tests. We believe that our insights and implications can help guide future research on the important topic of (avoiding) flaky tests.
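
To make the notion of non-deterministic outcomes concrete, the sketch below is a hypothetical JUnit 4 example (not drawn from the studied commits) of an asynchronous-wait pattern that commonly produces flaky behavior: the test waits a fixed amount of time for a background task whose duration varies between runs.

```java
import static org.junit.Assert.assertTrue;

import java.util.concurrent.ThreadLocalRandom;

import org.junit.Test;

// Hypothetical illustration of a flaky test: the outcome depends on thread
// scheduling and timing, so the same unmodified test can pass on one run
// and fail on the next.
public class FlakyAsyncTest {

    static class Worker implements Runnable {
        volatile boolean done = false;

        @Override
        public void run() {
            try {
                // Simulated work whose duration varies from run to run.
                Thread.sleep(ThreadLocalRandom.current().nextLong(20, 60));
            } catch (InterruptedException ignored) {
                Thread.currentThread().interrupt();
            }
            done = true;
        }
    }

    @Test
    public void resultIsReadyAfterFixedWait() throws Exception {
        Worker worker = new Worker();
        new Thread(worker).start();

        // Asynchronous-wait anti-pattern: a fixed sleep instead of waiting
        // for the actual completion condition. Sometimes 40 ms is enough,
        // sometimes it is not.
        Thread.sleep(40);

        assertTrue("worker should have finished", worker.done);
    }
}
```

A common repair for this pattern is to replace the fixed sleep with an explicit wait on the completion condition, for example polling `worker.done` with a timeout, so the assertion no longer depends on how long the background work happens to take.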
