Quantifying Flakiness and Minimizing Its Effects on Software Testing

Title of dissertation: Quantifying Flakiness and Minimizing Its Effects on Software Testing
Zebao Gao, Doctor of Philosophy, 2017
Dissertation directed by: Professor Atif Memon, Department of Computer Science

In software testing, test inputs are passed into a system under test (SUT); the SUT is executed; and a test oracle checks the outputs against expected values. In some cases the same test case, executed repeatedly on the same code of the SUT, passes in some runs and fails in others. This is the test flakiness problem, and such test cases are called flaky tests. Flakiness makes test results and testing techniques unreliable: flaky tests may be mistakenly labeled as failed, which increases both the number of reported bugs testers need to check and the chance of missing real faults. The problem is gaining attention in modern software testing practice, where complex interactions are involved in test execution, and it raises several new challenges: What metrics should be used to measure the flakiness of a test case? What factors cause or influence flakiness? And how can the effects of flakiness be reduced or minimized?

This research develops a systematic approach to quantitatively analyze and minimize the effects of flakiness, and makes three major contributions. First, a novel entropy-based metric is introduced to quantify the flakiness of different layers of test outputs (such as code coverage, invariants, and GUI state). Second, the impact of a common set of factors on test results in system interactive testing is examined. Third, a new flake filter is introduced to minimize the impact of flakiness by filtering out flaky tests (and test assertions) while retaining bug-revealing ones.

Two empirical studies on five open-source applications evaluate the new entropy measure, study the causes of flakiness, and assess the usefulness of the flake filter. The first study empirically analyzes the impact of factors including the system platform, Java version, application initial state, and tool-harness configuration. The results show a large impact on the SUTs when these factors were left uncontrolled, with as many as 184 lines of code coverage differing between runs of the same test cases, and up to 96% false positives with respect to fault detection. The second study evaluates the effectiveness of the flake filter on the SUTs' real faults. The results show that 3.83% of flaky assertions can affect 88.59% of test cases, and that it is possible to automatically obtain a flake filter that, in some cases, completely eliminates flakiness without compromising fault-detection ability.
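The abstract does not spell out the entropy metric's definition. As a minimal sketch of the underlying idea, and under assumed names (FlakinessEntropy, string-serialized outputs) rather than the dissertation's actual formulation, the Java fragment below computes the Shannon entropy of one output layer, such as a coverage vector or a GUI-state snapshot, observed across repeated runs of the same test case: zero entropy means the layer never varied, and larger values mean more disagreement between runs.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Illustrative sketch (not the dissertation's exact metric): Shannon entropy
 * of the outputs one test case produced over N repeated runs. An output can be
 * any serialized layer -- a coverage vector, a set of invariants, a GUI state.
 * Entropy 0 means every run produced the same output (no flakiness in that
 * layer); higher values mean the output varied more across runs.
 */
public final class FlakinessEntropy {

    /** outputsPerRun holds one serialized output per run of the same test case. */
    public static double entropy(List<String> outputsPerRun) {
        Map<String, Integer> counts = new HashMap<>();
        for (String output : outputsPerRun) {
            counts.merge(output, 1, Integer::sum);
        }
        double n = outputsPerRun.size();
        double h = 0.0;
        for (int count : counts.values()) {
            double p = count / n;
            h -= p * (Math.log(p) / Math.log(2)); // log base 2
        }
        return h; // in bits; ranges from 0 to log2(n)
    }

    public static void main(String[] args) {
        // Three runs with identical coverage -> entropy 0.0 (stable layer).
        System.out.println(entropy(List.of("cov:1,2,3", "cov:1,2,3", "cov:1,2,3")));
        // Runs that disagree -> entropy > 0 (flaky in this layer).
        System.out.println(entropy(List.of("cov:1,2,3", "cov:1,2", "cov:1,2,3")));
    }
}
```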
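Likewise, the flake filter's construction is only summarized above. The following sketch, again under assumed names and a simplified design rather than the dissertation's actual algorithm, marks an assertion as flaky if its verdict varies across repeated calibration runs on unchanged code, and then masks those assertions when deciding whether a later test run should be reported as a failure.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Minimal flake-filter sketch (assumed design, not the dissertation's exact
 * algorithm): assertions whose pass/fail verdict varies across repeated runs
 * of the same test on unchanged code are marked flaky and ignored when the
 * test is later judged against a modified version of the SUT.
 */
public final class FlakeFilter {

    // assertionId -> set of verdicts seen on the unchanged SUT (true = pass).
    private final Map<String, Set<Boolean>> observed = new HashMap<>();

    /** Record one run's assertion verdicts from a calibration run on unchanged code. */
    public void recordCalibrationRun(Map<String, Boolean> assertionVerdicts) {
        assertionVerdicts.forEach((id, verdict) ->
                observed.computeIfAbsent(id, k -> new HashSet<>()).add(verdict));
    }

    /** An assertion is flaky if it produced both pass and fail on unchanged code. */
    public boolean isFlaky(String assertionId) {
        Set<Boolean> verdicts = observed.get(assertionId);
        return verdicts != null && verdicts.size() > 1;
    }

    /** A test run is reported as failing only if some non-flaky assertion failed. */
    public boolean testFails(Map<String, Boolean> assertionVerdicts) {
        return assertionVerdicts.entrySet().stream()
                .anyMatch(e -> !e.getValue() && !isFlaky(e.getKey()));
    }
}
```

In use, the suite would be executed several times on a baseline build to populate the filter; runs against a modified build are then judged with testFails, so only stable assertion failures are surfaced as potential faults.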
