Empirically Detecting False Test Alarms Using Association Rules

Applying code changes to software systems and testing these code changes can be a complex task that involves many different types of software testing strategies, e.g. system and integration tests. However, not all test failures reported during code integration indicate code defects. Testing large systems such as the Microsoft Windows operating system requires complex test infrastructures, which may lead to test failures caused by faulty tests and test infrastructure issues. Such false test alarms are particularly annoying as they demand engineers' attention and require manual inspection without providing any benefit. The goal of this work is to use empirical data to minimize the number of false test alarms reported during system and integration testing. To achieve this goal, we use association rule learning to identify patterns among failing test steps that are typical for false test alarms and can be used to automatically classify them. A successful classification of false test alarms is particularly valuable for product teams, as manual test failure inspection is an expensive and time-consuming process that not only costs engineering time and money but also slows down product development. We evaluated our approach on system and integration tests executed during Windows 8.1 and Microsoft Dynamics AX development. Performing more than 10,000 classifications for each product, our model achieves a mean precision between 0.85 and 0.90 while predicting between 34% and 48% of all false test alarms.
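
The abstract does not spell out the mining setup in detail; the following is a minimal sketch of the general idea, assuming each historical test failure is represented as a transaction of failing test step names plus a label marking whether it was later resolved as a false alarm. The step names, dataset, thresholds, and the use of the mlxtend library are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: mining association rules that flag likely false test alarms.
# Each past test run is a "transaction" of failing test step names, plus the
# label "FALSE_ALARM" when the failure was resolved without a code fix.
# All names and thresholds below are hypothetical.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical training data: failing test steps per historical test run.
transactions = [
    ["setup_vm", "copy_binaries", "FALSE_ALARM"],
    ["setup_vm", "copy_binaries", "FALSE_ALARM"],
    ["setup_vm", "network_share_mount", "FALSE_ALARM"],
    ["run_kernel_suite", "verify_results"],   # genuine code defect
    ["run_kernel_suite", "copy_binaries"],    # genuine code defect
]

# One-hot encode the transactions so the itemset miner can consume them.
encoder = TransactionEncoder()
onehot = pd.DataFrame(encoder.fit(transactions).transform(transactions),
                      columns=encoder.columns_)

# Mine frequent itemsets, then keep only rules whose consequent is FALSE_ALARM.
itemsets = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
false_alarm_rules = rules[
    rules["consequents"].apply(lambda c: c == frozenset({"FALSE_ALARM"}))
]

def classify(failing_steps, rules_df):
    """Flag a new test failure as a false alarm if the antecedent of any
    learned rule (a set of failing test steps) is contained in it."""
    steps = set(failing_steps)
    return any(antecedent <= steps for antecedent in rules_df["antecedents"])

print(classify(["setup_vm", "copy_binaries"], false_alarm_rules))        # True
print(classify(["run_kernel_suite", "verify_results"], false_alarm_rules))  # False
```

In this toy setting, rules such as {setup_vm} -> FALSE_ALARM exceed the confidence threshold, so new failures matching that step pattern are classified as false alarms, while failures with unseen step patterns fall through to manual inspection.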
