Detection of Duplicate Defect Reports Using Natural Language Processing

Defect reports are generated from various testing and development activities in software engineering. Sometimes two reports are submitted that describe the same problem, leading to duplicate reports. These reports are mostly written in structured natural language, which makes it hard to compare two reports for similarity with formal methods. To support the identification of duplicates, we investigate the use of natural language processing (NLP) techniques. A prototype tool is developed and evaluated in a case study analyzing defect reports at Sony Ericsson Mobile Communications. The evaluation shows that about two-thirds of the duplicates can possibly be found using the NLP techniques. Different variants of the techniques yield only minor differences in results, indicating a robust technology. User testing shows that the overall attitude towards the technique is positive and that it has growth potential.
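
The abstract does not say which NLP techniques the prototype uses, so the following is only a minimal sketch of one common way to compare free-text reports: represent each report as a bag-of-words term-frequency vector and rank existing reports by cosine similarity to a new one. The function names (`tokenize`, `cosine_similarity`, `rank_candidates`), the report identifiers, and the report texts are all hypothetical and invented for illustration; they are not taken from the study.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty report is similar to nothing
    return dot / (norm_a * norm_b)


def rank_candidates(new_report, existing_reports, top_n=3):
    """Return the top_n (report_id, score) pairs most similar to new_report."""
    query = Counter(tokenize(new_report))
    scored = [
        (report_id, cosine_similarity(query, Counter(tokenize(text))))
        for report_id, text in existing_reports.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical reports, invented for illustration only.
reports = {
    "DR-1": "Phone crashes when opening the camera application",
    "DR-2": "Battery indicator shows wrong level after charging",
    "DR-3": "Camera application crashes the phone on start",
}
ranking = rank_candidates("Crash when starting camera application", reports)
print(ranking)
```

In this sketch the new report is ranked closest to DR-1 and shares no terms with DR-2. Because the tokens are compared literally, "crash" and "crashes" do not match; normalizing word forms before comparison would be one natural variant to try, though whether the prototype does so is not stated here.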
