Automated duplicate detection for bug tracking systems

Bug tracking systems are important tools that guide the maintenance activities of software developers. The utility of these systems is hampered by an excessive number of duplicate bug reports; in some projects, as many as a quarter of all reports are duplicates. Developers must identify duplicate reports manually, a time-consuming process that adds to the already high cost of software maintenance. To save developer time, we propose a system that automatically classifies duplicate bug reports as they arrive. The system uses surface features, textual semantics, and graph clustering to predict duplicate status. Using a dataset of 29,000 bug reports from the Mozilla project, we perform experiments that include a simulation of a real-time bug reporting environment. Our system reduces development cost by filtering out 8% of duplicate bug reports while still allowing at least one report for each real defect to reach developers.
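To make the textual-similarity component concrete, the following is a minimal sketch and not the system described above. It assumes a smoothed TF-IDF weighting with cosine similarity as a stand-in for "textual semantics", and it leaves out surface features and graph clustering entirely. The function names and the 0.3 threshold are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a report and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(documents):
    """Return one sparse TF-IDF vector (a dict) per document.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1. The source
    does not say which weighting its system uses.
    """
    token_lists = [tokenize(doc) for doc in documents]
    n_docs = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(new_report, existing_reports, threshold=0.3):
    """Return (index, score) of the closest existing report, or None.

    The threshold is a hypothetical cutoff; a real system would tune it
    on labeled duplicates.
    """
    if not existing_reports:
        return None
    vectors = tfidf_vectors(list(existing_reports) + [new_report])
    new_vec = vectors[-1]
    scores = [cosine(new_vec, vec) for vec in vectors[:-1]]
    best = max(range(len(scores)), key=scores.__getitem__)
    if scores[best] < threshold:
        return None
    return best, scores[best]
```

Calling `most_similar` with an incoming report and the list of open reports returns the index and score of the likeliest original, or `None` when nothing is similar enough. A conservative threshold fits the stated goal that at least one report for each real defect still reaches developers: a report that is wrongly filtered out costs more than a duplicate that slips through.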
