A Comparative Study of the Performance of IR Models on Duplicate Bug Detection

Open source projects rely on bug triagers to assign incoming bug reports to developers. One of a triager's tasks is to identify whether an incoming bug report is a duplicate of an existing report. To detect duplicate bug reports, a triager relies either on their own memory and experience or on the search capabilities of the bug repository. Both approaches can be time-consuming for the triager and may lead to duplicates being misidentified. Several approaches to automating duplicate bug report detection have been proposed in the literature. However, there has been no exhaustive comparison of the performance of different IR models, especially topic-based ones such as LSI and LDA. In this paper, we compare the performance of the traditional vector space model (using different weighting schemes) with that of topic-based models, leveraging heuristics that incorporate exception stack frames, surface features, and the summary and long description from the free-form text in the bug report. We perform experiments on subsets of bug reports from Eclipse and Firefox and achieve recall rates of 60% and 58%, respectively. We find that word-based models, in particular a Log-Entropy based weighting scheme, outperform topic-based ones such as LSI, LDA and Random Projections. Our findings also suggest that, for the problem of duplicate bug detection, it is important to consider a project's domain and characteristics when devising a set of heuristics to achieve optimal results.
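To make the word-based approach concrete, the following is a minimal sketch of a vector space model with Log-Entropy weighting, ranking candidate duplicates by cosine similarity. It is an illustration under stated assumptions, not the paper's implementation: the function names (`log_entropy_vectors`, `cosine`, `rank_candidates`) and the toy token lists are hypothetical, and the sketch uses the common textbook form of the weight (local weight log(1 + tf), global weight 1 + Σ p log p / log N), which may differ in detail from the variant used in the experiments.

```python
import math
from collections import Counter


def log_entropy_vectors(docs):
    """Return one {term: weight} dict per document.

    docs is a list of token lists (hypothetical pre-tokenized bug reports).
    """
    n_docs = len(docs)
    counts = [Counter(tokens) for tokens in docs]

    # Total frequency of each term across the whole collection.
    global_freq = Counter()
    for c in counts:
        global_freq.update(c)

    # Global weight: 1 + sum_j(p_ij * log p_ij) / log N, with p_ij = tf_ij / gf_i.
    # Terms spread evenly over all documents get a weight near 0.
    global_weight = {}
    for term, gf in global_freq.items():
        entropy_sum = 0.0
        for c in counts:
            tf = c.get(term, 0)
            if tf:
                p = tf / gf
                entropy_sum += p * math.log(p)
        if n_docs > 1:
            global_weight[term] = 1.0 + entropy_sum / math.log(n_docs)
        else:
            global_weight[term] = 1.0  # assumption: no entropy scaling for one document

    # Local weight: log(1 + tf).
    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in c.items()}
        for c in counts
    ]


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_candidates(docs, query_index):
    """Rank all other reports by similarity to the report at query_index."""
    vectors = log_entropy_vectors(docs)
    query = vectors[query_index]
    scores = [
        (i, cosine(query, vec))
        for i, vec in enumerate(vectors)
        if i != query_index
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy, made-up reports; not taken from Eclipse or Firefox.
    reports = [
        ["crash", "on", "save", "npe"],
        ["npe", "crash", "when", "saving", "file"],
        ["toolbar", "icon", "missing"],
    ]
    for index, score in rank_candidates(reports, 0):
        print(index, round(score, 3))
```

In a duplicate-detection setting, a recall rate could then be estimated as the fraction of known duplicates whose original report appears within the top-k ranked candidates; the choice of k is left open here because the abstract does not state it.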
