Performance evaluation of VSM and LSI models to determine bug reports similarity

The number of bug reports filed against open source software systems is increasing exponentially. One reason is that reporters do not search the bug repository before submitting a new report, so similar bugs may already have been reported. These fall into two groups: reports that are exact duplicates, and reports that are semantically similar, meaning they may concern the same software component or files. The information in previously reported similar bugs can help in fixing and resolving newly reported ones. In this paper, we apply two information retrieval (IR) models, the Vector Space Model (VSM) and Latent Semantic Indexing (LSI), to retrieve existing similar bug reports, and we compare their performance. The performance of the two models is evaluated on the top ten results each retrieves for relevant bug reports. Experiments were conducted on 106 bug reports from three components of the Google Chrome browser. The results show that LSI performs better than VSM in most cases.
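
To make the retrieval step concrete, the following is a minimal sketch of VSM-style top-k retrieval, assuming TF-IDF term weighting and cosine similarity. The abstract does not state the weighting scheme or preprocessing used in the paper, so both are assumptions here, and the function names (`tokenize`, `tfidf_vectors`, `top_k_similar`) and the four sample reports are hypothetical, written only for illustration. An LSI variant would additionally project these vectors onto a low-rank space obtained by a truncated singular value decomposition before computing the cosine; that step is omitted.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumeric characters (assumed preprocessing).
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    # Build one sparse TF-IDF vector (term -> weight) per document.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(toks).items()}
            for toks in tokenized]


def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_similar(query_index, docs, k=10):
    # Rank every other report by similarity to the query report.
    vecs = tfidf_vectors(docs)
    query = vecs[query_index]
    scored = [(cosine(query, v), i)
              for i, v in enumerate(vecs) if i != query_index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(i, score) for score, i in scored[:k]]


# Made-up report summaries, not taken from the paper's data set.
reports = [
    "Browser crashes when opening a new tab",
    "Crash on new tab page after update",
    "Bookmarks bar does not sync across devices",
    "Sync fails for bookmarks on second device",
]
ranking = top_k_similar(0, reports, k=3)
```

Here `ranking` lists the indices of the other reports with their similarity to the first one, most similar first. Note that "crashes" and "crash" do not match without stemming, which illustrates the vocabulary mismatch that LSI is intended to reduce.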
