Amalgamated Models for Detecting Duplicate Bug Reports

Automatic identification of duplicate bug reports is a critical research problem in mining software repositories. This paper proposes and compares amalgamated models for detecting duplicate bug reports using both textual and non-textual information from the reports. Four models (LDA, TF-IDF, GloVe, and Word2Vec) and their amalgamations are used to rank bug reports by their similarity to one another. The amalgamated score is produced by aggregating the ranks generated by the individual models. The empirical evaluation is performed on open datasets from large open-source software projects, using mean average precision (MAP), mean reciprocal rank (MRR), and recall rate as metrics. The experimental results show that the amalgamated model TF-IDF + Word2Vec + LDA outperforms the other amalgamated models for duplicate bug recommendation. The results also indicate that amalgamating Word2Vec with TF-IDF works better than amalgamating GloVe with TF-IDF. Future work is to develop a Python package that lets the user select individual models and amalgamate them with other models on a given dataset.
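
The rank aggregation and the evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the exact aggregation rule, so the sketch assumes the amalgamated score is the mean of the per-model ranks. The function names, the candidate identifiers (B1 to B4), and the similarity values are all hypothetical and serve only as illustration.

```python
from statistics import mean

def scores_to_ranks(scores):
    """Map candidate id -> rank (1 = most similar) for a single model."""
    ordered = sorted(scores, key=lambda c: (-scores[c], c))
    return {cand: pos for pos, cand in enumerate(ordered, start=1)}

def amalgamate(per_model_scores):
    """Order candidates by mean rank across models.

    ASSUMPTION: mean rank is used as the aggregation rule; the abstract
    only says that ranks are aggregated. Ties are broken by candidate id.
    """
    rankings = [scores_to_ranks(s) for s in per_model_scores]
    candidates = list(rankings[0])
    mean_rank = {c: mean(r[c] for r in rankings) for c in candidates}
    return sorted(candidates, key=lambda c: (mean_rank[c], c))

def reciprocal_rank(ranked, duplicates):
    """1 / position of the first true duplicate, or 0.0 if none is found."""
    for pos, cand in enumerate(ranked, start=1):
        if cand in duplicates:
            return 1.0 / pos
    return 0.0

def average_precision(ranked, duplicates):
    """Mean of precision@i over the positions i that hold a true duplicate."""
    hits, total = 0, 0.0
    for pos, cand in enumerate(ranked, start=1):
        if cand in duplicates:
            hits += 1
            total += hits / pos
    return total / len(duplicates) if duplicates else 0.0

def recall_at_k(ranked, duplicates, k):
    """1.0 if any true duplicate appears in the top k, else 0.0."""
    return 1.0 if any(c in duplicates for c in ranked[:k]) else 0.0

# Hypothetical similarity scores of four candidate reports to one query report.
tfidf = {"B1": 0.82, "B2": 0.40, "B3": 0.65, "B4": 0.10}
word2vec = {"B1": 0.70, "B2": 0.75, "B3": 0.60, "B4": 0.20}
lda = {"B1": 0.55, "B2": 0.30, "B3": 0.90, "B4": 0.25}

ranked = amalgamate([tfidf, word2vec, lda])
true_duplicates = {"B3"}  # hypothetical ground truth for this query

print(ranked)
print(reciprocal_rank(ranked, true_duplicates))
print(average_precision(ranked, true_duplicates))
print(recall_at_k(ranked, true_duplicates, k=2))
```

These functions score a single query report. Averaging `reciprocal_rank`, `average_precision`, and `recall_at_k` over all query reports yields MRR, MAP, and recall rate at k, respectively.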
