A Large-Scale Comparative Evaluation of IR-Based Tools for Bug Localization

This paper reports on a large-scale comparative evaluation of IR-based tools for automatic bug localization. We divide the tools in our evaluation into three generations: (1) the first-generation tools, now over a decade old, which are based purely on Bag-of-Words (BoW) modeling of software libraries; (2) the somewhat more recent second-generation tools, which augment BoW-based modeling with two additional pieces of information: historical data, such as change history, and structured information, such as class names and method names; and (3) the third-generation tools, currently the focus of much research, which also exploit proximity, order, and semantic relationships between terms. The original authors of these tools mostly tested them on relatively small datasets that typically consisted of no more than a few thousand bug reports. Additionally, those evaluations involved only Java code libraries. The goal of the present paper is to provide a comprehensive large-scale evaluation of all three generations of bug-localization tools on code libraries in multiple languages. Our study involves over 20,000 bug reports drawn from a diverse collection of Java, C/C++, and Python projects. Our results show that the third-generation tools are significantly superior to the older tools. We also show that word embeddings generated from code files written in one language are effective for retrieval from code libraries in other languages.
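To make the first-generation idea concrete, the following is a minimal sketch of BoW-based retrieval: the bug report is treated as a query, each source file as a document, and files are ranked by TF-IDF cosine similarity. It is an illustration only and not the implementation of any tool evaluated in this paper; the function names `tokenize` and `rank_files`, the smoothed IDF formula, and the toy file contents are all assumptions chosen for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into words, breaking camelCase identifiers."""
    spaced = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return re.findall(r"[a-z]+", spaced.lower())


def rank_files(bug_report, files):
    """Rank source files by TF-IDF cosine similarity to the bug report (hypothetical helper)."""
    docs = {name: Counter(tokenize(src)) for name, src in files.items()}
    n_docs = len(docs)
    # Document frequency: in how many files each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)

    def weigh(counts):
        # Smoothed IDF (an assumption) keeps weights positive for ubiquitous terms.
        return {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = weigh(Counter(tokenize(bug_report)))
    scores = {name: cosine(query, weigh(counts)) for name, counts in docs.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy code base and bug report, invented for illustration.
files = {
    "FileCache.java": "class FileCache { void evictEntry() { cacheMap.remove(key); } }",
    "HttpClient.java": "class HttpClient { void sendRequest() { socket.connect(host); } }",
}
bug = "NullPointerException when cache evicts an entry"
ranking = rank_files(bug, files)
print(ranking[0][0])
```

Because a pure BoW model ignores term order and proximity, it cannot distinguish a file that mentions "cache" and "entry" together from one that mentions them far apart; this is the kind of information the third-generation tools additionally exploit.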
