
Many automated tasks in software maintenance rely on information retrieval (IR) techniques to identify specific information within unstructured data. Bug localization is a typical example: the text of a bug report is analyzed to identify the source code files that can be associated with the reported bug. Unfortunately, despite the promising results reported in the literature, the performance of IR-based bug localization tools is still not sufficient for wide adoption. We argue that one reason could be the community's attempt to build a “one-size-fits-all” approach to bug localization, without fully addressing the differences in available information that may exist among bug reports and across project source code files. In this paper, we first extensively study the performance of state-of-the-art bug localization tools, focusing on query formulation (i.e., which bug report features should be compared against which features of source code files) and its importance for localization performance. Building on insights from this study, we propose a new learning approach in which multiple classifier models are trained on clear-cut sets of bug-location pairs. Concretely, we apply gradient boosting supervised learning to various sets of bug reports whose localization appears to succeed with specific types of features. The training scenario builds on our finding that the various state-of-the-art localization tools (and hence the similarity features that they leverage) can perform very well on specific sets of bug reports. We implement D&C, a multi-classifier approach that computes the weights to be assigned to the similarity measurements between pairs of information token types (from the bug report and from the source code). Experimental results on large and up-to-date datasets reveal that D&C outperforms state-of-the-art tools.
On average, the validation experiments yield a MAP score of 0.52 and an MRR score of 0.63 on a curated dataset. Comparison against the state of the art shows that D&C improves MAP and MRR over all tools: MAP improves by 4 to 10 percentage points, and MRR by 1 to 12. Finally, we note that D&C is stable in its localization performance: around 50% of bugs can be located at Top1, 77% at Top5 and 85% at Top10.
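
For readers less familiar with the metrics quoted above, the following is a minimal sketch of how MAP, MRR and Top-k are commonly computed over ranked lists of candidate files. It is an illustration only, not the evaluation code used for D&C: the function names and the toy file names are hypothetical, and average precision is normalized here by the number of truly buggy files, which is one common convention among several.

```python
from typing import Dict, List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average of precision@rank taken at each rank where a buggy file appears.

    Assumption: normalized by the total number of buggy files for the report.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first buggy file in the list, or 0.0 if none is ranked."""
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: List[str], relevant: Set[str], k: int) -> bool:
    """True if at least one buggy file appears among the first k suggestions."""
    return any(path in relevant for path in ranked[:k])


def evaluate(results: List[Tuple[List[str], Set[str]]]) -> Dict[str, float]:
    """Aggregate the per-report scores over all bug reports."""
    n = len(results)
    if n == 0:
        raise ValueError("no bug reports to evaluate")
    scores = {
        "MAP": sum(average_precision(r, rel) for r, rel in results) / n,
        "MRR": sum(reciprocal_rank(r, rel) for r, rel in results) / n,
    }
    for k in (1, 5, 10):
        scores[f"Top{k}"] = sum(hit_at_k(r, rel, k) for r, rel in results) / n
    return scores


# Toy data (hypothetical): each entry is (ranked suggestions, truly buggy files).
toy_results = [
    (["A.java", "B.java", "C.java"], {"A.java", "C.java"}),
    (["X.java", "Y.java", "Z.java"], {"Y.java"}),
]
toy_scores = evaluate(toy_results)
```

Under this reading, a Top1 value of 0.5 means that for half of the bug reports the first suggested file is one of the files actually changed by the fix, while MRR rewards placing the first correct file as early as possible.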
