Leveraging textual properties of bug reports to localize relevant source files

Abstract: Bug reports are an essential part of a software project's life cycle, since resolving them improves the project's quality. When a new bug report arrives, developers usually need to reproduce the bug and review the code to locate the fault before it can be assigned and fixed. However, the large number of bug reports and the growing size of software projects make this process tedious and time-consuming. To address this issue, bug localization techniques rank all the source files of a project by how likely they are to contain the reported bug. This ranking narrows the search space and helps developers find the relevant source files more quickly. In this paper, we propose a multi-component bug localization approach that leverages different textual properties of bug reports and source files, as well as the relations between previously fixed bug reports and a newly received one. Our approach combines information retrieval, textual matching, stack trace analysis, and multi-label classification to improve bug localization performance. We evaluate the proposed approach on three open source software projects (AspectJ, SWT, and ZXing). The results show that it ranks a relevant source file first for more than 52% of bugs and within the top ten recommended files for 78% of bugs. It also improves the mean reciprocal rank (MRR) and mean average precision (MAP) compared to several existing state-of-the-art bug localization approaches.
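
The abstract reports results as top-1 and top-10 hit rates together with MRR and MAP. As a minimal sketch of how such metrics are conventionally computed from ranked file lists, the following uses the standard textbook definitions; it is not the paper's evaluation code, and the function names, file names, and bug identifiers are all hypothetical.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """Return 1 / rank of the first relevant file, or 0.0 if none is ranked."""
    for position, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision values taken at each rank holding a relevant file."""
    hits = 0
    total = 0.0
    for position, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def evaluate(
    rankings: Dict[str, List[str]],
    fixed_files: Dict[str, Set[str]],
    k: int,
) -> Dict[str, float]:
    """Compute the top-k hit rate, MRR, and MAP over all bug reports."""
    n = len(rankings)
    # A bug counts as a top-k hit if any of its fixed files is in the first k.
    hits_at_k = sum(
        1 for bug, ranked in rankings.items() if set(ranked[:k]) & fixed_files[bug]
    )
    mrr = sum(reciprocal_rank(r, fixed_files[b]) for b, r in rankings.items()) / n
    map_score = sum(average_precision(r, fixed_files[b]) for b, r in rankings.items()) / n
    return {"top_k": hits_at_k / n, "mrr": mrr, "map": map_score}


# Hypothetical rankings for two bug reports (illustrative data only).
rankings = {
    "bug-1": ["A.java", "B.java", "C.java"],
    "bug-2": ["B.java", "A.java", "C.java"],
}
# Hypothetical ground truth: files actually changed to fix each bug.
fixed_files = {
    "bug-1": {"A.java"},
    "bug-2": {"A.java", "C.java"},
}

scores = evaluate(rankings, fixed_files, k=1)
print(scores)
```

In this toy data, only the first bug has a fixed file at rank one, so the top-1 hit rate is 0.5; the second bug's first relevant file sits at rank two, which gives an MRR of 0.75.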
