D&C: A Divide-and-Conquer Approach to IR-based Bug Localization

Many automated tasks in software maintenance rely on information retrieval (IR) techniques to identify specific information within unstructured data. Bug localization is a typical example: the text of a bug report is analyzed to identify the source code files that can be associated with the reported bug. Despite promising results, the performance of IR-based bug localization tools is still not sufficient for wide adoption. We argue that one reason could be the attempt to build a one-size-fits-all approach. In this paper, we extensively study the performance of state-of-the-art bug localization tools, focusing on query formulation and its importance for localization performance. Building on insights from this study, we propose a new learning approach where multiple classifier models are trained on clear-cut sets of bug-location pairs. Concretely, we apply gradient boosting, a supervised learning technique, to various sets of bug reports whose localization appears to succeed with specific types of features. This training scenario builds on our finding that the various state-of-the-art localization tools can perform well on specific sets of bug reports. We implement D&C, which computes the weights to assign to the similarity measurements between pairs of information token types. Experimental results on large and up-to-date datasets reveal that D&C outperforms state-of-the-art tools. On average, the experiments yield a MAP score of 0.52 and an MRR score of 0.63 on a curated dataset, a substantial improvement over all tools: MAP and MRR improve by 4 to 10 and by 1 to 12 percentage points, respectively. Finally, we note that D&C is stable in its localization performance: around 50% of bugs can be located at Top1, 77% at Top5 and 85% at Top10.
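For readers unfamiliar with the reported metrics, the following is a minimal sketch of how MAP, MRR and Top-k are commonly computed from ranked file lists. It is not taken from the D&C implementation: the function names, the toy file names and the choice to normalize average precision by the number of truly buggy files are assumptions made for illustration, and some bug localization studies normalize differently.

```python
from typing import Dict, List, Set, Tuple


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first buggy file in the ranking, or 0 if none appears."""
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a buggy file is found.

    Assumption: the sum is divided by the total number of buggy files.
    """
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, path in enumerate(ranked, start=1):
        if path in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def top_k_hit(ranked: List[str], relevant: Set[str], k: int) -> bool:
    """True if at least one buggy file appears in the first k results."""
    return any(path in relevant for path in ranked[:k])


def evaluate(results: List[Tuple[List[str], Set[str]]]) -> Dict[str, float]:
    """Aggregate the per-bug scores over all bug reports."""
    n = len(results)
    scores = {
        "MAP": sum(average_precision(r, g) for r, g in results) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in results) / n,
    }
    for k in (1, 5, 10):
        scores[f"Top{k}"] = sum(top_k_hit(r, g, k) for r, g in results) / n
    return scores


# Hypothetical toy data: (ranked files returned by a tool, files actually fixed).
toy_results = [
    (["A.java", "B.java", "C.java"], {"A.java"}),
    (["X.java", "Y.java", "Z.java"], {"Y.java", "Z.java"}),
]
metrics = evaluate(toy_results)
print(metrics)
```

On this toy input, the first bug is located at rank 1 and the second at rank 2, so MRR is 0.75 and Top1 is 0.5, while Top5 and Top10 are both 1.0.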
