Source Code Retrieval for Bug Localization Using Latent Dirichlet Allocation

In bug localization, a developer uses information about a bug to locate the portion of the source code to modify to correct the bug. Developers expend considerable effort performing this task. Some recent static techniques for automatic bug localization have been built around modern information retrieval (IR) models such as latent semantic indexing (LSI); however, latent Dirichlet allocation (LDA), a modular and extensible IR model, has significant advantages over both LSI and probabilistic LSI (pLSI). In this paper we present an LDA-based static technique for automating bug localization. We describe the implementation of our technique and three case studies that measure its effectiveness. For two of the case studies we directly compare our results to those from similar studies performed using LSI. The results demonstrate our LDA-based technique performs at least as well as the LSI-based techniques for all bugs and performs better, often significantly so, than the LSI-based techniques for most bugs.

[1]  Denys Poshyvanyk,et al.  Feature location via information retrieval based filtering of a single scenario execution trace , 2007, ASE.

[2]  Andrian Marcus,et al.  Static techniques for concept location in object-oriented code , 2005, 13th International Workshop on Program Comprehension (IWPC'05).

[3]  Ata Kabán,et al.  On an equivalence between PLSI and LDA , 2003, SIGIR.

[4]  Richard A. Harshman,et al.  Indexing by Latent Semantic Analysis , 1990, J. Am. Soc. Inf. Sci..

[5]  Hugo Zaragoza,et al.  Information Retrieval: Algorithms and Heuristics , 2002, Information Retrieval.

[6]  Yann-Gaël Guéhéneuc,et al.  Combining Probabilistic Ranking and Latent Semantic Indexing for Feature Identification , 2006, 14th IEEE International Conference on Program Comprehension (ICPC'06).

[7]  Andrian Marcus,et al.  Supporting program comprehension using semantic and structural information , 2001, Proceedings of the 23rd International Conference on Software Engineering. ICSE 2001.

[8]  Sushil Krishna Bajracharya,et al.  Mining concepts from code with probabilistic topic models , 2007, ASE.

[9]  Stéphane Ducasse,et al.  Enriching reverse engineering with semantic clustering , 2005, 12th Working Conference on Reverse Engineering (WCRE'05).

[10]  Tom Minka,et al.  Expectation-Propogation for the Generative Aspect Model , 2002, UAI.

[11]  Hausi A. Müller,et al.  Reverse engineering: a roadmap , 2000, ICSE '00.

[12]  Letha H. Etzkorn,et al.  Semantic software metrics computed from natural language design specifications , 2008, IET Softw..

[13]  Brad A. Myers,et al.  A Linguistic Analysis of How People Describe Software Problems , 2006, Visual Languages and Human-Centric Computing (VL/HCC'06).

[14]  David W. Binkley,et al.  Leveraged Quality Assessment using Information Retrieval Techniques , 2006, 14th IEEE International Conference on Program Comprehension (ICPC'06).

[15]  Santonu Sarkar,et al.  Mining business topics in source code using latent dirichlet allocation , 2008, ISEC '08.

[16]  Christos Tjortjis,et al.  Clustering data retrieved from Java source code to support software maintenance: a case study , 2005, Ninth European Conference on Software Maintenance and Reengineering.

[17]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[18]  Michael I. Jordan,et al.  Latent Dirichlet Allocation , 2001, J. Mach. Learn. Res..

[19]  Andrian Marcus,et al.  An information retrieval approach to concept location in source code , 2004, 11th Working Conference on Reverse Engineering.

[20]  Yann-Gaël Guéhéneuc,et al.  Feature Location Using Probabilistic Ranking of Methods Based on Execution Scenarios and Information Retrieval , 2007, IEEE Transactions on Software Engineering.

[21]  W. Bruce Croft,et al.  LDA-based document models for ad-hoc retrieval , 2006, SIGIR.

[22]  Mark Steyvers,et al.  Finding scientific topics , 2004, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Yann-Gaël Guéhéneuc,et al.  Feature identification: a novel approach and a case study , 2005, 21st IEEE International Conference on Software Maintenance (ICSM'05).

[24]  Andrea De Lucia,et al.  Working Session: Information Retrieval Based Approaches in Software Evolution , 2006, 2006 22nd IEEE International Conference on Software Maintenance.

[25]  Denys Poshyvanyk,et al.  Combining Formal Concept Analysis with Information Retrieval for Concept Location in Source Code , 2007, 15th IEEE International Conference on Program Comprehension (ICPC '07).