An incremental update framework for efficient retrieval from software libraries for bug localization

Information Retrieval (IR) based bug localization techniques use bug reports as queries against a software repository to retrieve the relevant source files. These techniques index the source files in the repository and train a model, which is then queried for retrieval. Much current research focuses on improving the retrieval effectiveness of these methods. However, little consideration has been given to their efficiency for software repositories that are constantly evolving. As the repository evolves, index creation and model learning have to be repeated to ensure accurate retrieval for each new bug. Doing so can make query latency unreasonably high, and re-computing the index and the model for files that did not change is computationally redundant. We propose an incremental update framework that continuously updates the index and the model using the changes made at each commit. We demonstrate that the same retrieval accuracy can be achieved in a fraction of the time needed by current approaches. Our results are based on two basic IR modeling techniques: the Vector Space Model (VSM) and the Smoothed Unigram Model (SUM). The dataset used in our validation experiments was created by tracking the commit history of the AspectJ and JodaTime software libraries over a span of 10 years.
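The idea of updating the index per commit, rather than rebuilding it, can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `IncrementalVSMIndex`, the tokenizer, the commit representation (a mapping from changed paths to new contents plus a list of deleted paths), and the particular TF-IDF weighting are all assumptions made for illustration. Only the files touched by a commit have their term and document frequencies adjusted; unchanged files are never re-read.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


class IncrementalVSMIndex:
    """Illustrative TF-IDF index updated from per-commit changes."""

    def __init__(self):
        self.doc_tf = {}                  # path -> Counter of term frequencies
        self.df = Counter()               # term -> number of files containing it
        self.postings = defaultdict(set)  # term -> set of paths containing it

    def _remove(self, path):
        old = self.doc_tf.pop(path, None)
        if old is None:
            return
        for term in old:
            self.df[term] -= 1
            self.postings[term].discard(path)
            if self.df[term] == 0:
                del self.df[term]
                del self.postings[term]

    def _add(self, path, text):
        tf = Counter(tokenize(text))
        self.doc_tf[path] = tf
        for term in tf:
            self.df[term] += 1
            self.postings[term].add(path)

    def apply_commit(self, changed, deleted=()):
        # `changed` maps path -> new file contents (added or modified files).
        for path in deleted:
            self._remove(path)
        for path, text in changed.items():
            self._remove(path)  # no-op for newly added files
            self._add(path, text)

    def _idf(self, term):
        return math.log(1 + len(self.doc_tf) / self.df[term])

    def query(self, bug_report, top_k=5):
        # Score candidate files by a length-normalized TF-IDF dot product.
        scores = defaultdict(float)
        for term, qf in Counter(tokenize(bug_report)).items():
            if term not in self.df:
                continue
            idf = self._idf(term)
            for path in self.postings[term]:
                scores[path] += qf * idf * self.doc_tf[path][term] * idf
        ranked = []
        for path, score in scores.items():
            norm = math.sqrt(sum((f * self._idf(t)) ** 2
                                 for t, f in self.doc_tf[path].items()))
            ranked.append((path, score / norm))
        ranked.sort(key=lambda item: (-item[1], item[0]))
        return ranked[:top_k]


index = IncrementalVSMIndex()
index.apply_commit({"A.java": "parser null pointer", "B.java": "date time zone"})
index.apply_commit({"B.java": "parser null pointer check"}, deleted=["A.java"])
print(index.query("null pointer in parser"))
```

In this sketch the cost of `apply_commit` depends on the size of the changed files, not on the size of the repository, which is the efficiency argument the abstract makes. IDF values are computed at query time from the maintained document frequencies, so they stay consistent with the current repository state without a rebuild.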
