Enabling improved IR-based feature location

Recent solutions to software engineering problems have incorporated tools and techniques from information retrieval (IR). Using IR requires choosing an appropriate retrieval model and formulating a query that best captures a particular information need. Taking feature location as a representative example, three research questions are investigated: (1) the impact of query preprocessing, (2) the impact that different techniques for scraping queries have on retrieval performance, and (3) the impact that the underlying retrieval model has on identifying the correct source-code functions (the correct documents). These research questions are addressed using the five open source projects released as part of the SEMERU dataset. The experiments consider five methods of scraping queries from modification requests and seven retrieval model instances. Using the standard evaluation metric Mean Reciprocal Rank (MRR), the experimental analysis reveals that the better retrieval models are not the ones commonly used by software engineering researchers. Models based on query-likelihood perform about twice as well as models in common use in software engineering, such as LSI, and thus deserve greater attention. Furthermore, corpus preprocessing has a significant impact: the top performing setting is over 100% better than the average.
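For readers unfamiliar with the metric, MRR averages, over all queries, the reciprocal of the rank at which the first correct document appears. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names and the toy rankings are hypothetical, and it assumes that a query whose correct function is never retrieved contributes 0.

```python
from typing import Hashable, Sequence, Set, Tuple


def reciprocal_rank(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Return 1/rank of the first relevant document, or 0.0 if none is retrieved."""
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0  # assumption: a miss contributes zero to the mean


def mean_reciprocal_rank(
    runs: Sequence[Tuple[Sequence[Hashable], Set[Hashable]]]
) -> float:
    """Average the reciprocal rank over (ranking, relevant-set) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ranking, relevant) for ranking, relevant in runs) / len(runs)


# Hypothetical example: three queries, each with a ranked list of function names
# and the set of functions that actually implement the requested feature.
runs = [
    (["parse", "render", "save"], {"render"}),  # first hit at rank 2 -> 0.5
    (["open", "close"], {"open"}),              # first hit at rank 1 -> 1.0
    (["init", "reset"], {"flush"}),             # no hit -> 0.0
]
mrr = mean_reciprocal_rank(runs)
print(mrr)
```

Under this definition, an MRR of 0.5 means the first correct function appears, on average in the reciprocal sense, at about rank 2, so a model that doubles MRR roughly halves how far down the list a developer must look.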
