Assisting code search with automatic Query Reformulation for bug localization

Source code retrieval plays an important role in many software engineering tasks. However, designing a query that can accurately retrieve the relevant software artifacts can be challenging for developers as it requires a certain level of knowledge and experience regarding the code base. This paper demonstrates how the difficulty of designing a proper query can be alleviated through automatic Query Reformulation (QR) - an under-the-hood operation for reformulating a user's query with no additional input from the user. The proposed QR framework works by enriching a user's search query with certain specific additional terms drawn from the highest-ranked artifacts retrieved in response to the initial query. The important point here is that these additional terms injected into a query are those that are deemed to be “close” to the original query terms in the source code on the basis of positional proximity. This similarity metric is based on the notion that terms that deal with the same concepts in source code are usually proximal to one another in the same files. We demonstrate the superiority of our QR framework in relation to the QR frameworks well-known in the natural language document retrieval by showing significant improvements in bug localization performance for two large software projects using more than 4,000 queries.

[1]  Avinash C. Kak,et al.  Incorporating version histories in Information Retrieval based bug localization , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[2]  K. Sparck Jones,et al.  Simple, proven approaches to text retrieval , 1994 .

[3]  Avinash C. Kak,et al.  Retrieval from software libraries for bug localization: a comparative study of generic and composite text models , 2011, MSR '11.

[4]  Gerard Salton,et al.  Research and Development in Information Retrieval , 1982, Lecture Notes in Computer Science.

[5]  Rongxin Wu,et al.  ReLink: recovering links between bugs and changes , 2011, ESEC/FSE '11.

[6]  Stephen E. Robertson,et al.  Selecting good expansion terms for pseudo-relevance feedback , 2008, SIGIR '08.

[7]  ChengXiang Zhai,et al.  A comparative study of methods for estimating query language models with pseudo feedback , 2009, CIKM.

[8]  Martin F. Porter,et al.  An algorithm for suffix stripping , 1997, Program.

[9]  Markus Pizka,et al.  Concise and consistent naming , 2005, 13th International Workshop on Program Comprehension (IWPC'05).

[10]  Letha H. Etzkorn,et al.  Source Code Retrieval for Bug Localization Using Latent Dirichlet Allocation , 2008, 2008 15th Working Conference on Reverse Engineering.

[11]  William G. Griswold,et al.  Foundations of Software Engineering (FSE-10) , 2002 .

[12]  Andrian Marcus,et al.  An information retrieval approach to concept location in source code , 2004, 11th Working Conference on Reverse Engineering.

[13]  Tao Tao,et al.  An exploration of proximity measures in information retrieval , 2007, SIGIR.

[14]  Xiangji Huang,et al.  Proximity-based rocchio's model for pseudo relevance , 2012, SIGIR '12.

[15]  Iadh Ounis,et al.  University of Glasgow at TREC 2006: Experiments in Terabyte and Enterprise Tracks with Terrier , 2006, TREC.

[16]  Jinqiu Yang,et al.  Inferring semantically related words from software context , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[17]  Iadh Ounis,et al.  University of Glasgow at TREC 2004: Experiments in Web, Robust, and Terabyte Tracks with Terrier , 2004, TREC.

[18]  Thomas Zimmermann,et al.  Extraction of bug localization benchmarks from history , 2007, ASE.

[19]  ChengXiang Zhai,et al.  Positional relevance model for pseudo-relevance feedback , 2010, SIGIR.

[20]  Ted J. Biggerstaff,et al.  Program understanding and the concept assignment problem , 1994, CACM.

[21]  Bogdan Dit,et al.  Feature location in source code: a taxonomy and survey , 2013, J. Softw. Evol. Process..

[22]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[23]  Thomas Zimmermann,et al.  Preprocessing CVS Data for Fine-Grained Analysis , 2004, MSR.

[24]  Tim Menzies,et al.  On the use of relevance feedback in IR-based concept location , 2009, 2009 IEEE International Conference on Software Maintenance.

[25]  Gabriele Bavota,et al.  Automatic query performance assessment during the retrieval of software artifacts , 2012, 2012 Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering.

[26]  W. Bruce Croft,et al.  Relevance-Based Language Models , 2001, SIGIR '01.

[27]  Jian Zhou,et al.  Where should the bugs be fixed? More accurate information retrieval-based bug localization based on bug reports , 2012, 2012 34th International Conference on Software Engineering (ICSE).

[28]  Dinesh Kumar,et al.  Data Mining for Environment Monitoring , 2012 .

[29]  Jaime Teevan,et al.  Implicit feedback for inferring user preference: a bibliography , 2003, SIGF.

[30]  Premkumar T. Devanbu,et al.  The missing links: bugs and bug-fix commits , 2010, FSE '10.

[31]  Iadh Ounis,et al.  Inferring Query Performance Using Pre-retrieval Predictors , 2004, SPIRE.

[32]  J. J. Rocchio,et al.  Relevance feedback in information retrieval , 1971 .