Expanding Queries for Code Search Using Semantically Related API Class-names

When encountering unfamiliar programming tasks (e.g., connecting to a database), there is a need to seek potential working code examples. Instead of using code search engines, software developers usually post related programming questions on online Q&A forums (e.g., Stack Overflow). One possible reason is that existing code search engines would return effective code examples only if a query contains identifiers (e.g., class or method names). In other words, existing code search engines do not handle natural-language queries well (e.g., a description of a programming task). However, developers may not know the appropriate identifiers at the time of the search. As the demand of searching code examples is increasing, it is of significant interest to enhance code search engines. We conjecture that expanding natural-language queries with their semantically related identifiers has a great potential to enhance code search engines. In this paper, we propose an automated approach to find identifiers (in particular API class-names) that are semantically related to a given natural-language query. We evaluate the effectiveness of our approach using 74 queries on a corpus of 23,677,216 code snippets that are extracted from 24,666 open source Java projects. The results show that our approach can effectively recommend semantically related API class-names to expand the original natural-language queries. For instance, our approach successfully retrieves relevant code examples in the top 10 retrieved results for 76 percent of 74 queries, while it is 36 percent when using the original natural-language query; and the median rank of the first relevant code example is increased from 22 to 7.

[1]  R. Holmes,et al.  Using structural context to recommend source code examples , 2005, Proceedings. 27th International Conference on Software Engineering, 2005. ICSE 2005..

[2]  Chanchal Kumar Roy,et al.  Towards a Big Data Curated Benchmark of Inter-project Code Clones , 2014, 2014 IEEE International Conference on Software Maintenance and Evolution.

[3]  Jinqiu Yang,et al.  Inferring semantically related words from software context , 2012, 2012 9th IEEE Working Conference on Mining Software Repositories (MSR).

[4]  Justin Zobel,et al.  How reliable are the results of large-scale information retrieval experiments? , 1998, SIGIR '98.

[5]  Jane Huffman Hayes,et al.  Advancing candidate link generation for requirements tracing: the study of methods , 2006, IEEE Transactions on Software Engineering.

[6]  David Lo,et al.  Query expansion via WordNet for effective code search , 2015, 2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER).

[7]  David Lo,et al.  Active code search: incorporating user feedback to improve code search relevance , 2014, ASE.

[8]  Eran Yahav,et al.  Typestate-based semantic code search over partial programs , 2012, OOPSLA '12.

[9]  Terry L King A Guide to Chi-Squared Testing , 1997 .

[10]  Gerard Salton,et al.  The SMART Retrieval System—Experiments in Automatic Document Processing , 1971 .

[11]  Ying Zou,et al.  Spotting working code examples , 2014, ICSE.

[12]  Hinrich Schütze,et al.  Introduction to information retrieval , 2008 .

[13]  Avinash C. Kak,et al.  Assisting code search with automatic Query Reformulation for bug localization , 2013, 2013 10th Working Conference on Mining Software Repositories (MSR).

[14]  Collin McMillan,et al.  Exemplar: A Source Code Search Engine for Finding Highly Relevant Applications , 2012, IEEE Transactions on Software Engineering.

[15]  sennis About the sample size calculator , 2017 .

[16]  Maliha S. Nash,et al.  Handbook of Parametric and Nonparametric Statistical Procedures , 2001, Technometrics.

[17]  Emily Hill,et al.  Identifying Word Relations in Software: A Comparative Study of Semantic Similarity Tools , 2008, 2008 16th IEEE International Conference on Program Comprehension.

[18]  Sushil Krishna Bajracharya,et al.  Leveraging usage similarity for effective retrieval of examples in code repositories , 2010, FSE '10.

[19]  David Lo,et al.  Code Search via Topic-Enriched Dependence Graph Matching , 2011, 2011 18th Working Conference on Reverse Engineering.

[20]  Jeffrey Dean,et al.  Efficient Estimation of Word Representations in Vector Space , 2013, ICLR.

[21]  David Lo,et al.  Finding relevant answers in software forums , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[22]  Jeffrey Xu Yu,et al.  Matching dependence-related queries in the system dependence graph , 2010, ASE.

[23]  Mónica Marrero,et al.  A comparison of the optimality of statistical significance tests for information retrieval evaluation , 2013, SIGIR.

[24]  Gerard Salton,et al.  A vector space model for automatic indexing , 1975, CACM.

[25]  Claudio Carpineto,et al.  A Survey of Automatic Query Expansion in Information Retrieval , 2012, CSUR.

[26]  David Lo,et al.  Automated construction of a software-specific word similarity database , 2014, 2014 Software Evolution Week - IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE).

[27]  R. Yin Case Study Research: Design and Methods , 1984 .

[28]  Andrian Marcus,et al.  An information retrieval approach to concept location in source code , 2004, 11th Working Conference on Reverse Engineering.

[29]  Gabriele Bavota,et al.  Automatic query reformulations for text retrieval in software engineering , 2013, 2013 35th International Conference on Software Engineering (ICSE).

[30]  Kathryn T. Stolee,et al.  Solving the Search for Source Code , 2014, ACM Trans. Softw. Eng. Methodol..

[31]  Stéphane Ducasse,et al.  Semantic clustering: Identifying topics in source code , 2007, Inf. Softw. Technol..

[32]  Ellen M. Voorhees,et al.  The TREC-8 Question Answering Track Report , 1999, TREC.

[33]  Collin McMillan,et al.  Portfolio: Searching for relevant functions and their usages in millions of lines of code , 2013, TSEM.

[34]  Filip Radlinski,et al.  Evaluating the accuracy of implicit feedback from clicks and query reformulations in Web search , 2007, TOIS.

[35]  Emerson R. Murphy-Hill,et al.  How developers use multi-recommendation system in local code search , 2014, 2014 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC).

[36]  Tao Xie,et al.  Parseweb: a programmer assistant for reusing open source code on the web , 2007, ASE.

[37]  Koushik Sen,et al.  SNIFF: A Search Engine for Java Using Free-Form Queries , 2009, FASE.

[38]  Thorsten Joachims,et al.  Optimizing search engines using clickthrough data , 2002, KDD.

[39]  Tim Menzies,et al.  On the use of relevance feedback in IR-based concept location , 2009, 2009 IEEE International Conference on Software Maintenance.

[40]  Sushil Krishna Bajracharya,et al.  Analyzing and mining a code search engine usage log , 2010, Empirical Software Engineering.

[41]  L. R. Dice Measures of the Amount of Ecologic Association Between Species , 1945 .

[42]  M. G. Bulmer,et al.  Principles of Statistics. , 1969 .

[43]  Stephen E. Robertson,et al.  Microsoft Cambridge at TREC 13: Web and Hard Tracks , 2004, TREC.