Are the Code Snippets What We Are Searching for? A Benchmark and an Empirical Study on Code Search with Natural-Language Queries

Code search methods, especially those that allow programmers to raise queries in a natural language, play an important role in software development. They help improve programmers' productivity by returning sample code snippets from the Internet and/or source-code repositories in response to natural-language queries. Many code search methods in the literature support natural-language queries. However, it is difficult to recognize the strengths and weaknesses of each method and to choose the right one for a given usage scenario, because (1) the implementations of those methods and the datasets for evaluating them are usually not publicly available, and (2) some methods leverage different training datasets or auxiliary data sources, so their effectiveness cannot be fairly measured and may be negatively affected in practical use. To build a common ground for measuring code search methods, this paper builds CosBench, a dataset that consists of 1000 projects, 52 code-independent natural-language queries with ground truths, and a set of scripts for calculating four metrics on code search results. We have evaluated four IR (Information Retrieval)-based and two DL (Deep Learning)-based code search methods on CosBench. The empirical evaluation results clearly show the usefulness of the CosBench dataset and the various strengths of each code search method. We found that DL-based methods are more suitable for queries on reusing code, and IR-based ones for queries on resolving bugs and learning API uses.
