deGraphCS: Embedding Variable-based Flow Graph for Neural Code Search

With the rapid growth in the number of public code repositories, developers have a strong desire to retrieve precise code snippets using natural language. Although existing deep-learning-based approaches (e.g., DeepCS and MMAN) provide end-to-end solutions (i.e., they accept natural language queries and return related code fragments retrieved directly from a code corpus), the accuracy of code search in large-scale repositories is still limited by the code representation (e.g., AST) and the modeling (e.g., directly fusing the features in the attention stage). In this paper, we propose a novel learnable deep Graph for Code Search (called DEGRAPHCS), which transforms source code into variable-based flow graphs based on the intermediate representation technique; these graphs model code semantics more precisely than processing the code directly as text or using a syntactic tree representation. Furthermore, we propose a well-designed graph optimization mechanism to refine the code representation, and apply an improved gated graph neural network to model variable-based flow graphs. To evaluate the effectiveness of DEGRAPHCS, we collect a large-scale dataset from GitHub containing 41,152 code snippets written in the C language, and reproduce several typical deep code search methods for comparison. In addition, we design a qualitative user study to verify the practical value of our approach. The experimental results show that DEGRAPHCS achieves state-of-the-art performance and accurately retrieves code snippets that satisfy users' needs.

arXiv:2103.13020v3 [cs.SE] 16 Oct 2021. A preprint, October 19, 2021.
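To make the idea of a variable-based flow graph more concrete, the following is a minimal sketch and not the construction used by DEGRAPHCS. It assumes a hypothetical three-address instruction format that only resembles a compiler intermediate representation, and it records data-flow edges only; the instruction list, the function name `build_variable_flow_graph`, and the edge labels are all illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical three-address instructions: (destination, opcode, operands).
# This only mimics the flavor of a compiler IR; it is not real LLVM IR.
INSTRUCTIONS = [
    ("t1", "load", ["a"]),
    ("t2", "load", ["b"]),
    ("t3", "add", ["t1", "t2"]),
    ("c", "store", ["t3"]),
]


def build_variable_flow_graph(instructions):
    """Return {source_variable: [(destination_variable, opcode), ...]}."""
    edges = defaultdict(list)
    for dest, opcode, operands in instructions:
        for operand in operands:
            # A data-flow edge runs from each operand to the value it helps define.
            edges[operand].append((dest, opcode))
    return dict(edges)


graph = build_variable_flow_graph(INSTRUCTIONS)
for source, targets in graph.items():
    for dest, opcode in targets:
        print(f"{source} --{opcode}--> {dest}")
```

In this sketch the nodes are variables and temporaries rather than syntax-tree nodes, so two snippets that differ in surface syntax but compute `c = a + b` in the same way would yield the same edges. A graph neural network could then propagate information along such edges, although that step is not shown here.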
