Deep Graph Matching and Searching for Semantic Code Retrieval

Code retrieval is to find the code snippet from a large corpus of source code repositories that highly matches the query of natural language description. Recent work mainly uses natural language processing techniques to process both query texts (i.e., human natural language) and code snippets (i.e., machine programming language), however neglecting the deep structured features of natural language query texts and source codes, both of which contain rich semantic information. In this paper, we propose an end-to-end deep graph matching and searching (DGMS) model based on graph neural networks for semantic code retrieval. To this end, we first represent both natural language query texts and programming language codes with the unified graph-structured data, and then use the proposed graph matching and searching model to retrieve the best matching code snippet. In particular, DGMS not only captures more structural information for individual query texts or code snippets but also learns the fine-grained similarity between them by a cross-attention based semantic matching operation. We evaluate the proposed DGMS model on two public code retrieval datasets from two representative programming languages (i.e., Java and Python). The experiment results demonstrate that DGMS significantly outperforms state-of-the-art baseline models by a large margin on both datasets. Moreover, our extensive ablation studies systematically investigate and illustrate the impact of each part of DGMS.

[1]  Alvin Cheung,et al.  Summarizing Source Code using a Neural Attention Model , 2016, ACL.

[2]  Wenbing Huang,et al.  Deep Graph Learning: Foundations, Advances and Applications , 2020, KDD.

[3]  Marcel Urner Formal Syntax And Semantics Of Programming Languages A Laboratory Based Approach , 2016 .

[4]  Koushik Sen,et al.  When deep learning met code search , 2019, ESEC/SIGSOFT FSE.

[5]  Walter Daelemans,et al.  Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) , 2014, EMNLP 2014.

[6]  Max Welling,et al.  Semi-Supervised Classification with Graph Convolutional Networks , 2016, ICLR.

[7]  Yu Chen,et al.  Iterative Deep Graph Learning for Graph Neural Networks: Better and Robust Node Embeddings , 2019, NeurIPS.

[8]  Yizhou Sun,et al.  SimGNN: A Neural Network Approach to Fast Graph Similarity Computation , 2018, WSDM.

[9]  Charu C. Aggarwal,et al.  Scalable Global Alignment Graph Kernel Using Random Features: From Node Embedding to Graph Embedding , 2019, KDD.

[10]  Jure Leskovec,et al.  Inductive Representation Learning on Large Graphs , 2017, NIPS.

[11]  Pietro Liò,et al.  Graph Attention Networks , 2017, ICLR.

[12]  Tomas Mikolov,et al.  Enriching Word Vectors with Subword Information , 2016, TACL.

[13]  Ah Chung Tsoi,et al.  The Graph Neural Network Model , 2009, IEEE Transactions on Neural Networks.

[14]  Oleksandr Polozov,et al.  Generative Code Modeling with Graphs , 2018, ICLR.

[15]  Jure Leskovec,et al.  How Powerful are Graph Neural Networks? , 2018, ICLR.

[16]  Emily Hill,et al.  Improving source code search with natural language phrasal representations of method signatures , 2011, 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011).

[17]  Jan Eric Lenssen,et al.  Fast Graph Representation Learning with PyTorch Geometric , 2019, ArXiv.

[18]  Yann LeCun,et al.  Signature Verification Using A "Siamese" Time Delay Neural Network , 1993, Int. J. Pattern Recognit. Artif. Intell..

[19]  Tomoki Toda,et al.  Learning to Generate Pseudo-Code from Source Code Using Statistical Machine Translation (T) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[20]  Collin McMillan,et al.  Improved Code Summarization via a Graph Neural Network , 2020, 2020 IEEE/ACM 28th International Conference on Program Comprehension (ICPC).

[21]  Yoon Kim,et al.  Convolutional Neural Networks for Sentence Classification , 2014, EMNLP.

[22]  Omer Levy,et al.  code2seq: Generating Sequences from Structured Representations of Code , 2018, ICLR.

[23]  Jeffrey Pennington,et al.  GloVe: Global Vectors for Word Representation , 2014, EMNLP.

[24]  Jeffrey C Grossman,et al.  Crystal Graph Convolutional Neural Networks for an Accurate and Interpretable Prediction of Material Properties. , 2017, Physical review letters.

[25]  Bing Xue,et al.  KerGM: Kernelized Graph Matching , 2019, NeurIPS.

[26]  Philipp Koehn,et al.  Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , 2016 .

[27]  Sushil Krishna Bajracharya,et al.  Sourcerer: mining and searching internet-scale software repositories , 2008, Data Mining and Knowledge Discovery.

[28]  Rajeev Motwani,et al.  The PageRank Citation Ranking : Bringing Order to the Web , 1999, WWW 1999.

[29]  Collin McMillan,et al.  Portfolio: finding relevant functions and their usage , 2011, 2011 33rd International Conference on Software Engineering (ICSE).

[30]  Xiang Ling,et al.  Multilevel Graph Matching Networks for Deep Graph Similarity Learning , 2020, IEEE Transactions on Neural Networks and Learning Systems.

[31]  Baishakhi Ray,et al.  A Transformer-based Approach for Source Code Summarization , 2020, ACL.

[32]  Premkumar T. Devanbu,et al.  A Survey of Machine Learning for Big Code and Naturalness , 2017, ACM Comput. Surv..

[33]  Richard S. Zemel,et al.  Gated Graph Sequence Neural Networks , 2015, ICLR.

[34]  Natalia Gimelshein,et al.  PyTorch: An Imperative Style, High-Performance Deep Learning Library , 2019, NeurIPS.

[35]  Xiaodong Gu,et al.  Deep Code Search , 2018, 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE).

[36]  Samuel S. Schoenholz,et al.  Neural Message Passing for Quantum Chemistry , 2017, ICML.

[37]  Davide Bacciu,et al.  A Fair Comparison of Graph Neural Networks for Graph Classification , 2020, ICLR.

[38]  Satish Chandra,et al.  Neural Code Search Evaluation Dataset , 2019, ArXiv.

[39]  Anima Anandkumar,et al.  Open Vocabulary Learning on Source Code with a Graph-Structured Cache , 2018, ICML.

[40]  Noam Chomsky,et al.  Three models for the description of language , 1956, IRE Trans. Inf. Theory.

[41]  James H. Martin,et al.  Speech and language processing: an introduction to natural language processing , 2000 .

[42]  Pierre Vandergheynst,et al.  Geometric Deep Learning: Going beyond Euclidean data , 2016, IEEE Signal Process. Mag..

[43]  Ming-Wei Chang,et al.  BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , 2019, NAACL.

[44]  Krystian Mikolajczyk,et al.  Learning local feature descriptors with triplets and shallow convolutional neural networks , 2016, BMVC.

[45]  Huan Sun,et al.  CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning , 2019, WWW.

[46]  Renuka Sindhgatta Using an information retrieval system to retrieve source code samples , 2006, ICSE '06.

[47]  Marc Brockschmidt,et al.  Learning to Represent Programs with Graphs , 2017, ICLR.

[48]  Yixin Chen,et al.  Link Prediction Based on Graph Neural Networks , 2018, NeurIPS.

[49]  Koushik Sen,et al.  Retrieval on source code: a neural code search , 2018, MAPL@PLDI.

[50]  Omer Levy,et al.  Structural Language Models of Code , 2019, ICML.

[51]  Zhiguo Wang,et al.  Bilateral Multi-Perspective Matching for Natural Language Sentences , 2017, IJCAI.

[52]  Philip S. Yu,et al.  A Comprehensive Survey on Graph Neural Networks , 2019, IEEE Transactions on Neural Networks and Learning Systems.

[53]  Max Welling,et al.  Modeling Relational Data with Graph Convolutional Networks , 2017, ESWC.

[54]  Kenneth Slonneger,et al.  Formal syntax and semantics of programming languages , 1994 .

[55]  Philip S. Yu,et al.  Multi-modal Attention Network Learning for Semantic Source Code Retrieval , 2019, 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[56]  Marc Brockschmidt,et al.  Structured Neural Summarization , 2018, ICLR.

[57]  Dongmei Zhang,et al.  CodeHow: Effective Code Search Based on API Understanding and Extended Boolean Model (E) , 2015, 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE).

[58]  Uri Alon,et al.  code2vec: learning distributed representations of code , 2018, Proc. ACM Program. Lang..

[59]  Jinjun Xiong,et al.  A Multi-Perspective Architecture for Semantic Code Search , 2020, ACL.

[60]  Mihai Surdeanu,et al.  The Stanford CoreNLP Natural Language Processing Toolkit , 2014, ACL.

[61]  Chunming Wu,et al.  Hierarchical Graph Matching Networks for Deep Graph Similarity Learning , 2019, ArXiv.

[62]  Lukasz Kaiser,et al.  Attention is All you Need , 2017, NIPS.

[63]  Samy Bengio,et al.  Links between perceptrons, MLPs and SVMs , 2004, ICML.

[64]  Marc Brockschmidt,et al.  CodeSearchNet Challenge: Evaluating the State of Semantic Code Search , 2019, ArXiv.

[65]  Shuohang Wang,et al.  A Compare-Aggregate Model for Matching Text Sequences , 2016, ICLR.

[66]  Jürgen Schmidhuber,et al.  Long Short-Term Memory , 1997, Neural Computation.

[67]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.