NS3: Neuro-Symbolic Semantic Code Search

Semantic code search is the task of retrieving a code snippet given a textual description of its functionality. Recent work has focused on similarity metrics between neural embeddings of text and code. However, current language models are known to struggle with longer, compositional text and with multi-step reasoning. To overcome this limitation, we propose supplementing the query sentence with a layout of its semantic structure. The semantic layout is used to break down the final reasoning decision into a series of lower-level decisions. We implement this idea with a Neural Module Network architecture. We compare our model, NS3 (Neuro-Symbolic Semantic Search), to a number of baselines, including state-of-the-art semantic code retrieval methods, and evaluate on two datasets: CodeSearchNet and Code Search and Question Answering (CoSQA). We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
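The idea of breaking one retrieval decision into a series of lower-level decisions can be illustrated with a toy sketch. This is not the NS3 model: the `Step` layout format, the substring-matching `entity_score`, and the product used to combine scores are all assumptions made for illustration, standing in for a parsed semantic layout and learned neural modules.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One node of a hypothetical semantic layout: an action and its arguments."""
    action: str
    arguments: List[str]


def entity_score(phrase: str, code: str) -> float:
    """Toy stand-in for a learned module: fraction of phrase tokens found in the code."""
    tokens = phrase.lower().split()
    code_lower = code.lower()
    hits = sum(1 for token in tokens if token in code_lower)
    return hits / len(tokens) if tokens else 0.0


def step_score(step: Step, code: str) -> float:
    """Lower-level decision: does the code match this action and all of its arguments?"""
    score = entity_score(step.action, code)
    for argument in step.arguments:
        score *= entity_score(argument, code)
    return score


def layout_score(layout: List[Step], code: str) -> float:
    """Final decision: combine the per-step scores (here, assumed to be a product)."""
    score = 1.0
    for step in layout:
        score *= step_score(step, code)
    return score


def rank(layout: List[Step], candidates: List[str]) -> List[str]:
    """Order candidate snippets from best to worst match for the layout."""
    return sorted(candidates, key=lambda code: layout_score(layout, code), reverse=True)


# Hand-written layout for the query "load json file and sort keys".
layout = [Step("load", ["json", "file"]), Step("sort", ["keys"])]

snippet_a = "def load_json_file(path):\n    with open(path) as f:\n        return json.load(f)"
snippet_b = "def sort_json_keys(file):\n    data = json.load(file)\n    return sorted(data.keys())"

ranked = rank(layout, [snippet_a, snippet_b])
```

In this sketch `snippet_a` satisfies the "load" step but not the "sort" step, so its combined score is zero and `snippet_b` ranks first. A single whole-query similarity score would not expose which part of the query a snippet fails to satisfy.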
