CodeRetriever: Unimodal and Bimodal Contrastive Learning

In this paper, we propose CodeRetriever, which combines unimodal and bimodal contrastive learning to train function-level code semantic representations, specifically for the code search task. For unimodal contrastive learning, we design a semantic-guided method to build positive code-code pairs based on documentation and function names. For bimodal contrastive learning, we leverage the documentation and inline comments of code to build text-code pairs. Both contrastive objectives can fully exploit the large-scale code corpus for pre-training. Experimental results on several public benchmarks (e.g., CodeSearch and CoSQA) demonstrate the effectiveness of CodeRetriever in the zero-shot setting. By fine-tuning with domain- or language-specific downstream data, CodeRetriever achieves new state-of-the-art performance with significant improvements over existing code pre-trained models. We will make the code, model checkpoints, and constructed datasets publicly available.
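To make the two objectives concrete, the following minimal sketch shows an in-batch InfoNCE-style contrastive loss, a common formulation for objectives of this kind. The function name, temperature value, and encoder inputs are illustrative assumptions, not the paper's exact recipe: the bimodal objective would apply the loss to paired (text, code) embeddings, and the unimodal objective to semantically paired (code, code) embeddings.

    import torch
    import torch.nn.functional as F

    def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
        # anchor_emb, positive_emb: (batch, dim) tensors; row i of each forms a positive pair,
        # and all other rows in the batch serve as in-batch negatives.
        anchor_emb = F.normalize(anchor_emb, dim=-1)
        positive_emb = F.normalize(positive_emb, dim=-1)
        logits = anchor_emb @ positive_emb.t() / temperature  # (batch, batch) cosine similarities
        labels = torch.arange(logits.size(0), device=logits.device)
        return F.cross_entropy(logits, labels)

    # Assumed usage (embeddings come from the code/text encoder):
    # bimodal:  loss = in_batch_contrastive_loss(text_emb, code_emb)
    # unimodal: loss = in_batch_contrastive_loss(code_emb_a, code_emb_b)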
