CoCoSoDa: Effective Contrastive Learning for Code Search

Code search aims to retrieve semantically relevant code snippets for a given natural language query. Recently, many approaches employing contrastive learning have shown promising results on code representation learning and greatly improved the performance of code search. However, there is still considerable room for improvement in how contrastive learning is used for code search. In this paper, we propose CoCoSoDa, which uses contrastive learning effectively for code search by addressing two of its key factors: data augmentation and negative samples. Specifically, soft data augmentation dynamically masks some tokens in the input sequences or replaces them with their types to generate positive samples. A momentum mechanism generates large and consistent representations of negative samples for each mini-batch by maintaining a queue and a momentum encoder. In addition, multimodal contrastive learning pulls together the representations of paired code snippets and queries and pushes apart those of unpaired code snippets and queries. We conduct extensive experiments to evaluate the effectiveness of our approach on a large-scale dataset covering six programming languages. Experimental results show that: (1) CoCoSoDa outperforms 14 baselines and, in particular, exceeds CodeBERT, GraphCodeBERT, and UniXcoder by 13.3%, 10.5%, and 5.9% on average MRR scores, respectively. (2) The ablation studies show the effectiveness of each component of our approach. (3) When we adapt our techniques to several pre-trained models, such as RoBERTa, CodeBERT, and GraphCodeBERT, we observe a significant boost in their code search performance. (4) Our model performs robustly under different hyper-parameters. Furthermore, we perform qualitative and quantitative analyses to explore the reasons behind the good performance of our model.
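
The three mechanisms named in the abstract (soft data augmentation, a momentum encoder with a queue of negatives, and a contrastive loss over code-query pairs) can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. The function names, the `<mask>` and `<type>` placeholder formats, the probabilities, the momentum coefficient, the temperature, and the queue size are all assumptions chosen for illustration, token types are assumed to be supplied by some external tokenizer or parser, and only one direction of the loss (query as anchor, code as candidates) is shown.

```python
import math
import random
from collections import deque

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def soft_augment(tokens, types, p_mask=0.15, p_type=0.15, rng=random):
    """Soft data augmentation: mask a token or replace it with its type.

    `types` holds one type label per token (assumed to be given).
    The probabilities are illustrative assumptions.
    """
    out = []
    for tok, typ in zip(tokens, types):
        r = rng.random()
        if r < p_mask:
            out.append(MASK)
        elif r < p_mask + p_type:
            out.append(f"<{typ}>")  # assumed placeholder format
        else:
            out.append(tok)
    return out


def momentum_update(key_params, query_params, m=0.999):
    """Move the momentum encoder's weights slowly toward the main encoder's."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """InfoNCE loss: negative log-softmax score of the positive candidate."""
    a = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [sum(x * y for x, y in zip(a, c)) / temperature for c in candidates]
    top = max(logits)  # subtract the maximum for numerical stability
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


# A fixed-size queue of momentum-encoded representations serves as negatives;
# once it is full, the oldest entries are dropped automatically.
negative_queue = deque(maxlen=4)
for vec in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    negative_queue.append(vec)

tokens = ["def", "add", "(", "a", ",", "b", ")"]
types = ["keyword", "identifier", "punct", "identifier", "punct", "identifier", "punct"]
augmented = soft_augment(tokens, types, rng=random.Random(0))

query_vec = [1.0, 0.1]  # stand-in for an encoded query
code_vec = [0.9, 0.2]   # stand-in for the paired code snippet
loss = info_nce(query_vec, code_vec, list(negative_queue))
```

Because the queue is decoupled from the mini-batch, the number of negatives can be much larger than the batch size, and the slow momentum update keeps the queued representations consistent with one another across training steps.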
