Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracies. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; this paper aims to fill that gap. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-source code-understanding BERT model; and, second, we create an open-source benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
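To make the fine-tuning setup concrete, the following Python sketch fine-tunes a generic pre-trained BERT-style encoder with a classification head on a single labeled code snippet, in the spirit of the paper's binary code-classification tasks. It is only an illustration, not the released CuBERT pipeline: the checkpoint name, the toy example, and the label convention are placeholders.

# Illustrative sketch of BERT-style fine-tuning for a code-classification task.
# Placeholder checkpoint and data; not the authors' released CuBERT artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "bert-base-uncased"  # placeholder; substitute a code-pretrained checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# One toy training example for a binary task (e.g. "is this function buggy?").
code = "def add(a, b):\n    return a - b"
inputs = tokenizer(code, truncation=True, max_length=128, return_tensors="pt")
labels = torch.tensor([1])  # illustrative label convention: 1 = buggy, 0 = correct

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss over the two classes
outputs.loss.backward()
optimizer.step()

In the paper's setting, the same pre-trained encoder is fine-tuned separately on each of the five classification tasks (and on the program-repair task), rather than training a task-specific model from scratch.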
