Learning and Evaluating Contextual Embedding of Source Code

Recent research has achieved impressive results on understanding and improving source code by building on machine-learning techniques developed for natural languages. A significant advancement in natural-language understanding has come with the development of pre-trained contextual embeddings, such as BERT, which can be fine-tuned for downstream tasks with less labeled data and a smaller training budget, while achieving better accuracies. However, there has been no attempt yet to obtain a high-quality contextual embedding of source code and to evaluate it on multiple program-understanding tasks simultaneously; this paper aims to fill that gap. Specifically, first, we curate a massive, deduplicated corpus of 7.4M Python files from GitHub, which we use to pre-train CuBERT, an open-source code-understanding BERT model; and, second, we create an open-source benchmark that comprises five classification tasks and one program-repair task, akin to code-understanding tasks proposed in the literature. We fine-tune CuBERT on our benchmark tasks and compare the resulting models to different variants of Word2Vec token embeddings, BiLSTM and Transformer models, as well as published state-of-the-art models, showing that CuBERT outperforms them all, even with shorter training and fewer labeled examples. Future work on source-code embedding can benefit from reusing our benchmark and from comparing against CuBERT models as a strong baseline.
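To make the fine-tuning setup concrete, the following Python sketch fine-tunes a generic pre-trained BERT-style encoder with a classification head on a single labeled code snippet, in the spirit of the paper's binary code-classification tasks. It is only an illustration, not the released CuBERT pipeline: the checkpoint name, the toy example, and the label convention are placeholders.

# Illustrative sketch of BERT-style fine-tuning for a code-classification task.
# Placeholder checkpoint and data; not the authors' released CuBERT artifacts.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

CHECKPOINT = "bert-base-uncased"  # placeholder; substitute a code-pretrained checkpoint

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# One toy training example for a binary task (e.g. "is this function buggy?").
code = "def add(a, b):\n    return a - b"
inputs = tokenizer(code, truncation=True, max_length=128, return_tensors="pt")
labels = torch.tensor([1])  # illustrative label convention: 1 = buggy, 0 = correct

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss over the two classes
outputs.loss.backward()
optimizer.step()

In the paper's setting, the same pre-trained encoder is fine-tuned separately on each of the five classification tasks (and on the program-repair task), rather than training a task-specific model from scratch.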
