Pre-trained Contextual Embedding of Source Code

The source code of a program serves not only as a formal description of an executable task, but also as a means of communicating developer intent in human-readable form. To facilitate this, developers use meaningful identifier names and natural-language documentation, which makes it possible to apply sequence-modeling approaches, shown to be effective in natural-language processing, successfully to source code. A major advancement in natural-language understanding has been the use of pre-trained token embeddings; BERT and subsequent work have further shown that pre-trained contextual embeddings can be extremely powerful and can be fine-tuned effectively for a variety of downstream supervised tasks. Inspired by these developments, we present the first attempt to replicate this success on source code. We curate a massive corpus of Python programs from GitHub to pre-train a BERT model, which we call Code Understanding BERT (CuBERT). We also pre-train Word2Vec embeddings on the same dataset. We create a benchmark of five classification tasks and compare fine-tuned CuBERT against sequence models trained with and without the Word2Vec embeddings. Our results show that CuBERT outperforms the baseline methods by margins of 2.9% to 22%. We also show its superiority when it is fine-tuned on smaller datasets and for fewer epochs. We further evaluate CuBERT's effectiveness on a joint classification, localization, and repair task that requires predicting two pointers.
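
To make the Word2Vec baseline concrete, the following is a minimal sketch, assuming gensim 4.x and Python's built-in tokenize module, of how token-level Word2Vec embeddings can be trained on tokenized Python source. The corpus, tokenizer, and hyperparameters shown here are illustrative placeholders, not the settings used in the paper:

import io
import tokenize

from gensim.models import Word2Vec

def code_tokens(source):
    """Return the lexical tokens of a Python source string, dropping
    whitespace-only tokens (indents, newlines, end markers)."""
    reader = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(reader)
            if tok.string.strip()]

# Hypothetical toy corpus; the paper instead uses a massive corpus of
# Python programs collected from GitHub.
corpus = [
    "def add(a, b):\n    return a + b\n",
    "def is_even(n):\n    return n % 2 == 0\n",
]
sentences = [code_tokens(src) for src in corpus]

# Skip-gram Word2Vec over the token sequences; hyperparameters are
# purely illustrative.
model = Word2Vec(sentences, vector_size=128, window=5, min_count=1,
                 sg=1, epochs=10)
print(model.wv.most_similar("return", topn=3))

The resulting embedding table can then initialize the token embeddings of the baseline sequence models, whereas fine-tuned CuBERT produces contextual representations from its full pre-trained Transformer encoder.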
