Learning Python Code Suggestion with a Sparse Pointer Network

To enhance developer productivity, all modern integrated development environments (IDEs) include code suggestion functionality that proposes likely next tokens at the cursor. Because this functionality relies on type annotations, current IDEs support dynamically-typed languages less well than statically-typed ones. Moreover, suggestion engines in modern IDEs do not propose expressions or multi-statement idiomatic code. Recent work has shown that language models can improve code suggestion systems by learning from software repositories. This paper introduces a neural language model with a sparse pointer network aimed at capturing very long-range dependencies. We release a large-scale code suggestion corpus of 41M lines of Python code crawled from GitHub. On this corpus, we find that standard neural language models perform well at suggesting local phenomena but struggle to refer to identifiers that were introduced many tokens in the past. By augmenting a neural language model with a pointer network specialized in referring to predefined classes of identifiers, we obtain a much lower perplexity and a 5 percentage point increase in code suggestion accuracy compared to an LSTM baseline. This increase is due to a 13 times more accurate prediction of identifiers. Furthermore, a qualitative analysis shows that this model indeed captures interesting long-range dependencies, such as referring to a class member defined over 60 tokens in the past.
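
The idea of augmenting a language model with a pointer over previously seen identifiers can be illustrated with a minimal sketch. It is not the paper's exact formulation: the function names, the scalar `gate`, and the list-based memory are assumptions made for illustration. The sketch blends the language model's next-token distribution with an attention distribution over a small memory of identifier token ids, so that an identifier introduced far back in the file can still receive high probability.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_with_pointer(lm_probs, memory_token_ids, pointer_scores, gate):
    """Blend a language-model distribution with a pointer distribution.

    lm_probs:          probabilities over the vocabulary from the language model
    memory_token_ids:  vocabulary ids of identifiers kept in memory (hypothetical layout)
    pointer_scores:    unnormalised attention scores, one per memory slot
    gate:              assumed scalar in [0, 1], the weight given to the pointer
    """
    if len(memory_token_ids) != len(pointer_scores):
        raise ValueError("one pointer score is required per memory slot")
    if not memory_token_ids:
        # Nothing to point at: fall back to the plain language model.
        return list(lm_probs)

    pointer_probs = softmax(pointer_scores)
    mixed = [(1.0 - gate) * p for p in lm_probs]
    # Scatter pointer mass back onto the vocabulary; repeated ids accumulate.
    for token_id, p in zip(memory_token_ids, pointer_probs):
        mixed[token_id] += gate * p
    return mixed


# Toy example: vocabulary of 4 tokens, memory holds ids 2, 3 and 2 again.
lm_probs = [0.7, 0.1, 0.1, 0.1]
mixed = mix_with_pointer(lm_probs, [2, 3, 2], [0.0, 0.0, 0.0], gate=0.5)
best = max(range(len(mixed)), key=mixed.__getitem__)
```

In this toy example the language model alone prefers token 0, while the mixture prefers token 2 because that identifier appears twice in memory. In a trained model the scores and the gate would come from learned functions of the network state rather than fixed values.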
