Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. A major challenge, however, is that code is written using an open, rapidly changing vocabulary, driven for example by the coinage of new variable and method names; reasoning over such a vocabulary is not something most NLP methods are designed to do. We introduce a Graph-Structured Cache to address this problem: the cache contains a node for each new word the model encounters, with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task, with over 100% relative improvement on the latter, at the cost of a moderate increase in computation time.
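To make the data structure concrete, below is a minimal sketch of such a cache, assuming a networkx graph, tuple-valued node names, and a "word-occurs" edge label; these are illustrative choices, not the paper's actual implementation.

    # Minimal sketch of a graph-structured vocabulary cache (illustrative,
    # not the paper's exact implementation). Each distinct word gets one
    # cache node; edges link that node to every occurrence of the word.
    import networkx as nx

    def build_graph_with_cache(tokens):
        g = nx.MultiDiGraph()
        for i, tok in enumerate(tokens):
            g.add_node(("occurrence", i), text=tok)   # one node per token occurrence
            cache_node = ("cache", tok)
            if cache_node not in g:
                g.add_node(cache_node)                # new word -> new cache node
            g.add_edge(cache_node, ("occurrence", i), label="word-occurs")
        return g

    g = build_graph_with_cache(["getUserName", "userId", "getUserName"])
    print(g.number_of_nodes())  # 5 nodes: 3 occurrences + 2 cached words

In the paper's setting, the occurrence nodes would be vertices of the program's AST-derived graph, so the cache lets a graph neural network propagate information between all uses of the same out-of-vocabulary word.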
