Open Vocabulary Learning on Source Code with a Graph-Structured Cache

Machine learning models that take computer program source code as input typically use Natural Language Processing (NLP) techniques. A major challenge, however, is that code is written using an open, rapidly changing vocabulary, driven for example by the coinage of new variable and method names; reasoning over such a vocabulary is not something most NLP methods are designed to do. We introduce a Graph-Structured Cache to address this problem: the cache contains a node for each new word the model encounters, with edges connecting each word to its occurrences in the code. We find that combining this graph-structured cache strategy with recent Graph-Neural-Network-based models for supervised learning on code improves the models' performance on a code completion task and a variable naming task, with over 100% relative improvement on the latter, at the cost of a moderate increase in computation time.
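To make the data structure concrete, below is a minimal sketch of such a cache, assuming a networkx graph, tuple-valued node names, and a "word-occurs" edge label; these are illustrative choices, not the paper's actual implementation.

    # Minimal sketch of a graph-structured vocabulary cache (illustrative,
    # not the paper's exact implementation). Each distinct word gets one
    # cache node; edges link that node to every occurrence of the word.
    import networkx as nx

    def build_graph_with_cache(tokens):
        g = nx.MultiDiGraph()
        for i, tok in enumerate(tokens):
            g.add_node(("occurrence", i), text=tok)   # one node per token occurrence
            cache_node = ("cache", tok)
            if cache_node not in g:
                g.add_node(cache_node)                # new word -> new cache node
            g.add_edge(cache_node, ("occurrence", i), label="word-occurs")
        return g

    g = build_graph_with_cache(["getUserName", "userId", "getUserName"])
    print(g.number_of_nodes())  # 5 nodes: 3 occurrences + 2 cached words

In the paper's setting, the occurrence nodes would be vertices of the program's AST-derived graph, so the cache lets a graph neural network propagate information between all uses of the same out-of-vocabulary word.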
