Code Prediction by Feeding Trees to Transformers

Code prediction, more specifically autocomplete, has become an essential feature in modern IDEs. Autocomplete is more effective when the desired next token is at (or close to) the top of the list of potential completions offered by the IDE at the cursor position. This is where the strength of the underlying machine learning system that produces a ranked list of potential completions comes into play. We advance the state of the art in the accuracy of code prediction (next-token prediction) used in autocomplete systems. Our work uses Transformers as the base neural architecture. We show that making the Transformer architecture aware of the syntactic structure of code widens the margin by which a Transformer-based system outperforms previous systems: it exceeds the accuracy of several state-of-the-art next-token prediction systems by margins ranging from 14% to 18%. We present several ways of communicating the code structure to the Transformer, which is fundamentally built for processing sequence data. We provide a comprehensive experimental evaluation of our proposal, along with alternative design choices, on a standard Python dataset as well as on a Facebook-internal Python corpus. Our code and data-preparation pipeline will be released as open source.
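
The key idea, making a sequence model aware of syntactic structure, can be illustrated with a simple serialization. The sketch below is not the paper's encoding but a minimal, hypothetical example of one way to communicate tree structure to a sequence model: it linearizes a Python AST by pre-order depth-first traversal with the standard ast module, interleaving node types with leaf values, so the resulting token stream could be fed to a stock Transformer language model.

    import ast

    def linearize_ast(source):
        """Pre-order depth-first traversal of a Python AST.

        Emits one token per node (the node type), plus the literal value for
        leaves that carry one (identifiers, parameter names, attribute names,
        constants). The flat token sequence can then be consumed by an
        ordinary sequence model such as a Transformer.
        """
        tokens = []

        def visit(node):
            tokens.append(type(node).__name__)        # node-type token, e.g. "FunctionDef"
            if isinstance(node, ast.Name):
                tokens.append(node.id)                # variable name, e.g. "a"
            elif isinstance(node, ast.arg):
                tokens.append(node.arg)               # parameter name
            elif isinstance(node, ast.Attribute):
                tokens.append(node.attr)              # attribute name, e.g. "append"
            elif isinstance(node, ast.Constant):
                tokens.append(repr(node.value))       # literal value, e.g. "1"
            for child in ast.iter_child_nodes(node):
                visit(child)

        visit(ast.parse(source))
        return tokens

    print(linearize_ast("def add(a, b):\n    return a + b"))
    # On Python 3.8+ this prints:
    # ['Module', 'FunctionDef', 'arguments', 'arg', 'a', 'arg', 'b',
    #  'Return', 'BinOp', 'Name', 'a', 'Load', 'Add', 'Name', 'b', 'Load']

Design choices such as whether node types and values are kept as separate tokens, or how each node's position in the tree is encoded, are the kind of alternatives the paper's evaluation covers.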
