Learning to Generate Code Sketches

Traditional generative models are limited to predicting sequences of terminal tokens. However, ambiguities in the generation task may lead to incorrect outputs. To address this, we introduce GRAMMFORMERs, transformer-based grammar-guided models that learn (without explicit supervision) to generate sketches: sequences of tokens with holes. Through reinforcement learning, GRAMMFORMERs learn to introduce holes, avoiding the generation of incorrect tokens where the target task is ambiguous. We train GRAMMFORMERs for statement-level source code completion, i.e. generating code snippets from an ambiguous user intent such as a partial code context. We evaluate GRAMMFORMERs on code completion for C# and Python and show that they generate 10-50% more accurate sketches than traditional generative models and 37-50% longer sketches than sketch-generating baselines trained with similar techniques.
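To make the idea concrete, the sketch below illustrates grammar-guided sketch generation in the spirit described above: at each step the generator picks the left-most unexpanded nonterminal and either applies a grammar production or freezes the nonterminal into a hole. This is a minimal, hypothetical illustration; the names (Grammar, model.score_expansions, HOLE, generate_sketch) and the greedy decoding loop are assumptions for exposition, not the paper's actual interface or training procedure.

# Hypothetical illustration of grammar-guided sketch generation.
# Assumes a `grammar` object with is_nonterminal(symbol) and rhs(production),
# and a `model` whose score_expansions(sequence, idx) returns a dict mapping
# each applicable production (or the special HOLE action) to a score.

HOLE = "<hole>"  # placeholder standing in for an unpredictable subtree

def generate_sketch(grammar, model, root, max_steps=64):
    """Expand nonterminals left to right; each step either applies a
    production or replaces the nonterminal with a hole."""
    sequence = [root]
    for _ in range(max_steps):
        # Find the left-most unexpanded nonterminal, if any remain.
        idx = next((i for i, s in enumerate(sequence)
                    if grammar.is_nonterminal(s)), None)
        if idx is None:
            break  # only terminals and holes remain: the sketch is complete

        # Score all applicable expansions plus the "emit a hole" action.
        scores = model.score_expansions(sequence, idx)
        best = max(scores, key=scores.get)

        if best == HOLE:
            # The model is unsure about this subtree: leave a hole rather
            # than risk generating incorrect tokens.
            sequence[idx] = HOLE
        else:
            # Splice in the production's right-hand side (a mix of
            # terminals and new nonterminals).
            sequence[idx:idx + 1] = grammar.rhs(best)
    return sequence

In the paper's setting, the decision of when to choose the hole action is learned with reinforcement learning rather than hard-coded; the loop above only shows where that decision sits in the generation process.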
