Code Generation from Supervised Code Embeddings

Code generation, the task of producing source code from natural language, benefits applications such as smarter Integrated Development Environments (IDEs) and more effective code retrieval. Traditional approaches are based on matching similar code snippets, while researchers have recently paid more attention to machine learning, especially the encoder-decoder framework. For code generation, most encoder-decoder frameworks suffer from two drawbacks: (a) a code snippet is usually much longer than its corresponding natural language description, which makes the two hard to align, especially for word-level encoders; (b) code snippets with the same functionality can be implemented in many different ways, which may differ completely at the word level. To address drawback (a), we propose a new Supervised Code Embedding (SCE) model that promotes the alignment between natural language and code. To address drawback (b), we propose, with the help of the Abstract Syntax Tree (AST), a new distributed representation of code snippets. To evaluate our approaches, we build a variant of the encoder-decoder model that generates code with the help of the pre-trained code embedding. Experiments on several open-source datasets indicate that our approaches are effective and outperform the state of the art.
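The AST-based representation mentioned for drawback (b) can be illustrated with a minimal sketch. The code below is not the paper's SCE model: it parses a snippet with Python's ast module, enumerates root-to-leaf node-type paths (in the spirit of code2vec-style path representations), and averages path embeddings into a fixed-size vector. The names PathEmbedder, EMBED_DIM, and VOCAB_SIZE, as well as the hashing scheme, are illustrative assumptions, not details from the paper.

    # Minimal sketch (not the paper's implementation) of an AST-based
    # distributed representation of a code snippet.
    import ast
    import hashlib
    import numpy as np

    EMBED_DIM = 128      # assumed embedding size
    VOCAB_SIZE = 10000   # assumed path-vocabulary size

    def ast_paths(source: str) -> list[str]:
        """Collect root-to-leaf paths of AST node-type names."""
        tree = ast.parse(source)
        paths = []

        def walk(node, prefix):
            prefix = prefix + [type(node).__name__]
            children = list(ast.iter_child_nodes(node))
            if not children:                  # leaf node: record full path
                paths.append("/".join(prefix))
            for child in children:
                walk(child, prefix)

        walk(tree, [])
        return paths

    class PathEmbedder:
        """Maps a snippet to a fixed-size vector by averaging the
        embeddings of its syntactic paths (randomly initialized here;
        trainable with supervision in a full system)."""

        def __init__(self, rng=np.random.default_rng(0)):
            self.table = rng.normal(size=(VOCAB_SIZE, EMBED_DIM))

        def _index(self, path: str) -> int:
            # Hash each syntactic path into a fixed vocabulary slot.
            digest = hashlib.md5(path.encode()).hexdigest()
            return int(digest, 16) % VOCAB_SIZE

        def embed(self, source: str) -> np.ndarray:
            vecs = [self.table[self._index(p)] for p in ast_paths(source)]
            return np.mean(vecs, axis=0)

    if __name__ == "__main__":
        snippet = "def add(a, b):\n    return a + b"
        vec = PathEmbedder().embed(snippet)
        print(vec.shape)  # (128,)

Because the representation is built from syntactic paths rather than surface tokens, two snippets with the same structure but different identifiers map to similar vectors, which is exactly the robustness to word-level variation that drawback (b) calls for. In a real system the embedding table would be trained jointly with the natural-language encoder rather than left at its random initialization.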
