Incorporating External Knowledge through Pre-training for Natural Language to Code Generation

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at this https URL.

[1]  Mirella Lapata,et al.  Language to Logical Form with Neural Attention , 2016, ACL.

[2]  Andrew Chou,et al.  Semantic Parsing on Freebase from Question-Answer Pairs , 2013, EMNLP.

[3]  Daniel S. Weld,et al.  StaQC: A Systematically Mined Question-Code Dataset from Stack Overflow , 2018, WWW.

[4]  Claire Gardent,et al.  Sequence-based Structured Prediction for Semantic Parsing , 2016, ACL.

[5]  Stephen E. Robertson,et al.  A probabilistic model of information retrieval: development and comparative experiments - Part 1 , 2000, Inf. Process. Manag..

[6]  Philip J. Guo,et al.  Two studies of opportunistic programming: interleaving web foraging, learning, and writing code , 2009, CHI.

[7]  Michael D. Ernst,et al.  NL2Bash: A Corpus and Semantic Parser for Natural Language Interface to the Linux Operating System , 2018, LREC.

[8]  Mirella Lapata,et al.  Coarse-to-Fine Decoding for Neural Semantic Parsing , 2018, ACL.

[9]  Luyao Chen,et al.  CoSQL: A Conversational Text-to-SQL Challenge Towards Cross-Domain Natural Language Interfaces to Databases , 2019, EMNLP.

[10]  Wang Ling,et al.  Latent Predictor Networks for Code Generation , 2016, ACL.

[11]  Richard Socher,et al.  Seq2SQL: Generating Structured Queries from Natural Language using Reinforcement Learning , 2018, ArXiv.

[12]  Yoshua Bengio,et al.  Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation , 2014, EMNLP.

[13]  Tao Yu,et al.  SParC: Cross-Domain Semantic Parsing in Context , 2019, ACL.

[14]  Dan Klein,et al.  Abstract Syntax Networks for Code Generation and Semantic Parsing , 2017, ACL.

[15]  Graham Neubig,et al.  TRANX: A Transition-based Neural Abstract Syntax Parser for Semantic Parsing and Code Generation , 2018, EMNLP.

[16]  Graham Neubig,et al.  Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow , 2018, 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR).

[17]  Xiaodong Gu,et al.  Deep API learning , 2016, SIGSOFT FSE.

[18]  Jayant Krishnamurthy,et al.  Neural Semantic Parsing with Type Constraints for Semi-Structured Tables , 2017, EMNLP.

[19]  Yoav Artzi,et al.  Learning to Map Context-Dependent Sentences to Executable Formal Queries , 2018, NAACL.

[20]  Tao Yu,et al.  Spider: A Large-Scale Human-Labeled Dataset for Complex and Cross-Domain Semantic Parsing and Text-to-SQL Task , 2018, EMNLP.

[21]  Alvin Cheung,et al.  Mapping Language to Code in Programmatic Context , 2018, EMNLP.

[22]  Alexander I. Rudnicky,et al.  Expanding the Scope of the ATIS Task: The ATIS-3 Corpus , 1994, HLT.

[23]  Raymond J. Mooney,et al.  Language to Code: Learning Semantic Parsers for If-This-Then-That Recipes , 2015, ACL.

[24]  Maria Leonor Pacheco,et al.  of the Association for Computational Linguistics: , 2001 .

[25]  Chen Liang,et al.  Neural Symbolic Machines: Learning Semantic Parsers on Freebase with Weak Supervision , 2016, ACL.

[26]  Scott R. Klemmer,et al.  Example-centric programming: integrating web search into the development environment , 2010, CHI.

[27]  Luke Zettlemoyer,et al.  JuICe: A Large Scale Distantly Supervised Dataset for Open Domain Context-based Code Generation , 2019, EMNLP.

[28]  Raymond J. Mooney,et al.  Learning to Parse Database Queries Using Inductive Logic Programming , 1996, AAAI/IAAI, Vol. 2.

[29]  Graham Neubig,et al.  Reranking for Neural Semantic Parsing , 2019, ACL.

[30]  Huan Sun,et al.  CoaCor: Code Annotation for Code Retrieval with Reinforcement Learning , 2019, WWW.