LEVER: Learning to Verify Language-to-Code Generation with Execution

The advent of pre-trained code language models (CodeLMs) has led to significant progress in language-to-code generation. State-of-the-art approaches in this area combine CodeLM decoding with sample pruning and reranking using test cases or heuristics over the execution results. However, it is challenging to obtain test cases for many real-world language-to-code applications, and heuristics often fail to capture the semantic features of the execution results, such as data type and value range, which often indicate the correctness of the program. In this work, we propose LEVER, a simple approach to improve language-to-code generation by learning to verify the generated programs with their execution results. Specifically, we train verifiers to determine whether a program sampled from the CodeLM is correct based on the natural language input, the program itself, and its execution results. The sampled programs are reranked by combining the verification score with the CodeLM generation probability, and marginalizing over programs with the same execution results. On four datasets across the domains of table QA, math QA and basic Python programming, LEVER consistently improves over the base CodeLMs (4.6% to 10.9% with code-davinci-002) and achieves new state-of-the-art results on all of them.
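
The reranking step described above can be made concrete with a minimal sketch. The snippet below assumes each sampled program comes with its CodeLM log-probability and a hashable execution result, and that a trained verifier exposes a callable returning a correctness probability; the names `rerank_with_verifier` and `verifier_prob` are illustrative, not taken from the paper's released code.

```python
import math
from collections import defaultdict

def rerank_with_verifier(samples, verifier_prob):
    """Pick the best program from CodeLM samples, LEVER-style.

    samples: list of (program, logprob, exec_result) triples, where
        exec_result must be hashable so equivalent programs can be grouped.
    verifier_prob: callable returning the verifier's estimated probability
        that a (program, exec_result) pair is correct.
    """
    # Marginalize: pool the joint score of all programs that produce
    # the same execution result, treating them as semantically equivalent.
    result_mass = defaultdict(float)
    for program, logprob, exec_result in samples:
        # Combine the generation probability with the verification score.
        result_mass[exec_result] += math.exp(logprob) * verifier_prob(program, exec_result)

    # Select the execution result with the largest pooled score and
    # return one program that produces it.
    best_result = max(result_mass, key=result_mass.get)
    for program, _, exec_result in samples:
        if exec_result == best_result:
            return program
```

A toy invocation with a stub verifier shows why marginalization matters: the two programs that evaluate to the same result pool their probability mass and outrank a single higher-probability sample that the verifier is skeptical of.

```python
samples = [
    ("df['age'].max()", -1.2, 47),
    ("max(df['age'])",  -2.1, 47),    # same result: mass is pooled with the first
    ("df['age'].sum()", -0.9, 1530),  # higher LM probability, but likely wrong
]
stub_verifier = lambda program, result: 0.9 if result == 47 else 0.1
print(rerank_with_verifier(samples, stub_verifier))  # -> "df['age'].max()"
```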
