Making Language Models Better Reasoners with Step-Aware Verifier

Few-shot learning is a challenging task that requires language models to generalize from limited examples. Large language models such as GPT-3 and PaLM have made impressive progress in this area, but they still struggle with reasoning tasks such as GSM8K, a benchmark of grade-school math word problems. To improve their reasoning skills, previous work proposed chain-of-thought prompting, which guides the language model to produce a series of intermediate reasoning steps before giving the final answer, raising the GSM8K solve rate from 17.9% to 58.1%. In this paper, we present DiVeRSe (Diverse Verifier on Reasoning Step), a novel approach that further enhances the reasoning capability of language models. DiVeRSe has three main components: first, it generates diverse prompts to explore different reasoning paths for the same question; second, it uses a verifier to score each generated reasoning path and combines the candidate answers through weighted voting, downweighting incorrect answers; and third, it verifies each reasoning step individually rather than only the reasoning chain as a whole. We evaluate DiVeRSe on the latest language model, code-davinci-002, and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (e.g., improving GSM8K from 74.4% to 83.2%).
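
To make the three components concrete, here is a minimal Python sketch of the inference procedure described above: sampling reasoning paths from several diverse prompts, scoring each path with a step-aware verifier, and choosing the answer by verifier-weighted voting. All names (sample_reasoning_path, score_step, diverse_answer) and the dummy data are illustrative assumptions, not the authors' released code; the real system would call a large language model and a trained verifier in their place.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import List


@dataclass
class ReasoningPath:
    steps: List[str]   # intermediate reasoning steps produced by the model
    answer: str        # final answer extracted from the last step


def sample_reasoning_path(question: str, prompt_id: int) -> ReasoningPath:
    """Stand-in for querying an LLM with one of several diverse prompts.

    In the real system each prompt contains different exemplars, so repeated
    sampling explores different reasoning paths for the same question.
    """
    answer = random.choice(["58", "58", "56"])  # dummy candidate answers
    steps = [f"step {i + 1} of a path from prompt {prompt_id}" for i in range(3)]
    return ReasoningPath(steps=steps, answer=answer)


def score_step(question: str, step: str) -> float:
    """Stand-in for the step-aware verifier: estimated probability the step is correct."""
    return random.uniform(0.6, 1.0)  # dummy score in place of a trained verifier


def path_score(question: str, path: ReasoningPath) -> float:
    """Aggregate step-level verifier scores into one path-level correctness score."""
    scores = [score_step(question, step) for step in path.steps]
    return sum(scores) / len(scores)


def diverse_answer(question: str, num_prompts: int = 5, samples_per_prompt: int = 4) -> str:
    """Verifier-weighted voting over reasoning paths sampled from diverse prompts."""
    votes = defaultdict(float)
    for prompt_id in range(num_prompts):
        for _ in range(samples_per_prompt):
            path = sample_reasoning_path(question, prompt_id)
            votes[path.answer] += path_score(question, path)
    # The candidate answer with the largest total verifier-weighted vote wins.
    return max(votes, key=votes.get)


if __name__ == "__main__":
    print(diverse_answer("Natalia sold clips to 48 of her friends ..."))
```

In this sketch, paths that reach an incorrect answer still cast votes, but their low verifier scores shrink their weight, which is why weighted voting can outperform simple majority voting over sampled chains.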
