When Do Program-of-Thoughts Work for Reasoning?

The reasoning capabilities of Large Language Models (LLMs) play a pivotal role in the realm of embodied artificial intelligence. Although there are effective methods, such as program-of-thought prompting, that use programming language to help LLMs tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose the Complexity-Impacted Reasoning Score (CIRS), which combines structural and logical attributes to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree (AST) to encode structural information and calculate logical complexity by considering the Halstead difficulty and the cyclomatic complexity. Through an empirical analysis, we find that not all code data, at every level of complexity, can be learned or understood by LLMs: an optimal level of complexity is critical to improving reasoning abilities through program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and to code data filtering for code generation tasks. Extensive results demonstrate the effectiveness of our proposed approach. Code will be integrated into the EasyInstruct framework at https://github.com/zjunlp/EasyInstruct.
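The abstract names the ingredients of CIRS (AST-encoded structure, Halstead difficulty, cyclomatic complexity) but not how they are combined. Below is a minimal sketch in Python using only the standard-library `ast` module; the helper names (`structural_score`, `halstead_difficulty`, `cyclomatic_complexity`, `cirs`, `stratify`) and the multiplicative combination are hypothetical illustrations under those assumptions, not the authors' actual formula.

```python
import ast

# Sketch of a CIRS-style score, assuming Python code data. The exact
# combination of structural and logical complexity is illustrative.

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp)


def structural_score(tree: ast.AST) -> float:
    """Encode structural information from the AST: node count x max depth."""
    def depth(node: ast.AST) -> int:
        return 1 + max((depth(c) for c in ast.iter_child_nodes(node)), default=0)
    n_nodes = sum(1 for _ in ast.walk(tree))
    return float(n_nodes * depth(tree))  # illustrative combination


def cyclomatic_complexity(tree: ast.AST) -> int:
    """McCabe-style count: one plus the number of decision points."""
    return 1 + sum(isinstance(n, BRANCH_NODES) for n in ast.walk(tree))


def halstead_difficulty(tree: ast.AST) -> float:
    """Rough Halstead difficulty D = (n1 / 2) * (N2 / n2), where n1 is the
    number of distinct operators, n2 distinct operands, N2 total operands."""
    operators, operands = [], []
    for node in ast.walk(tree):
        if isinstance(node, (ast.BinOp, ast.UnaryOp, ast.BoolOp)):
            operators.append(type(node.op).__name__)
        elif isinstance(node, ast.Compare):
            operators.extend(type(op).__name__ for op in node.ops)
        elif isinstance(node, ast.Name):
            operands.append(node.id)
        elif isinstance(node, ast.Constant):
            operands.append(repr(node.value))
    n1 = len(set(operators))
    n2 = max(len(set(operands)), 1)  # guard against empty programs
    return (n1 / 2) * (len(operands) / n2)


def cirs(code: str) -> float:
    """Illustrative CIRS-style score: structural score x logical complexity."""
    tree = ast.parse(code)
    logical = halstead_difficulty(tree) * cyclomatic_complexity(tree)
    return structural_score(tree) * logical


def stratify(samples: list[str], low: float, high: float) -> list[str]:
    """Hypothetical filtering step: keep samples in a mid-complexity band,
    mirroring the finding that an optimal complexity level helps most."""
    return [s for s in samples if low <= cirs(s) <= high]


print(cirs("def f(xs):\n    return sum(x for x in xs if x > 0)"))
```

Multiplying the sub-scores is one plausible design choice: it rewards code that is complex along both the structural and logical axes, while the stratification step then discards samples whose score falls outside the band that (per the paper's finding) LLMs can actually learn from.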
