
TheoremQA: A Theorem-driven Question Answering dataset

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math problems such as GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's ability to solve these problems is unparalleled, achieving an accuracy of 51% with Program-of-Thoughts prompting. All existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' ability to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
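As a rough illustration of the Program-of-Thoughts idea mentioned in the abstract, the sketch below shows a model-written program being executed to obtain the answer, which is then compared with a reference value. Everything here is an assumption made for illustration and is not taken from the TheoremQA repository: the sample question, the `generated_program` string, the helper names `run_program` and `is_correct`, and the 1% relative tolerance.

```python
import math

# Hypothetical model output for the illustrative question
# "Approximate e^0.1 with the degree-3 Taylor polynomial of e^x at 0."
# With Program-of-Thoughts prompting, the model writes a program like this
# instead of carrying out the arithmetic in prose.
generated_program = """
import math
x = 0.1
ans = sum(x**k / math.factorial(k) for k in range(4))
"""


def run_program(source: str) -> float:
    """Execute a generated program and return the value it stores in `ans`."""
    scope = {}
    exec(source, scope)  # acceptable only for trusted strings in a sketch
    return float(scope["ans"])


def is_correct(pred: float, gold: float, rel_tol: float = 1e-2) -> bool:
    """Assumed scoring rule: numeric match within a relative tolerance."""
    return math.isclose(pred, gold, rel_tol=rel_tol)


prediction = run_program(generated_program)
reference = math.exp(0.1)  # exact value used as the reference answer
print(prediction, is_correct(prediction, reference))
```

Delegating the computation to an interpreter means the model only has to pick the right theorem and set up the calculation. A real evaluation would run generated programs in a sandbox rather than with a bare `exec`.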

Wenhu Chen | Xinyi Wang | Ming Yin | Xueguang Ma | Yixin Wan | Pan Lu | Tony Xia | Jianyu Xu | Max Ku
