TheoremQA: A Theorem-driven Question Answering Dataset

Recent LLMs such as GPT-4 and PaLM-2 have made tremendous progress on fundamental math benchmarks like GSM8K, achieving over 90% accuracy. However, their ability to solve more challenging math problems that require domain-specific knowledge (i.e., theorems) has yet to be investigated. In this paper, we introduce TheoremQA, the first theorem-driven question-answering dataset designed to evaluate AI models' ability to apply theorems to solve challenging science problems. TheoremQA is curated by domain experts and contains 800 high-quality questions covering 350 theorems (e.g., Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem, Elasticity Theorem) from Math, Physics, EE&CS, and Finance. We evaluate a wide spectrum of 16 large language and code models with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. We find that GPT-4's ability to solve these problems is unparalleled, reaching an accuracy of 51% with Program-of-Thoughts prompting, while all existing open-source models score below 15%, barely surpassing the random-guess baseline. Given the diversity and broad coverage of TheoremQA, we believe it can serve as a better benchmark for evaluating LLMs' capabilities to solve challenging science problems. The data and code are released at https://github.com/wenhuchen/TheoremQA.
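
To make the Program-of-Thoughts (PoT) strategy concrete, the sketch below shows the basic loop: instead of asking the model to reason in natural language and state an answer (as in Chain-of-Thoughts), PoT asks it to emit executable Python whose result is taken as the answer, so arithmetic is offloaded to the interpreter. This is a minimal illustration under stated assumptions, not the released evaluation harness; the prompt template and the `query_llm` function are hypothetical stand-ins for whatever model API is used.

```python
# Minimal Program-of-Thoughts (PoT) sketch: the model writes Python that
# stores its final numeric answer in `ans`; we execute that code rather
# than trusting the model's own arithmetic.

POT_TEMPLATE = """Solve the problem by writing Python code.
Store the final numeric answer in a variable named `ans`.

Problem: {question}

# Python solution:
"""


def query_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real API client."""
    raise NotImplementedError


def pot_answer(question: str) -> float:
    # Build the PoT prompt and ask the model for a Python solution.
    code = query_llm(POT_TEMPLATE.format(question=question))
    namespace: dict = {}
    # Caution: model-generated code is untrusted; run it in a sandbox.
    exec(code, namespace)
    return namespace["ans"]


# Example TheoremQA-style usage (Taylor's theorem territory):
# pot_answer("Use a second-order Taylor expansion of e^x around 0 "
#            "to approximate e^0.1. Round to 4 decimal places.")
```

The design point is that the final answer comes from executing the generated program, which is why code-capable models benefit from PoT on computation-heavy theorem questions.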
