Solving Quantitative Reasoning Problems with Language Models

Language models have achieved remarkable performance on a wide range of tasks that require natural language understanding. Nevertheless, state-of-the-art models have generally struggled with tasks that require quantitative reasoning, such as solving mathematics, science, and engineering problems at the college level. To help close this gap, we introduce Minerva, a large language model pretrained on general natural language data and further trained on technical content. The model achieves state-of-the-art performance on technical benchmarks without the use of external tools. We also evaluate our model on over two hundred undergraduate-level problems in physics, biology, chemistry, economics, and other sciences that require quantitative reasoning, and find that the model can correctly answer nearly a third of them.
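Scoring a model on such benchmarks requires parsing a final answer out of a free-form generated solution and comparing it to the target. The sketch below illustrates one minimal way this can be done; the `Final Answer:` delimiter and the helper names are hypothetical conventions for illustration, not the paper's actual evaluation harness.

```python
import re


def extract_final_answer(completion):
    """Return the text after a 'Final Answer:' marker, if present.

    The marker is a hypothetical convention; real evaluations parse
    whatever delimiter the prompt format instructs the model to emit.
    """
    match = re.search(r"Final Answer:\s*(.+)", completion)
    return match.group(1).strip() if match else None


def is_correct(completion, target, tol=1e-6):
    """Compare the extracted answer to the target, numerically when possible."""
    answer = extract_final_answer(completion)
    if answer is None:
        return False
    try:
        # Numeric comparison tolerates formatting differences like "2.0" vs "2".
        return abs(float(answer) - float(target)) <= tol
    except ValueError:
        # Fall back to exact string match for non-numeric answers.
        return answer == target


sample = "The ball falls for t = sqrt(2h/g) = 2.0 s.\nFinal Answer: 2.0"
print(is_correct(sample, "2"))  # numeric match despite different formatting
```

In practice, answer matching is usually more forgiving than this (e.g. normalizing units or simplifying symbolic expressions before comparison), but the structure — generate, extract, compare — is the same.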
