How well do Large Language Models perform in Arithmetic tasks?

Large language models have exhibited emergent abilities, including chain-of-thought reasoning, that allow them to answer math word problems step by step. Solving math word problems requires not only decomposing a problem via chain-of-thought reasoning but also correctly calculating the arithmetic expression at each step. To the best of our knowledge, no prior work has focused on evaluating the arithmetic ability of large language models. In this work, we propose an arithmetic dataset, MATH 401, to test the latest large language models, including GPT-4, ChatGPT, InstructGPT, Galactica, and LLaMA, on a variety of arithmetic expressions, and we provide a detailed analysis of their arithmetic abilities. MATH 401 and the evaluation code are released at \url{https://github.com/GanjinZero/math401-llm}.
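To make the evaluation setting concrete, below is a minimal sketch of how a model's free-form reply to an arithmetic expression might be scored against ground truth. This is an illustration only, not the authors' released evaluation code (see the repository above); the answer-extraction regex, the `rel_tol` tolerance, and the helper names are assumptions.

```python
# Illustrative sketch (NOT the authors' released code): score an LLM's
# free-form arithmetic reply against the true value of the expression.
import re
from typing import Optional


def extract_number(text: str) -> Optional[float]:
    """Pull the last decimal number out of a model's free-form reply."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(model_reply: str, expression: str, rel_tol: float = 1e-3) -> bool:
    """Check the extracted answer against the value of the expression.

    eval() is acceptable here only because the harness controls the
    expressions; arbitrary model output is never passed to it.
    """
    truth = eval(expression)  # e.g. "13 * 24" -> 312
    predicted = extract_number(model_reply)
    if predicted is None:
        return False
    return abs(predicted - truth) <= rel_tol * max(1.0, abs(truth))


if __name__ == "__main__":
    print(is_correct("13 * 24 = 312.", "13 * 24"))      # True
    print(is_correct("The answer is 310.", "13 * 24"))  # False
```

A full harness would need additional parsing rules (scientific notation, fractions, multi-number replies), which are omitted here for brevity.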
