LILA: A Unified Benchmark for Mathematical Reasoning

Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks ranging from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 existing datasets with task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to correct answers. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (an average relative improvement of 21.83% in F1 score over single-task models), while the best-performing model obtains only 60.40% F1, indicating substantial room for improvement in general mathematical reasoning and understanding.
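To make the program-as-solution format concrete, the sketch below shows what a single LILA-style instance might look like: a natural-language question paired with a Python program whose execution yields the answer, so the program itself serves as an explainable solution. The field names (instruction, question, program, answer) and the word problem are illustrative assumptions, not the benchmark's official schema.

import inspect

def solution():
    """Sally has 27 apples and gives 4 apples to each of her
    5 friends. How many apples does she have left?"""
    apples = 27
    friends = 5
    given_per_friend = 4
    return apples - friends * given_per_friend  # 27 - 5 * 4 = 7

# A hypothetical instance: instruction + question + program text + executed answer.
instance = {
    "instruction": "Solve the arithmetic word problem and answer with a number.",
    "question": solution.__doc__,
    "program": inspect.getsource(solution),  # the solution is stored as program text,
                                             # giving an explainable derivation
    "answer": solution(),                    # 7, obtained by executing the program
}

print(instance["answer"])  # 7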
