
Many real-world problems require the combined application of multiple reasoning abilities: employing suitable abstractions, commonsense knowledge, and creative synthesis of problem-solving strategies. To help advance AI systems towards such capabilities, we propose a new reasoning challenge, namely Fermi Problems (FPs): questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible. For example, "How much would the sea level rise if all ice in the world melted?" FPs are commonly used in quizzes and interviews to bring out and evaluate the creative reasoning abilities of humans. To do the same for AI systems, we present two datasets: 1) a collection of 1k real-world FPs sourced from quizzes and olympiads; and 2) a bank of 10k synthetic FPs of intermediate complexity to serve as a sandbox for the harder real-world challenge. In addition to question-answer pairs, the datasets contain detailed solutions in the form of an executable program and supporting facts, enabling supervision and evaluation of intermediate steps. We demonstrate that even extensively fine-tuned large-scale language models perform poorly on these datasets, producing estimates that are, on average, off by two orders of magnitude. Our contribution is thus the crystallization of several unsolved AI problems into a single new challenge that we hope will spur further advances in building systems that can reason.
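To make the solution format concrete, here is a minimal sketch (ours, not drawn from the released datasets) of what an executable-program solution with supporting facts might look like for the sea-level question, written in Python. The ice volume, ocean area, and density ratio are assumed order-of-magnitude facts, and order_of_magnitude_error is a hypothetical helper mirroring the "orders of magnitude" evaluation described above:

    import math

    # Assumed supporting facts (rough, order-of-magnitude values):
    ICE_VOLUME_M3 = 3e16    # total land ice (Antarctica, Greenland, glaciers), ~3e7 km^3
    OCEAN_AREA_M2 = 3.6e14  # surface area of the world's oceans
    ICE_TO_WATER = 0.9      # ice is ~90% as dense as liquid water

    def sea_level_rise_m() -> float:
        """Fermi estimate: spread the meltwater volume over the ocean surface."""
        meltwater_m3 = ICE_VOLUME_M3 * ICE_TO_WATER
        return meltwater_m3 / OCEAN_AREA_M2

    def order_of_magnitude_error(pred: float, gold: float) -> float:
        """|log10(pred/gold)|: a value of 2.0 means off by two orders of magnitude."""
        return abs(math.log10(pred / gold))

    estimate = sea_level_rise_m()  # ~75 m
    print(f"Estimated rise: {estimate:.0f} m")
    print(f"Error vs. a 70 m gold answer: "
          f"{order_of_magnitude_error(estimate, 70.0):.2f} orders of magnitude")

Decomposing the answer into named supporting facts and a small program is what makes intermediate steps inspectable: each fact and each arithmetic step can be supervised or scored on its own, rather than only the final number.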
