NOAHQA: Numerical Reasoning with Interpretable Graph Question Answering Dataset

While diverse question answering (QA) datasets have been proposed and have contributed significantly to the development of deep learning models for QA tasks, existing datasets fall short in two respects. First, we lack QA datasets covering complex questions that involve both the answers and the reasoning processes needed to reach them. As a result, state-of-the-art QA research on numerical reasoning still focuses on simple calculations and does not provide the mathematical expressions or evidence justifying the answers. Second, although the QA community has devoted much effort to improving the interpretability of QA models, these models still fail to explicitly show the reasoning process, such as the order in which evidence is used and the interactions between different pieces of evidence. To address these shortcomings, we introduce NOAHQA, a conversational, bilingual QA dataset whose questions require numerical reasoning with compound mathematical expressions. With NOAHQA, we develop an interpretable reasoning graph as well as an appropriate evaluation metric to measure answer quality. We evaluate state-of-the-art QA models trained on existing QA datasets and show that the best of them achieves only a 55.5 exact match score on NOAHQA, while human performance is 89.7. We also present a new QA model for generating reasoning graphs; its reasoning-graph score still lags human performance by a large margin, e.g., 28 points. The dataset and code are publicly available.¹
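To make the two evaluation notions above concrete, here is a minimal Python sketch of a toy reasoning graph for the compound expression 3 + 2 × 4, together with a standard exact-match check on the final answer. The graph structure and the edge-level F1 score are illustrative assumptions only; the abstract does not specify the paper's actual graph representation or reasoning-graph metric.

```python
# Illustrative sketch only (not the NOAHQA implementation): a reasoning graph
# whose nodes are evidence values and arithmetic operators, with edges feeding
# operands into operators; exact match on the answer plus a hypothetical
# edge-level F1 as a stand-in for a reasoning-graph quality metric.

from dataclasses import dataclass


@dataclass(frozen=True)
class ReasoningGraph:
    nodes: frozenset  # evidence values and operators, e.g. {"3", "2", "4", "*", "+"}
    edges: frozenset  # operand -> operator links, e.g. {("2", "*"), ("*", "+")}


def exact_match(pred: str, gold: str) -> float:
    """Standard exact-match score on the final numeric answer."""
    return float(pred.strip() == gold.strip())


def edge_f1(pred: ReasoningGraph, gold: ReasoningGraph) -> float:
    """Hypothetical graph metric: F1 over the predicted vs. gold edge sets."""
    if not pred.edges or not gold.edges:
        return float(pred.edges == gold.edges)
    common = pred.edges & gold.edges
    p = len(common) / len(pred.edges)
    r = len(common) / len(gold.edges)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


# Gold graph for "3 + 2 * 4": the multiplication result feeds the addition.
gold = ReasoningGraph(
    nodes=frozenset({"3", "2", "4", "*", "+"}),
    edges=frozenset({("2", "*"), ("4", "*"), ("3", "+"), ("*", "+")}),
)
# A flawed prediction that skipped the multiplication step entirely.
pred = ReasoningGraph(
    nodes=frozenset({"3", "2", "4", "+"}),
    edges=frozenset({("3", "+"), ("2", "+")}),
)

print(exact_match("11", "11"))        # 1.0: the answer alone looks perfect
print(round(edge_f1(pred, gold), 3))  # 0.333: the reasoning graph reveals the gap
```

The example illustrates the abstract's central point: a model can be scored on the answer alone (exact match of 1.0 here) while its reasoning process is largely wrong, which is exactly what a reasoning-graph metric is meant to expose.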
