Tiered Reasoning for Intuitive Physics: Toward Verifiable Commonsense Language Understanding

Large-scale, pre-trained language models (LMs) have achieved human-level performance on a breadth of language understanding tasks. However, evaluations based only on end-task performance shed little light on machines' true ability in language understanding and reasoning. In this paper, we highlight the importance of evaluating the underlying reasoning process in addition to end-task performance. Toward this goal, we introduce Tiered Reasoning for Intuitive Physics (TRIP), a novel commonsense reasoning dataset with dense annotations that enable multi-tiered evaluation of machines' reasoning process. Our empirical results show that while large LMs can achieve high end-task performance, they struggle to support their predictions with valid supporting evidence. The TRIP dataset and our baseline results will motivate verifiable evaluation of commonsense reasoning and facilitate future research toward developing better language understanding and reasoning models.
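To make the notion of "multi-tiered evaluation" concrete, the following is a minimal sketch (not the authors' released code) of how an end-task prediction might only be credited when the model also supplies correct supporting evidence. The record structure, field names, and example values are hypothetical illustrations of the idea, assuming a two-story plausibility task where the model cites sentence indices as evidence.

```python
# Minimal sketch of a tiered evaluation in the spirit of TRIP:
# an end-task prediction only counts as "verifiable" when the cited
# supporting evidence also matches the gold annotation.
# Field names and example records below are hypothetical.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Prediction:
    plausible_story: int            # which of the two stories the model judged plausible
    evidence_sentences: List[int]   # sentence indices cited as supporting evidence

@dataclass
class GoldLabel:
    plausible_story: int
    evidence_sentences: List[int]

def tiered_scores(preds: List[Prediction], golds: List[GoldLabel]) -> Tuple[float, float]:
    """Return (end-task accuracy, verifiable accuracy).

    Verifiable accuracy only credits examples where the end-task answer is
    correct AND the cited evidence matches the gold evidence exactly.
    """
    correct = verifiable = 0
    for p, g in zip(preds, golds):
        if p.plausible_story == g.plausible_story:
            correct += 1
            if set(p.evidence_sentences) == set(g.evidence_sentences):
                verifiable += 1
    n = len(golds)
    return correct / n, verifiable / n

# Toy example: high end-task accuracy but lower verifiable accuracy,
# mirroring the gap the abstract describes for large LMs.
preds = [Prediction(0, [2]), Prediction(1, [0, 3]), Prediction(1, [1])]
golds = [GoldLabel(0, [2]), GoldLabel(1, [1, 3]), GoldLabel(0, [1])]
print(tiered_scores(preds, golds))  # ~ (0.67, 0.33)
```

The design point is simply that the two numbers can diverge sharply: a model may pick the right answer while pointing to the wrong evidence, and only a tiered metric exposes that gap.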
