Semantics Altering Modifications for Evaluating Comprehension in Machine Reading

Advances in NLP have yielded impressive results for the task of machine reading comprehension (MRC), with approaches reported to achieve performance comparable to that of humans. In this paper, we investigate whether state-of-the-art MRC models are able to correctly process Semantics Altering Modifications (SAM): linguistically motivated phenomena that alter the semantics of a sentence while preserving most of its lexical surface form. We present a method to automatically generate and align challenge sets featuring original and altered examples. We further propose a novel evaluation methodology to assess the capability of MRC systems to process these examples independently of the data they were optimised on, by discounting effects introduced by domain shift. In a large-scale empirical study, we apply this methodology to evaluate extractive MRC models on their capability to correctly process SAM-enriched data. Covering 12 state-of-the-art neural architecture configurations and four training datasets, we find that -- despite their otherwise remarkable performance -- optimised models consistently struggle to correctly process semantically altered data.
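To make the idea concrete, the sketch below is illustrative only: the function names, the naive verb-insertion heuristic, and the aligned-pair format are our own assumptions rather than the paper's actual generation pipeline. It shows how a SAM built around an implicative verb such as "failed to" changes what a passage entails while leaving most of its lexical surface form intact, and how the original and altered examples can be kept aligned for side-by-side evaluation.

```python
def apply_sam(sentence: str, cue: str = "failed to") -> str:
    """Insert a semantics-altering cue before the main verb of a simple sentence."""
    tokens = sentence.split()
    # Hypothetical heuristic: assume a "Subject Verb ..." sentence and treat the
    # second token as the main verb; a real generator would rely on parsing.
    subject, main_verb, rest = tokens[0], tokens[1], tokens[2:]
    lemma = main_verb[:-1] if main_verb.endswith("s") else main_verb  # crude lemmatisation
    return " ".join([subject, cue, lemma, *rest])


def build_aligned_pair(passage: str, question: str, answer: str) -> dict:
    """Pair an original example with its semantics-altered counterpart."""
    return {
        "question": question,
        "original": {"passage": passage, "answer": answer},
        # After the modification the stated fact no longer holds, so the original
        # answer span is no longer supported by the altered passage.
        "altered": {"passage": apply_sam(passage), "answer": None},
    }


if __name__ == "__main__":
    pair = build_aligned_pair(
        passage="Kim delivers the report on Friday.",
        question="When does Kim deliver the report?",
        answer="on Friday",
    )
    print(pair["altered"]["passage"])
    # -> Kim failed to deliver the report on Friday.
```

Because the two passages differ only in the inserted cue, comparing a model's predictions on the aligned pair isolates its sensitivity to the semantic change from effects of lexical or domain shift.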
