Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
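
Concretely, each heuristic licenses entailment from a purely syntactic precondition: lexical overlap assumes the premise entails any hypothesis built entirely from words in the premise, subsequence assumes it entails any contiguous subsequence of itself, and constituent assumes it entails any complete subtree of its parse. The following is a minimal Python sketch of the three heuristics as stand-alone predicates over tokenized sentences; the function names, the whitespace tokenization, and the precomputed constituent set are illustrative assumptions, not part of the HANS release.

```python
from typing import List, Set, Tuple

def lexical_overlap(premise: List[str], hypothesis: List[str]) -> bool:
    """Lexical overlap heuristic: assume the premise entails any
    hypothesis built only from words that appear in the premise."""
    return set(hypothesis) <= set(premise)

def subsequence(premise: List[str], hypothesis: List[str]) -> bool:
    """Subsequence heuristic: assume the premise entails any
    contiguous subsequence of itself."""
    m = len(hypothesis)
    return any(premise[i:i + m] == hypothesis
               for i in range(len(premise) - m + 1))

def constituent(constituents: Set[Tuple[str, ...]],
                hypothesis: List[str]) -> bool:
    """Constituent heuristic: assume the premise entails any complete
    subtree of its parse; `constituents` is assumed to be the spans
    produced by an external constituency parser."""
    return tuple(hypothesis) in constituents

# A HANS-style counterexample: both preconditions hold even though
# "the doctor near the actor danced" does NOT entail "the actor danced"
# (it is the doctor who danced).
premise = "the doctor near the actor danced".split()
hypothesis = "the actor danced".split()
print(lexical_overlap(premise, hypothesis))  # True
print(subsequence(premise, hypothesis))      # True, yet not entailed

# Constituent heuristic: "the doctor slept" is a complete clause inside
# "if the doctor slept, the actor danced", yet it is not entailed.
# (Toy constituent set standing in for real parser output.)
constituents = {("the", "doctor", "slept"), ("the", "actor", "danced")}
print(constituent(constituents, "the doctor slept".split()))  # True
```

Note the nesting: a constituent is always a contiguous subsequence, and a subsequence always has full lexical overlap, so diagnosing the three separately helps localize which shortcut a model has learned.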
