Posing Fair Generalization Tasks for Natural Language Inference

Deep learning models for semantics are generally evaluated on naturalistic corpora. Adversarial testing methods, in which models are evaluated on new examples with known semantic properties, have begun to reveal that good performance on these naturalistic tasks can hide serious shortcomings. However, we should insist that these evaluations be fair: the models must be given data sufficient to support the requisite kinds of generalization. In this paper, we define and motivate a formal notion of fairness in this sense. We then apply these ideas to natural language inference by constructing very challenging but provably fair artificial datasets and showing that standard neural models fail to generalize in the required ways; only task-specific models that jointly compose the premise and hypothesis achieve high performance, and even these models do not solve the task perfectly.
