UnNatural Language Inference

Recent investigations into the inner workings of state-of-the-art large-scale pre-trained Transformer-based Natural Language Understanding (NLU) models indicate that they appear to understand human-like syntax, at least to some extent. We provide novel evidence that complicates this claim: we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the originals, i.e., they are largely invariant to random word-order permutations. This behavior notably differs from that of humans, who struggle to understand the meaning of ungrammatical sentences. To measure the severity of this issue, we propose a suite of metrics and investigate which properties of particular permutations lead models to be word-order invariant. For example, in the MNLI dataset we find that almost all (98.7%) examples contain at least one permutation that elicits the gold label. Models are even able to assign gold labels to permutations of examples that they originally failed to predict correctly. We provide a comprehensive empirical evaluation of this phenomenon, and further show that this issue exists in pre-Transformer RNN- and ConvNet-based encoders, as well as across multiple languages (English and Chinese). Our code and data are available at https://github.com/facebookresearch/unlu.

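To make the probing setup concrete, below is a minimal sketch (not the paper's released code) of how one might test a pretrained NLI model for word-order invariance: permute the words of a premise/hypothesis pair and check whether the predicted label survives. The model name (`roberta-large-mnli`), the number of permutations, and the example sentences are illustrative assumptions.

```python
# Minimal sketch of a word-order-invariance probe for an NLI model.
# Not the paper's official implementation; model name and settings are assumptions.
import random

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed off-the-shelf MNLI classifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()


def predict_label(premise: str, hypothesis: str) -> int:
    """Return the argmax label id for a premise/hypothesis pair."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))


def permute_words(sentence: str, seed: int) -> str:
    """Return a random word-order permutation of the sentence."""
    words = sentence.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


# Hypothetical example pair, for illustration only.
premise = "A man is playing a guitar on stage."
hypothesis = "A musician is performing."

original = predict_label(premise, hypothesis)
# Count how many of k random permutations still receive the original label.
k = 10
matches = sum(
    predict_label(permute_words(premise, s), permute_words(hypothesis, s)) == original
    for s in range(k)
)
print(f"{matches}/{k} permuted examples keep the original label")
```

If the model were sensitive to word order in the way humans are, most permuted pairs would no longer receive the original label; the paper's finding is that, in practice, a large fraction of permutations still do.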