UnNatural Language Inference

Natural Language Understanding has witnessed a watershed moment with the introduction of large pre-trained Transformer networks. These models achieve state-of-the-art results on various tasks, notably including Natural Language Inference (NLI). Many studies have shown that the large representation space learned by these models encodes some syntactic and semantic information. However, to really “know syntax”, a model must recognize when its input violates syntactic rules and adjust its inferences accordingly. In this work, we find that state-of-the-art NLI models, such as RoBERTa and BART, are invariant to, and sometimes even perform better on, examples with randomly reordered words. With iterative search, we are able to construct randomized versions of NLI test sets, which contain permuted hypothesis-premise pairs with the same words as the original, yet are classified with perfect accuracy by large pre-trained models, as well as by pre-Transformer state-of-the-art encoders. We find the issue to be language- and model-invariant, and hence investigate its root cause. To partially alleviate this effect, we propose a simple training methodology. Our findings call into question the idea that our natural language understanding models, and the tasks used for measuring their progress, genuinely require a human-like understanding of syntax.
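To make the construction concrete, below is a minimal, hypothetical sketch (not the paper's released code) of the permutation search described above: given a premise-hypothesis pair, its gold label, and any NLI classifier wrapped as a `predict(premise, hypothesis) -> label` function, it repeatedly shuffles the words of both sentences and keeps a permutation that the model still labels correctly. The `Predictor` wrapper, the `max_tries` budget, and the toy predictor at the end are illustrative assumptions, not details taken from the paper.

```python
import random
from typing import Callable, Optional, Tuple

# Hypothetical predictor interface: maps a (premise, hypothesis) pair to an
# NLI label such as "entailment", "neutral", or "contradiction". Any
# fine-tuned NLI classifier (e.g. RoBERTa- or BART-based) could be wrapped
# to match this signature.
Predictor = Callable[[str, str], str]


def permute_words(sentence: str, rng: random.Random) -> str:
    """Return the sentence with its words in a random order (same words)."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)


def find_accepted_permutation(
    premise: str,
    hypothesis: str,
    gold_label: str,
    predict: Predictor,
    max_tries: int = 100,
    seed: int = 0,
) -> Optional[Tuple[str, str]]:
    """Search for a word-order permutation of the pair that the model still
    assigns the gold label, mirroring the iterative construction of a
    permuted test set sketched in the abstract."""
    rng = random.Random(seed)
    for _ in range(max_tries):
        p_perm = permute_words(premise, rng)
        h_perm = permute_words(hypothesis, rng)
        if predict(p_perm, h_perm) == gold_label:
            return p_perm, h_perm
    return None  # no accepted permutation found within the budget


if __name__ == "__main__":
    # Toy predictor for illustration only; a real experiment would call an
    # actual NLI model here.
    toy_predict = lambda premise, hypothesis: "entailment"
    print(find_accepted_permutation(
        "A man is playing a guitar on stage.",
        "A man is performing music.",
        "entailment",
        toy_predict,
    ))
```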
