Exploring BERT's Sensitivity to Lexical Cues using Tests from Semantic Priming

Models trained to estimate word probabilities in context have become ubiquitous in natural language processing. How do these models use lexical cues in context to inform their word probabilities? To answer this question, we present a case study analyzing the pre-trained BERT model with tests informed by semantic priming. Using English lexical stimuli that show priming in humans, we find that BERT too shows "priming," predicting a word with greater probability when the context includes a related word versus an unrelated one. This effect decreases as the amount of information provided by the context increases. Follow-up analysis shows BERT to be increasingly distracted by related prime words as context becomes more informative, assigning lower probabilities to related words. Our findings highlight the importance of considering contextual constraint effects when studying word prediction in these models, and suggest possible parallels with human processing.

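To make the measurement described above concrete, the sketch below shows one way to compare BERT's probability for a target word when the context contains a related versus an unrelated word, using the HuggingFace Transformers masked language modeling interface. This is an illustrative reconstruction, not the authors' released code: the model name ("bert-base-uncased"), the helper `target_probability`, and the example sentences and word pair (nurse/lawyer priming "doctor") are assumptions for demonstration, not the paper's actual stimuli.

```python
# Minimal sketch: probability of a target word at a [MASK] position,
# compared across a related-prime and an unrelated-prime context.
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def target_probability(context: str, target: str) -> float:
    """Return BERT's probability for `target` at the [MASK] position in `context`."""
    inputs = tokenizer(context, return_tensors="pt")
    # Locate the masked position in the tokenized input.
    mask_index = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    # Softmax over the vocabulary at the masked position.
    probs = torch.softmax(logits[0, mask_index], dim=-1)
    target_id = tokenizer.convert_tokens_to_ids(target)
    return probs[0, target_id].item()

# Illustrative contexts (placeholders, not the study's stimuli):
related   = "She said the nurse was actually a [MASK]."
unrelated = "She said the lawyer was actually a [MASK]."
print(target_probability(related, "doctor"), target_probability(unrelated, "doctor"))
```

Under the "priming" pattern described in the abstract, the first call would be expected to return a higher probability than the second; the paper's actual analysis runs such comparisons over human priming stimuli and varies how constraining the context is.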