Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words

How does the input segmentation of pretrained language models (PLMs) affect their interpretations of complex words? We present the first study investigating this question, taking BERT as the example PLM and focusing on its semantic representations of English derivatives. We show that PLMs can be interpreted as serial dual-route models, i.e., the meanings of complex words are either stored or else need to be computed from the subwords, which implies that maximally meaningful input tokens should allow for the best generalization on new words. This hypothesis is confirmed by a series of semantic probing tasks on which DelBERT (Derivation leveraging BERT), a model with derivational input segmentation, substantially outperforms BERT with WordPiece segmentation. Our results suggest that the generalization capabilities of PLMs could be further improved if a morphologically informed vocabulary of input tokens were used.
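To make the contrast in segmentation concrete, the following minimal Python sketch illustrates what a derivational input segmentation looks like. It is not the paper's actual DelBERT code; the tiny affix inventory and the derivational_segment helper are purely illustrative. The point is that such a segmenter splits a derivative like "superbizarre" at its morpheme boundaries ("super" + "bizarre"), whereas a frequency-driven WordPiece vocabulary is free to cut across those boundaries and, as the title suggests, may strand a misleading piece such as "superb".

```python
# Minimal sketch (not the paper's DelBERT implementation): derivational input
# segmentation of an English derivative into prefix(es) + base + suffix(es),
# using a small, hypothetical affix inventory.

PREFIXES = ("super", "un", "re", "over")   # hypothetical, tiny inventory
SUFFIXES = ("ness", "ize", "able", "ity")  # hypothetical, tiny inventory

def derivational_segment(word: str) -> list:
    """Greedily strip known prefixes and suffixes; keep the remainder as the base."""
    prefixes, suffixes = [], []
    stripped = True
    while stripped:                        # peel prefixes from the left
        stripped = False
        for p in PREFIXES:
            if word.startswith(p) and len(word) > len(p) + 2:
                prefixes.append(p)
                word = word[len(p):]
                stripped = True
                break
    stripped = True
    while stripped:                        # peel suffixes from the right
        stripped = False
        for s in SUFFIXES:
            if word.endswith(s) and len(word) > len(s) + 2:
                suffixes.append(s)
                word = word[:-len(s)]
                stripped = True
                break
    return prefixes + [word] + list(reversed(suffixes))

if __name__ == "__main__":
    print(derivational_segment("superbizarre"))  # ['super', 'bizarre']
    print(derivational_segment("unhappiness"))   # ['un', 'happi', 'ness']
    # A real derivational segmenter would also undo orthographic changes
    # (e.g. map 'happi' back to 'happy') and draw on a full affix lexicon.
```

Feeding such morpheme-aligned tokens to the model is the design choice the abstract describes: input units that are themselves maximally meaningful, so the meaning of an unseen derivative can be computed from its parts.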
