Do Word Embeddings Capture Spelling Variation?

Analyses of word embeddings have primarily focused on semantic and syntactic properties. However, word embeddings have the potential to encode other properties as well. In this paper, we propose a new perspective on the analysis of word embeddings by focusing on spelling variation. In social media, spelling variation is abundant and often socially meaningful. Here, we analyze word embeddings trained on Twitter and Reddit data. We present three analyses using pairs of word forms covering seven types of spelling variation in English. Taken together, our results show that word embeddings encode spelling variation patterns of various types to some extent, even embeddings trained with the skipgram model, which does not take spelling into account. Our results also suggest a link between the intentionality of the variation and the distance of the non-conventional spellings to their conventional spellings.
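To make the pair-based analysis concrete, the sketch below (not code from the paper) shows how one could read off cosine similarities between conventional and non-conventional spellings from a trained skipgram model using gensim. The file path and the word pairs are illustrative assumptions, not the paper's data.

```python
# Minimal sketch: probing how close non-conventional spellings sit to their
# conventional counterparts in a trained embedding space.
# Assumes skipgram vectors trained on social media text are available in
# word2vec text format; the path and word pairs below are illustrative.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("twitter_skipgram.vec", binary=False)  # hypothetical file

# Illustrative (conventional, non-conventional) spelling pairs.
pairs = [
    ("tonight", "tonite"),  # phonetic respelling
    ("going", "goin"),      # g-dropping
    ("cool", "coool"),      # lengthening
    ("you", "u"),           # abbreviation
]

for conventional, variant in pairs:
    if conventional in vectors and variant in vectors:
        sim = vectors.similarity(conventional, variant)  # cosine similarity
        print(f"{conventional:>8} ~ {variant:<8} cosine = {sim:.3f}")
    else:
        print(f"{conventional:>8} ~ {variant:<8} out of vocabulary")
```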
