Towards Augmenting Lexical Resources for Slang and African American English

Researchers in natural language processing have developed large, robust resources for understanding formal Standard American English (SAE), but comparable resources are lacking for other varieties of English, such as slang and African American English (AAE). In this work, we use word embeddings and clustering algorithms to group semantically similar words in three datasets, two of which contain a high incidence of slang and AAE. Because high-quality clusters contain related words, the meaning of an unfamiliar word can be inferred from the meanings of the words clustered with it. After clustering, we compute precision and recall scores using WordNet and ConceptNet as gold standards and show that these scores are of limited value when the gold-standard resources do not fully represent slang and AAE. Amazon Mechanical Turk and expert evaluations show that clusters with low precision can still be considered high quality, and we propose a new metric, the Cluster Split Score, for evaluating machine-generated clusters. These contributions highlight the gap in natural language processing research on varieties of English and motivate further work to close it.
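
To make the evaluation issue concrete, the following is a minimal sketch of how precision and recall might be computed for a set of word clusters against a gold-standard resource. It assumes a pairwise formulation (a predicted pair counts as correct only if the gold standard also lists the two words as related), which may differ from the scoring actually used in this work. The function names, the toy clusters, and the toy gold pairs are all hypothetical and chosen only for illustration.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered word pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each pair has one canonical (alphabetical) ordering.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(clusters, gold_pairs):
    """Score clusters against gold-standard related-word pairs.

    Assumption: a pairwise definition of precision and recall,
    not necessarily the one used in the paper.
    """
    predicted = cluster_pairs(clusters)
    gold = {tuple(sorted(pair)) for pair in gold_pairs}
    true_positives = predicted & gold
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: the clusters mix standard and slang terms,
# while the gold standard only covers the standard vocabulary.
clusters = [["happy", "glad", "lit"], ["car", "whip"]]
gold_pairs = [("happy", "glad"), ("car", "automobile")]

precision, recall = pairwise_precision_recall(clusters, gold_pairs)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case, three of the four predicted pairs involve a slang term that the gold standard does not cover, so they count against precision even if a human judge would consider the clusters coherent.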
