Characterizing English Variation across Social Media Communities with BERT

Abstract

Much previous work characterizing language variation across Internet social groups has focused on the types of words used by these groups. We extend this type of study by employing BERT to characterize variation in the senses of words as well, analyzing two months of English comments in 474 Reddit communities. The specificity of different sense clusters to a community, combined with the specificity of a community’s unique word types, is used to identify cases where a social group’s language deviates from the norm. We validate our metrics using user-created glossaries and draw on sociolinguistic theories to connect language variation with trends in community behavior. We find that communities with highly distinctive language are medium-sized, and their loyal and highly engaged users interact in dense networks.
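As a rough illustration of what "specificity to a community" could mean, the sketch below scores each word in each community with normalized pointwise mutual information (NPMI). The abstract does not name the specificity measure, so the choice of NPMI, the function name `npmi_specificity`, and the toy counts are all assumptions made for illustration. The same scoring could be applied to sense clusters by counting (word, cluster id) pairs in place of word types.

```python
import math
from collections import Counter


def npmi_specificity(counts_by_community):
    """Score every (community, item) pair with NPMI, which lies in [-1, 1].

    Assumption: specificity is measured as NPMI between an item and a
    community; the abstract does not state which measure is used.
    """
    total = sum(sum(c.values()) for c in counts_by_community.values())
    item_totals = Counter()
    for counts in counts_by_community.values():
        item_totals.update(counts)

    scores = {}
    for community, counts in counts_by_community.items():
        p_comm = sum(counts.values()) / total
        for item, n in counts.items():
            p_joint = n / total
            p_item = item_totals[item] / total
            pmi = math.log(p_joint / (p_item * p_comm))
            denom = -math.log(p_joint)
            # NPMI is undefined when p_joint == 1; report 0.0 in that case.
            scores[(community, item)] = pmi / denom if denom > 0 else 0.0
    return scores


# Hypothetical toy counts: "hodl" occurs only in community "a".
toy_counts = {
    "a": Counter({"the": 50, "hodl": 10}),
    "b": Counter({"the": 50}),
}
toy_scores = npmi_specificity(toy_counts)
```

In this toy setup, a word that appears in only one community scores higher there than a word shared evenly across communities, which is the behaviour a specificity measure needs.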
