Freshman or Fresher? Quantifying the Geographic Variation of Language in Online Social Media

In this paper we present a new computational technique to detect and analyze statistically significant geographic variation in language. While previous approaches have primarily focused on lexical variation between regions, our method identifies words that demonstrate semantic and syntactic variation as well. We extend recently developed techniques for neural language models to learn word representations which capture differing semantics across geographical regions. In order to quantify this variation and ensure robust detection of true regional differences, we formulate a null model to determine whether observed changes are statistically significant. Our method is the first such approach to explicitly account for random variation due to chance while detecting regional variation in word meaning. To validate our model, we study and analyze two different massive online data sets: millions of tweets from Twitter as well as millions of phrases contained in the Google Book Ngrams. Our analysis reveals interesting facets of language change across countries.
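The idea of testing observed regional differences against a null model can be illustrated with a small sketch. The abstract does not say how the null model is constructed, so the code below assumes one common choice, a label-permutation test. It also assumes that each usage of a word is already summarized as a numeric vector tagged with its region. The function names, the cosine-distance statistic, and the toy setup are all hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def mean_vector(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat degenerate vectors as "no difference"
    return 1.0 - dot / (norm_a * norm_b)


def permutation_test(
    region_a: List[Vector],
    region_b: List[Vector],
    n_permutations: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (observed distance, p-value) for one word across two regions.

    Null hypothesis (assumed here): region labels carry no information
    about how the word is used, so shuffling the labels should produce
    distances as large as the observed one reasonably often.
    """
    rng = random.Random(seed)
    observed = cosine_distance(mean_vector(region_a), mean_vector(region_b))

    pooled = list(region_a) + list(region_b)
    n_a = len(region_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign usages to regions at random
        null_stat = cosine_distance(
            mean_vector(pooled[:n_a]), mean_vector(pooled[n_a:])
        )
        if null_stat >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value
```

A word would be flagged as regionally distinct only when the observed distance is large and the p-value is small. Looking at both guards against reporting differences that shuffled labels reproduce by chance.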
