Computational Sociolinguistics: A Survey

Language is a social phenomenon, and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article, we present a survey of the emerging field of “computational sociolinguistics” that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction, and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions used in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.


[292]  Richard Scheines,et al.  Discovering Causal Structure: Artificial Intelligence, Philosophy of Science, and Statistical Modeling , 1987 .

[293]  M. González Politeness: some universals in language usage , 1995 .

[294]  D. Sankoff,et al.  The social correlates and linguistic processes of lexical borrowing and assimilation , 1988 .

[295]  Jatin Sharma,et al.  POS Tagging of English-Hindi Code-Mixed Social Media Content , 2014, EMNLP.

[296]  Wei Li,et al.  The "why" and "how" questions in the analysis of conversational code-switching , 1998 .

[297]  David Bamman,et al.  Distributed Representations of Geographically Situated Language , 2014, ACL.

[298]  Shuly Wintner,et al.  Formal Language Theory for Natural Language Processing , 2002, ACL 2002.

[299]  Jure Leskovec,et al.  Exploiting Social Network Structure for Person-to-Person Sentiment Analysis , 2014, TACL.

[300]  Hui Wang,et al.  A Motif Approach for Identifying Pursuits of Power in Social Discourse , 2012, 2012 IEEE Sixth International Conference on Semantic Computing.

[301]  Dong Nguyen,et al.  "How Old Do You Think I Am?" A Study of Language and Age in Twitter , 2013, ICWSM.

[302]  Mark S. Granovetter The Strength of Weak Ties , 1973, American Journal of Sociology.

[303]  Bill Broyles Notes , 1907, The Classical Review.

[304]  Owen Rambow,et al.  Who’s (Really) the Boss? Perception of Situational Power in Written Interactions , 2012, COLING.

[305]  P. Eckert Age as a Sociolinguistic Variable , 2017 .

[306]  Shlomo Argamon,et al.  Automatically Categorizing Written Texts by Author Gender , 2002, Lit. Linguistic Comput..

[307]  Clayton Fink,et al.  Inferring Gender from the Content of Tweets: A Region Specific Example , 2012, ICWSM.

[308]  Matthew Rowe,et al.  Towards Modelling Language Innovation Acceptance in Online Social Networks , 2016, WSDM.

[309]  Dorée D. Seligmann,et al.  Who Had the Upper Hand? Ranking Participants of Interactions Based on Their Relative Power , 2013, IJCNLP.

[310]  W. Labov Principles Of Linguistic Change , 1994 .

[311]  James W. Pennebaker,et al.  Linguistic Inquiry and Word Count (LIWC2007) , 2007 .

[312]  Blaise Cronin,et al.  Disciplinary Discourses: Social Interactions in Academic Writing , 2002, J. Documentation.

[313]  Gillian Sankoff,et al.  Age: Apparent Time and Real Time , 2006 .

[314]  Alice H. Oh,et al.  Sociolinguistic analysis of Twitter in multilingual societies , 2014, HT.

[315]  Mona T. Diab,et al.  Simplified guidelines for the creation of Large Scale Dialectal Arabic Annotations , 2012, LREC.

[316]  Iris K. Howley,et al.  Linguistic Analysis Methods for Studying Small Groups , 2013 .

[317]  J. Milroy,et al.  Linguistic change, social network and speaker innovation , 1985, Journal of Linguistics.

[318]  Sara Rosenthal,et al.  Age Prediction in Blogs: A Study of Style, Content, and Online Behavior in Pre- and Post-Social Media Generations , 2011, ACL.

[319]  Ursula Faber,et al.  Sequence Organization In Interaction A Primer In Conversation Analysis , 2016 .

[320]  Ying Li,et al.  Code-Switch Language Model with Inversion Constraints for Mixed Language Speech Recognition , 2012, COLING.

[321]  Julia Hirschberg,et al.  Overview for the First Shared Task on Language Identification in Code-Switched Data , 2014, CodeSwitch@EMNLP.

[322]  Mari Ostendorf,et al.  A Quantitative Analysis of Lexical Differences Between Genders in Telephone Conversations , 2005, ACL.

[323]  Owen Rambow,et al.  Staying on Topic: An Indicator of Power in Political Debates , 2014, EMNLP.

[324]  Brendan T. O'Connor,et al.  A Latent Variable Model for Geographic Lexical Variation , 2010, EMNLP.

[325]  Benno Stein,et al.  Overview of the 2 nd Author Profiling Task at PAN 2014 , 2014 .

[326]  Anthony J. G. Hey,et al.  The Fourth Paradigm: Data-Intensive Scientific Discovery [Point of View] , 2011 .

[327]  Lars Hinrichs,et al.  Codeswitching on the web , 2006 .

[328]  David A. Huffaker,et al.  Dimensions of leadership and social influence in online communities , 2010 .

[329]  Jonathan T. Morgan,et al.  Annotating Social Acts: Authority Claims and Alignment Moves in Wikipedia Talk Pages , 2011 .

[330]  Oliver Ferschke,et al.  Behind the Article: Recognizing Dialog Acts in Wikipedia Talk Pages , 2012, EACL.

[331]  Jacob Eisenstein,et al.  Phonological Factors in Social Media Writing , 2013 .

[332]  Peter Auer,et al.  Language and space : an international handbook of linguistic variation , 2009 .

[333]  Ryan Cotterell,et al.  An Algerian Arabic-French Code-Switched Corpus , 2014 .

[334]  A. D. Shveĭt︠s︡er,et al.  Introduction to sociolinguistics , 1986 .