Dude, srsly?: The Surprisingly Formal Nature of Twitter's Language

Twitter has become the de facto information-sharing and communication platform. Given the factors that influence language on Twitter (size limitation as well as communication and content-sharing mechanisms), there is a continuing debate about where Twitter’s language sits in the spectrum of language on various established mediums. These include SMS and chat on the one hand (size limitations) and email (communication), blogs and newspapers (content sharing) on the other. To help settle this question, we propose a computational framework that offers insights into the linguistic style of all these mediums. Our framework consists of two parts. The first part builds upon a set of linguistic features to quantify the language of a given medium. The second part introduces a flexible factorization framework, SOCLIN, which conducts a psycholinguistic analysis of a given medium with the help of an external cognitive and affective knowledge base. Applying this analytical framework to corpora from several major mediums, we gather statistics that let us compare the language of Twitter with that of the other mediums in a quantitative comparative study. We present several key insights: (1) Twitter’s language is surprisingly more conservative and less informal than that of SMS and online chat; (2) Twitter users appear to be developing linguistically unique styles; (3) Twitter’s usage of temporal references is similar to that of SMS and chat; and (4) Twitter has less variation in affect than other, more formal mediums. The language of Twitter can thus be seen as a projection of a more formal register into a size-restricted space.
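As a rough illustration of the first part of the framework (quantifying the language of a medium with linguistic features), the sketch below computes a few per-token rates over a list of messages. It is a minimal sketch under stated assumptions, not the feature set used in the paper: the three word lists, the tokenizer, and the function name `style_profile` are all hypothetical and chosen only to make the idea concrete.

```python
import re
from collections import Counter

# Hypothetical lexicons, chosen only for illustration; the paper's actual
# feature set is not specified in the abstract.
FIRST_PERSON = {"i", "me", "my", "mine", "we", "us", "our", "ours"}
INTENSIFIERS = {"very", "really", "so", "totally", "pretty"}
TEMPORAL = {"now", "today", "tonight", "tomorrow", "yesterday", "soon", "later"}

# Assumed tokenizer: lowercase runs of letters and apostrophes.
TOKEN_RE = re.compile(r"[a-z']+")


def style_profile(messages):
    """Return per-token rates of a few illustrative style features."""
    tokens = [t for m in messages for t in TOKEN_RE.findall(m.lower())]
    total = len(tokens)
    if total == 0:
        return {
            "first_person": 0.0,
            "intensifier": 0.0,
            "temporal": 0.0,
            "mean_word_length": 0.0,
        }

    counts = Counter(tokens)

    def rate(lexicon):
        # Fraction of all tokens that belong to the given lexicon.
        return sum(counts[w] for w in lexicon) / total

    return {
        "first_person": rate(FIRST_PERSON),
        "intensifier": rate(INTENSIFIERS),
        "temporal": rate(TEMPORAL),
        "mean_word_length": sum(len(t) for t in tokens) / total,
    }


if __name__ == "__main__":
    sample = ["I am so tired today", "see you tomorrow"]
    print(style_profile(sample))
```

Computing the same profile for a corpus from each medium (tweets, SMS, chat, email, blogs, newspapers) would give comparable vectors, which is the kind of side-by-side statistic the comparative study relies on.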
