Challenges of studying and processing dialects in social media

Dialect features typically do not make it into formal writing, but flourish in social media. This enables large-scale variational studies. We focus on three phonological features of African American Vernacular English (AAVE) and their manifestation as spelling variations on Twitter. We discuss to what extent our data can be used to falsify eight sociolinguistic hypotheses. To go beyond the spelling level, we require automatic analysis such as POS tagging, but social media language still challenges language technologies. We show that both newswire- and Twitter-adapted state-of-the-art POS taggers perform significantly worse on AAVE tweets, suggesting that large-scale dialect studies of language variation beyond the surface level are not feasible with out-of-the-box NLP tools.
