Visualizing Regional Language Variation Across Europe on Twitter

Geotagged Twitter data allows us to investigate how language use correlates with geography, both across languages (interlingual) and within a single language (intralingual). Building on data-driven studies of such relationships, this paper investigates regional variation in language usage on Twitter across Europe and compares it to traditional research on regional variation. It presents a novel method for processing large amounts of data and for capturing gradual differences in language variation. Visualizing the results by deterministically translating linguistic features into color hues offers a novel view of language variation across Europe as it is reflected on Twitter. The technique is easy to apply to large amounts of data and provides a fast visual reference that can serve as input for further qualitative studies. Its general applicability is demonstrated in a number of studies, both across and within national languages. The paper also discusses the unique challenges of large-scale analysis and visualization and the complementary nature of traditional qualitative and data-driven quantitative methods, and argues for their possible synthesis.
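The idea of deterministically translating linguistic features into colors can be illustrated with a minimal sketch. The abstract does not specify the mapping, so everything below is an assumption made for illustration: each region is taken to have a three-dimensional feature vector, each dimension is min-max scaled across regions, and the three scaled values are read as red, green, and blue channels. The function names `features_to_rgb` and `to_hex` and the sample values are hypothetical and not taken from the paper.

```python
def features_to_rgb(vectors):
    """Map 3-dimensional feature vectors to RGB tuples (illustrative assumption).

    Each dimension is min-max scaled across all regions to the range 0..255.
    The same input always yields the same colors, so the mapping is deterministic.
    """
    if not vectors:
        return []
    dims = list(zip(*vectors))  # regroup values by dimension
    if len(dims) != 3:
        raise ValueError("expected 3-dimensional feature vectors")
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]

    colors = []
    for vec in vectors:
        channels = []
        for value, low, high in zip(vec, lows, highs):
            span = high - low
            # A dimension with no variation is mapped to the middle of the range.
            scaled = 0.5 if span == 0 else (value - low) / span
            channels.append(round(scaled * 255))
        colors.append(tuple(channels))
    return colors


def to_hex(rgb):
    """Format an (r, g, b) tuple as a hex color string such as '#0080ff'."""
    return "#{:02x}{:02x}{:02x}".format(*rgb)


# Hypothetical feature vectors for three regions (not real data).
regions = [(0.0, 1.0, 2.0), (1.0, 1.0, 0.0), (0.5, 1.0, 1.0)]
colors = features_to_rgb(regions)
hex_colors = [to_hex(c) for c in colors]
```

Under this assumed mapping, regions with similar feature vectors receive similar colors, so gradual linguistic differences appear as gradual color transitions on a map.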
