Capturing Regional Variation with Distributed Place Representations and Geographic Retrofitting

Dialects are one of the main drivers of language variation, which is a major challenge for natural language processing tools. In most languages, dialects exist along a continuum and are commonly discretized by combining the extent of several preselected linguistic variables. However, the selection of these variables is theory-driven and itself insensitive to change. We use Doc2Vec on a corpus of 16.8M anonymous online posts from the German-speaking area to learn continuous document representations of cities. These representations capture continuous regional linguistic distinctions and can serve as input to downstream NLP tasks that are sensitive to regional variation. By incorporating geographic information via retrofitting and agglomerative clustering with structure, we recover dialect areas at various levels of granularity. Evaluating these clusters against an existing dialect map, we achieve a match of up to 0.77 V-score (the harmonic mean of cluster completeness and homogeneity). Our results show that representation learning with retrofitting offers a robust, general method to automatically expose dialectal differences and regional variation at a finer granularity than was previously possible.
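To make the evaluation metric concrete, the following is a minimal sketch of a V-score computed as the harmonic mean of homogeneity and completeness, using the standard entropy-based definitions of those two quantities. It is not the authors' evaluation code; the function names and the toy labels (a reference dialect area and a predicted cluster for each of six cities) are assumptions made for illustration only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy H(labels), in nats."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """Conditional entropy H(target | given), in nats."""
    n = len(target)
    joint = Counter(zip(given, target))
    given_counts = Counter(given)
    return -sum(
        (c / n) * log(c / given_counts[g]) for (g, _), c in joint.items()
    )


def v_measure(gold, predicted):
    """Harmonic mean of homogeneity and completeness.

    Homogeneity is high when each predicted cluster contains cities from a
    single gold area; completeness is high when each gold area falls into
    a single predicted cluster.
    """
    h_gold = entropy(gold)
    h_pred = entropy(predicted)
    homogeneity = (
        1.0 if h_gold == 0 else 1.0 - conditional_entropy(gold, predicted) / h_gold
    )
    completeness = (
        1.0 if h_pred == 0 else 1.0 - conditional_entropy(predicted, gold) / h_pred
    )
    if homogeneity + completeness == 0:
        return 0.0
    return 2 * homogeneity * completeness / (homogeneity + completeness)


# Hypothetical toy data: one reference dialect area and one predicted
# cluster per city (six cities in total).
gold_areas = ["north", "north", "north", "south", "south", "south"]
clusters = [0, 0, 1, 1, 2, 2]
score = v_measure(gold_areas, clusters)
```

The score depends only on how the two partitions overlap, not on the label names, so a clustering that reproduces the reference areas under different cluster ids still scores 1.0, while a single cluster covering all cities scores 0.0 against any reference with more than one area.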
