TweetGeo - A Tool for Collecting, Processing and Analysing Geo-encoded Linguistic Data

In this paper we present a newly developed tool for researchers interested in the spatial variation of language. The tool lets them define a geographic perimeter of interest, collect tweets published within that perimeter through the Twitter streaming API, filter the collected data by language and country, define and extract variables of interest, and analyse the extracted variables with one spatial statistic and two spatial visualisations. We showcase the tool on the area of former Yugoslavia and a selection of the languages spoken there. By defining the perimeter, the languages and a series of linguistic variables of interest, we demonstrate the tool's data collection, processing and analysis capabilities.
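To make the perimeter and spatial-statistic steps concrete, the following is a minimal sketch and not the tool's actual code. It assumes the perimeter is a rectangular bounding box and that the spatial statistic is Moran's I with inverse-distance weights; the abstract names neither. The function names, the bounding-box coordinates and the use of plain Euclidean distance on longitude/latitude pairs are all illustrative assumptions.

```python
import math

# Rough, illustrative bounding box (west, south, east, north) in degrees.
# These coordinates are an assumption, not the perimeter used in the paper.
BBOX = (13.0, 40.8, 23.1, 46.9)


def in_perimeter(lon, lat, bbox=BBOX):
    """Return True if the point lies inside the rectangular perimeter."""
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north


def morans_i(points, values):
    """Moran's I with inverse-distance weights (w_ij = 1 / d_ij, w_ii = 0).

    points: list of (lon, lat) tuples; values: one variable value per point.
    Euclidean distance on raw coordinates is a simplification for the sketch.
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("variable is constant; Moran's I is undefined")

    num = 0.0    # weighted sum of cross-products of deviations
    w_sum = 0.0  # sum of all weights
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(points[i], points[j])
            if dist == 0:
                continue  # skip co-located points to avoid division by zero
            w = 1.0 / dist
            w_sum += w
            num += w * dev[i] * dev[j]
    if w_sum == 0:
        raise ValueError("no pair of points with non-zero distance")
    return (n / w_sum) * (num / denom)


# Usage: keep only points inside the perimeter, then compare two toy variables.
tweets = [(15.98, 45.81), (20.46, 44.82), (13.40, 52.52)]  # last one lies outside
kept = [p for p in tweets if in_perimeter(*p)]

line = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
clustered = morans_i(line, [1, 1, 0, 0])    # similar values sit next to each other
alternating = morans_i(line, [1, 0, 1, 0])  # neighbours always differ
```

In this toy setting the spatially clustered variable scores higher than the alternating one, which is the kind of contrast a spatial autocorrelation statistic is meant to surface when linguistic variants concentrate in particular regions.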
