Where in the World Are You? Geolocation and Language Identification in Twitter

The movement of ideas and content between locations and languages is a crucial concern for researchers of the information age, and Twitter has emerged as a central, global platform on which hundreds of millions of people share knowledge and information. A variety of research has attempted to harvest locational and linguistic metadata from tweets to answer important questions about the 300 million tweets that flow through the platform each day. However, much of this work is carried out with only a limited understanding of how best to work with the spatial and linguistic contexts in which the information was produced, and standard, well-accepted practices have yet to emerge. This article therefore studies the reliability of key methods used to determine the language and location of content in Twitter. It compares three automated language identification packages to Twitter's user interface language setting and to a human coding of languages in order to identify common sources of disagreement. The article also demonstrates that in many cases user-entered profile locations differ from the physical locations from which users are actually tweeting. Consequently, these open-ended, user-generated profile locations are not reliable proxies for the physical locations from which information is published to Twitter.
