Tracking Language Mobility in the Twitter Landscape

The unprecedented data explosion has drastically changed the data science landscape. At the same time, Big Data analytics has reshaped the design and implementation of the applications that analyse the data. In this paper, we explore the use of Big Data tools for extracting value from Twitter data. We acquire a large set of Twitter data (10 TB in size) and process it using Spark DataFrames. The purpose of our analytics pipeline is to study the mobility of languages as captured by the Twitter signal. We study the evolution of languages from both a temporal and a spatial perspective by applying density-based clustering and Self-Organising Map techniques. The analysis enabled the detection of tourism trends and real-world events, as perceived through the Twitter lens.
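
To make the density-based clustering step concrete, the following is a minimal, single-machine sketch of DBSCAN, one well-known density-based algorithm, applied to geotagged points. It is an illustration only: the abstract does not state which algorithm, distance measure, or parameters were used, and the coordinates, the 5 km radius, and the minimum of 3 points are hypothetical values chosen for the example. It also does not reflect the Spark-based implementation.

```python
from math import radians, sin, cos, asin, sqrt

NOISE = -1  # label for points that belong to no dense region


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))


def dbscan(points, eps_km, min_pts):
    """Return one cluster label per point (0, 1, ...), or NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        # Brute-force O(n) region query; includes the point itself.
        return [j for j in range(len(points))
                if haversine_km(points[i], points[j]) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


# Hypothetical geotagged tweets: three near Paris, three near Rome, one isolated.
POINTS = [
    (48.8566, 2.3522), (48.8600, 2.3400), (48.8500, 2.3600),
    (41.9028, 12.4964), (41.9000, 12.5000), (41.9100, 12.4900),
    (64.1466, -21.9426),
]
labels = dbscan(POINTS, eps_km=5.0, min_pts=3)
print(labels)
```

In this toy input the two city groups form separate clusters and the isolated point is marked as noise. Running the same procedure on the tweets of a single language, one time window at a time, is one way the spatial footprint of that language could be compared across periods.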
