A multi-terabyte relational database for geo-tagged social network data

Despite their relatively low sampling rate, the freely available, randomly sampled status streams of Twitter are a very useful source of geographically embedded social network data. To analyze statistically the information that Twitter provides through these streams, we collected a year's worth of data and built a multi-terabyte relational database from it. The database is designed for fast data loading and supports a wide range of studies focusing on the statistics and geographic features of social networks, as well as on the linguistic analysis of tweets. In this paper we present the method of data collection, the database design, the data loading procedure, and the special treatment of geo-tagged and multilingual data. We also provide some SQL recipes for computing network statistics.
