Distributed Large-Scale Data Collection in Online Social Networks

The popularity of Online Social Networks (OSNs) and the huge amount of information published in them have established them as one of the main data sources for a variety of research fields. However, designing a large-scale data collection campaign is a major problem for organizations and researchers who aim to address their research questions by analyzing this type of data. OSN platforms provide Application Programming Interfaces (APIs) to third-party developers, which enable them to retrieve and use this data in their applications. However, due to limitations imposed by the OSNs, retrieving large-scale data through these APIs is challenging and time-consuming, resulting in datasets that are either incomplete or outdated. It is practically impossible for an individual scientist or research group to follow an efficient data collection procedure and build a large sample in a short amount of time. In this paper we present a framework for efficient crowd crawling of OSNs. Our framework is based on the use of multiple OSN accounts, which are engaged in an efficient distributed collection process able to circumvent the imposed limitations without violating the terms of use. We present an evaluation of the proposed solution and demonstrate its performance in terms of dataset completeness and timeliness for the case study of Twitter, one of the most popular platforms used in research.
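
To make the idea of spreading a collection job over multiple accounts concrete, the following is a minimal sketch and not the framework described in the paper. It assumes each account has a fixed request quota per time window; the `Account` class, the `schedule` and `completion_time` functions, and the quota numbers are all hypothetical and chosen only for illustration. No network calls are made; the code only plans when each request could start.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Account:
    name: str      # hypothetical account label
    limit: int     # requests allowed per window (assumed value)
    window_s: int  # window length in seconds (assumed value)


def schedule(n_requests: int, accounts: List[Account]) -> List[Tuple[int, str, int]]:
    """Assign each request to an account and an earliest start time (seconds).

    Requests are dealt round-robin. An account that has used up its quota
    waits for its next window instead of exceeding the limit.
    """
    used = {a.name: 0 for a in accounts}
    plan = []
    for i in range(n_requests):
        acct = accounts[i % len(accounts)]
        # Number of full windows this account has already exhausted.
        window_index = used[acct.name] // acct.limit
        plan.append((i, acct.name, window_index * acct.window_s))
        used[acct.name] += 1
    return plan


def completion_time(plan: List[Tuple[int, str, int]]) -> int:
    """Start time of the last window needed to issue every request."""
    return max((start for _, _, start in plan), default=0)


# Illustrative quota: 15 requests per 900-second window (an assumption).
single = [Account("acct-0", 15, 900)]
crowd = [Account(f"acct-{k}", 15, 900) for k in range(4)]

t_single = completion_time(schedule(60, single))
t_crowd = completion_time(schedule(60, crowd))
print(t_single, t_crowd)
```

Under these assumed quotas, one account needs four windows to issue 60 requests, while four accounts fit all of them in the first window, which is the intuition behind the gain in timeliness claimed above.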
