Crowd crawling: towards collaborative data collection for large-scale online social networks

Emerging research on online social networks (OSNs) requires large amounts of data. However, OSN sites typically enforce restrictions on data crawling, such as per-IP request rate limiting. This makes it challenging for an individual research group to collect sufficient data using only its own network resources. In this paper, we introduce and motivate crowd crawling, which allows multiple research groups to crawl data efficiently and collaboratively. The design of crowd crawling addresses several practical challenges, including the diversity of resources among partners, strict request rate limiting by OSN providers, and data fidelity. We implemented a crowd crawling prototype, deployed it on PlanetLab, and demonstrated its performance through evaluations. We have made the datasets crawled in our evaluation publicly available.
