A Faceted Crawler for the Twitter Service

Researchers now have at their disposal valuable data from social networking applications, of which Twitter and Facebook are the most prominent examples. To retrieve this content, the Twitter service provides two distinct Application Programming Interfaces (APIs), a probe-based one and a streaming one, each of which imposes different limitations on the data collection process. In this paper, we present a general architecture that facilitates faceted crawling of the service and thereby simplifies retrieval. We give implementation details of our system and provide a simple way to express the crawling process, i.e., the crawl flow. We experimentally evaluate the system on a variety of faceted crawls, demonstrating its efficacy for the online medium.
