TwitterEcho: a distributed focused crawler to support open research with Twitter data

Modern social network analysis relies on vast quantities of data to infer new knowledge about human relations and communication. In this paper we describe TwitterEcho, an open-source Twitter crawler with a modular, distributed architecture, built to support this kind of research. The crawler lets researchers continuously collect data from particular user communities while respecting the limits imposed by Twitter. We present the core modules of the crawling server, some of which were designed specifically to focus the crawl on the Portuguese Twittosphere. Additional modules can be easily implemented to shift the focus to a different community. Our evaluation of the system shows high crawling performance and coverage.
