On the endogenesis of Twitter's Spritzer and Gardenhose sample streams

Many recent publications deal with trend analysis, event detection, or opinion mining on social media data. Twitter, as the most important microblogging service, is often the focus of this work because it offers free access to large volumes of data. This free access, on which many publications rely, consists of a random subset of the complete public status stream. These publications depend in particular on tweets being uniformly distributed in the sample stream, so to date one has had to trust Twitter's statement that the sample data is indeed uniformly distributed. In our research on the technical properties of Twitter's streaming data, we found evidence that reveals the method Twitter uses to decide which tweets appear in the random sample streams. A closer look at this process suggests possible reasons why Twitter chose this sampling method. To this end, we give an overview of how Twitter's unique tweet IDs are generated and explain the regularities of each part of a tweet ID. This also yields some information about Twitter's tweet ID generating infrastructure and shows what kind of knowledge can be derived from a feature as small as the tweet ID.
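To illustrate what "each part of a tweet ID" can mean in practice, the following minimal sketch splits a tweet ID into its components. It assumes the bit layout of Twitter's publicly documented Snowflake ID scheme (41-bit millisecond timestamp relative to a custom epoch, 5-bit datacenter ID, 5-bit worker ID, 12-bit sequence number); the abstract itself does not spell out this layout, and the function and field names are hypothetical. The sketch only decodes IDs and does not reproduce the sampling rule identified in the paper.

```python
from datetime import datetime, timezone

# Assumption: Snowflake custom epoch in milliseconds (2010-11-04 UTC),
# as published in Twitter's open-source Snowflake project.
TWITTER_EPOCH_MS = 1288834974657


def decode_tweet_id(tweet_id: int) -> dict:
    """Split a Snowflake-style tweet ID into its assumed components."""
    sequence = tweet_id & 0xFFF              # lowest 12 bits
    worker = (tweet_id >> 12) & 0x1F         # next 5 bits
    datacenter = (tweet_id >> 17) & 0x1F     # next 5 bits
    timestamp_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS  # remaining upper bits
    return {
        "timestamp_ms": timestamp_ms,
        "millisecond": timestamp_ms % 1000,  # sub-second part of the timestamp
        "created_at": datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc),
        "datacenter": datacenter,
        "worker": worker,
        "sequence": sequence,
    }


if __name__ == "__main__":
    # Synthetic ID built for illustration only; not a real tweet.
    example_id = (1000 << 22) | (3 << 17) | (7 << 12) | 42
    print(decode_tweet_id(example_id))
```

Decoding the IDs of tweets collected from a sample stream in this way makes it possible to inspect how each component is distributed, which is the kind of regularity the abstract refers to.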
