Is the Sample Good Enough? Comparing Data from Twitter's Streaming API with Twitter's Firehose

Twitter is a social media giant famous for the exchange of short, 140-character messages called "tweets". In the scientific community, the microblogging site is known for its openness in sharing data. It offers a glimpse into its millions of users and billions of tweets through a "Streaming API", which returns a sample of all tweets matching parameters preset by the API user. The API has been used by many researchers, companies, and governmental institutions seeking to answer a diverse array of questions about social media. Its essential drawback is the lack of documentation concerning what data users get and how much of it. This leads researchers to question whether the sampled data is a valid representation of overall activity on Twitter. In this work we set out to answer this question by comparing data collected using Twitter's sampled API service with data collected using the full, albeit costly, Firehose stream, which includes every single published tweet. We compare the two datasets using common statistical metrics as well as metrics that allow us to compare topics, networks, and locations of tweets. The results of our work will help researchers and practitioners understand the implications of using the Streaming API.
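
To make the idea of comparing a sample against the full stream more concrete, the following is a minimal sketch of one possible statistical comparison: the Jensen-Shannon divergence between the hashtag frequency distributions of two collections. It is an illustration only and is not taken from the abstract; the choice of metric, the function names `distribution` and `js_divergence`, and the toy hashtag lists are all assumptions made for this example.

```python
import math
from collections import Counter


def distribution(items):
    """Convert a list of items (e.g. hashtags) into relative frequencies."""
    counts = Counter(items)
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with no items in common.
    """
    keys = set(p) | set(q)
    # Mixture distribution: the pointwise average of p and q.
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl_to_mixture(a):
        # Kullback-Leibler divergence from a to the mixture m.
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


# Hypothetical toy data standing in for hashtags seen in each collection.
firehose_tags = ["news", "news", "sports", "music", "music", "music"]
streaming_tags = ["news", "sports", "music", "music"]

divergence = js_divergence(distribution(firehose_tags),
                           distribution(streaming_tags))
print(f"JS divergence between the two hashtag distributions: {divergence:.4f}")
```

A value near 0 would suggest that the sample reproduces the full stream's hashtag mix closely, while a value near 1 would suggest the two collections share almost nothing. Rank-based measures or network statistics could be substituted in the same way.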
