Tampering with Twitter’s Sample API

Social media data is widely analyzed in computational social science. Twitter, one of the largest social media platforms, is used in research, journalism, business, and government to analyze human behavior at scale. Twitter offers data via three different Application Programming Interfaces (APIs). One of these, Twitter’s Sample API, provides a freely available 1% sample and a costly 10% sample of all Tweets. These data are supposedly random samples of all platform activity. However, we demonstrate that, due to the nature of Twitter’s sampling mechanism, it is possible to deliberately influence these samples, including the extent and content of any topic they contain, and consequently to manipulate the analyses of researchers, journalists, and market and political analysts who trust these data sources. Our analysis also reveals that technical artifacts can accidentally skew Twitter’s samples. The samples should therefore not be regarded as random. Our findings illustrate the critical limitations and general issues of big data sampling, especially in the context of proprietary data and undisclosed details about data handling.
