Redundancy Reduction in Twitter Event Streams

Data from social networks like Twitter is a valuable source for research, but it is full of redundancy, which makes it hard to provide datasets that are large-scale, self-contained, and small at the same time. Data recording is a common problem in social-media-based studies and could be standardized, but this is rarely done. This paper reports on lessons learned from a long-term evaluation study that recorded the complete public sample of the German and English Twitter stream. It proposes a recording solution that simply chunks a linear stream of events to reduce redundancy: if an event is observed multiple times within the time span of a chunk, only the latest observation is written to the chunk. A 10-gigabyte raw Twitter dataset covering 1.2 million tweets from 120,000 users, recorded between June and September 2017, was used to analyze the compression rates that can be expected. The resulting datasets need only between 10% and 20% of the original data size, without losing any event, any metadata, or the relationships between single events. This kind of redundancy-reducing recording makes it possible to curate large-scale (even nation-wide), self-contained, and small datasets of social networks for research in a standardized and reproducible manner.
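The chunking idea described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the event layout `(observed_at, event_id, payload)`, the function name `chunk_latest`, and the fixed chunk length in seconds are assumptions made for illustration, and the sketch assumes that events arrive in chronological order.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical event layout: (observation time in seconds, event id, payload).
Event = Tuple[float, str, dict]


def chunk_latest(events: Iterable[Event],
                 chunk_seconds: float) -> Iterator[Tuple[int, List[Event]]]:
    """Split a chronologically ordered stream into fixed-length time chunks.

    Within one chunk, only the latest observation of each event id is kept.
    Yields (chunk_index, events_in_chunk) pairs.
    """
    current_index = None
    latest: Dict[str, Event] = {}
    for observed_at, event_id, payload in events:
        index = int(observed_at // chunk_seconds)
        # A new time window starts: emit the finished chunk and reset.
        if current_index is not None and index != current_index:
            yield current_index, list(latest.values())
            latest = {}
        current_index = index
        # A later observation of the same id overwrites the earlier one.
        latest[event_id] = (observed_at, event_id, payload)
    if latest:
        yield current_index, list(latest.values())


# Example: tweet "a" is observed twice in the first 60-second chunk.
stream = [
    (0.0, "a", {"likes": 1}),
    (10.0, "b", {"likes": 0}),
    (20.0, "a", {"likes": 5}),
    (70.0, "a", {"likes": 7}),
]
chunks = list(chunk_latest(stream, chunk_seconds=60))
```

In this example the first chunk keeps a single record for tweet "a" (the observation at 20 seconds), while the observation at 70 seconds falls into the next chunk and is stored again. Redundancy is therefore only removed within a chunk, so the chunk length governs how much compression such a scheme can achieve.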
