Many governments impose traditional censorship methods on social media platforms. Instead of removing the targeted content completely, many social media companies, including Twitter, withhold it only from users in the requesting country. Such content therefore remains accessible outside the censored region, which provides an excellent setting in which to study government censorship on social media. We mine such content using the Internet Archive's Twitter Stream Grab. We release a dataset of 583,437 tweets by 155,715 users that were censored between 2012 and July 2020. We also release 4,301 accounts that were censored in their entirety. Additionally, we release a set of 22,083,759 supplemental tweets, made up of all tweets by users with at least one censored tweet as well as instances of other users retweeting the censored users. We provide an exploratory analysis of this dataset. Our dataset will aid not only the study of government censorship but also research on hate speech detection and on the effect of censorship on social media users. The dataset is publicly available at https://doi.org/10.5281/zenodo.4439509
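As a rough illustration of what mining withheld content from an archived stream could look like, the sketch below filters tweet records that carry a non-empty `withheld_in_countries` field, either on the tweet itself or on its author. It assumes the archive has been unpacked into newline-delimited JSON in the Twitter API v1.1 tweet format; the helper names `is_withheld` and `iter_withheld` are hypothetical, and this is not necessarily the procedure used to build the released dataset.

```python
import json


def is_withheld(tweet: dict) -> bool:
    """Return True if the tweet or its author is withheld in any country."""
    # Tweet-level withholding: a non-empty list of country codes.
    if tweet.get("withheld_in_countries"):
        return True
    # Account-level withholding: the same field on the embedded user object.
    user = tweet.get("user") or {}
    return bool(user.get("withheld_in_countries"))


def iter_withheld(lines):
    """Yield parsed tweets that are withheld, skipping malformed lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # assumption: broken lines are simply ignored
        if isinstance(record, dict) and is_withheld(record):
            yield record


# Small synthetic sample (made-up records, not taken from the dataset).
sample_lines = [
    '{"id": 1, "text": "a", "user": {"id": 10}}',
    '{"id": 2, "text": "b", "withheld_in_countries": ["TR"], "user": {"id": 11}}',
    '{"id": 3, "text": "c", "user": {"id": 12, "withheld_in_countries": ["DE"]}}',
    'not json',
    '{"delete": {"status": {"id": 4}}}',
]
withheld_ids = [t["id"] for t in iter_withheld(sample_lines)]
```

In practice the iterable passed to `iter_withheld` would be an open file handle over one decompressed archive file rather than an in-memory list.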