The Pushshift Reddit Dataset

Social media data has become crucial to the advancement of scientific understanding. However, even though it has become ubiquitous, just collecting large-scale social media data requires considerable engineering skill and computational resources. In fact, research is often gated by data engineering problems that must be overcome before analysis can proceed. This has resulted in the recognition of datasets as meaningful research contributions in and of themselves. Reddit, the so-called "front page of the Internet," in particular has been the subject of numerous scientific studies. Although Reddit is relatively open to data acquisition compared to social media platforms like Facebook and Twitter, technical barriers to acquisition remain. Thus, Reddit's millions of subreddits, hundreds of millions of users, and hundreds of billions of comments are relatively accessible, yet time-consuming to collect and analyze systematically. In this paper, we present the Pushshift Reddit dataset. Pushshift is a social media data collection, analysis, and archiving platform that has collected Reddit data since 2015 and made it available to researchers. Pushshift's Reddit dataset is updated in real time and includes historical data back to Reddit's inception. In addition to monthly dumps, Pushshift provides computational tools to aid in searching, aggregating, and performing exploratory analysis on the entirety of the dataset. The Pushshift Reddit dataset makes it possible for social media researchers to reduce the time spent in the data collection, cleaning, and storage phases of their projects.
