A large-scale Twitter dataset for drug safety applications mined from publicly existing resources

With the growing popularity of deep learning models for natural language processing (NLP) tasks, the field of pharmacovigilance, and specifically the identification of adverse drug reactions (ADRs), has an inherent need for large-scale social media datasets. Most researchers spend large amounts of time crawling Twitter or buy expensive pre-curated datasets, and then have humans annotate the data manually; these approaches do not scale well as more and more data keeps flowing into Twitter. In this work we repurpose a publicly available archived dataset of more than 9.4 billion tweets with the objective of creating a very large dataset of drug usage-related tweets. Using existing manually curated datasets from the literature, we then validate our filtered tweets for relevance using machine learning methods, with the end result of a publicly available dataset of 1,181,993 tweets. We provide all code, a detailed procedure for extracting this dataset, and the selected tweet IDs for researchers to use.
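As a rough illustration of the first filtering step, the sketch below scans JSON-encoded tweets for drug names. It is a minimal sketch under stated assumptions, not the authors' released code: the term list `DRUG_TERMS`, the function `filter_drug_tweets`, and the record fields `id_str` and `text` are all hypothetical choices made for this example.

```python
import json
import re

# Hypothetical drug-term list; the dictionary actually used is not given here.
DRUG_TERMS = ["ibuprofen", "adderall", "metformin"]

# Match any term as a whole word, ignoring case.
DRUG_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(term) for term in DRUG_TERMS) + r")\b",
    re.IGNORECASE,
)


def filter_drug_tweets(lines):
    """Yield (tweet_id, text) for JSON-encoded tweets that mention a drug term.

    Assumes one JSON object per line with "id_str" and "text" fields.
    """
    for line in lines:
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed records
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not tweet objects
        text = tweet.get("text") or ""
        if DRUG_PATTERN.search(text):
            yield tweet.get("id_str"), text


if __name__ == "__main__":
    sample = [
        '{"id_str": "1", "text": "Took Adderall before class"}',
        '{"id_str": "2", "text": "Nice weather today"}',
    ]
    for tweet_id, text in filter_drug_tweets(sample):
        print(tweet_id, text)
```

Keeping only the tweet IDs of the matches is what would allow a filtered set like this to be shared and re-extracted by others; the relevance validation with machine learning methods described above would be a separate step applied to the filtered output.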
