Utilizing deep learning and graph mining to identify drug use on Twitter data

The collection and analysis of social media data has become a useful mechanism for studying the mental activity and behavioral tendencies of users. Using a collected set of Twitter data, a model was developed to predict tweets that reference drugs positively, from which trends and correlations can be determined. Social media data (tweets and their attributes) were collected and processed using topic-related keywords, such as drug slang and use conditions (methods of drug consumption). Preprocessing the candidate tweets resulted in a dataset of 3,696,150 rows. The predictive power of multiple classification methods was compared, including SVM, XGBoost, BERT, and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets. SVM and XGBoost were employed first; they achieved accuracies of 59.33% and 54.90%, with AUCs of 0.87 and 0.71, respectively. These values indicate low predictive capability with little discrimination. In contrast, both CNN-based classifiers showed a substantial improvement. The first was trained with 2,661 manually labeled samples, while the second added synthetically generated tweets for a total of 12,142 samples. Their accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91, respectively. Association rule mining in conjunction with the CNN-based classifier showed that keywords such as “smoke”, “cocaine”, and “marijuana” had a high likelihood of triggering a drug-positive classification. Predictive analysis with a CNN is promising, whereas attribute-based models showed little predictive capability and were not suitable for analyzing text data. This research found that the commonly mentioned drugs corresponded to frequently used illicit substances, supporting the practical usefulness of this system. Lastly, the synthetically generated set increased accuracy and improved predictive capability.
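
To make the association rule step concrete, the following is a minimal sketch of how the support and confidence of a rule of the form "keyword implies drug-positive" could be computed from classifier output. It is not the pipeline used in this work: the function name `keyword_rule_stats`, the whitespace tokenization, and the four toy tweets with their labels are all hypothetical and chosen only for illustration.

```python
from collections import Counter


def keyword_rule_stats(tweets, labels, keywords):
    """Return {keyword: (support, confidence)} for the rule keyword -> positive.

    support    = fraction of all tweets that contain the keyword AND are positive
    confidence = fraction of tweets containing the keyword that are positive
    Keywords that never occur are omitted from the result.
    """
    n = len(tweets)
    with_kw = Counter()      # tweets containing the keyword
    with_kw_pos = Counter()  # tweets containing the keyword and labeled positive
    for text, label in zip(tweets, labels):
        tokens = set(text.lower().split())  # naive tokenization (assumption)
        for kw in keywords:
            if kw in tokens:
                with_kw[kw] += 1
                if label == 1:
                    with_kw_pos[kw] += 1
    return {
        kw: (with_kw_pos[kw] / n, with_kw_pos[kw] / with_kw[kw])
        for kw in keywords
        if with_kw[kw]
    }


# Hypothetical toy data; 1 = classified drug-positive, 0 = not.
tweets = [
    "about to smoke after work",
    "smoke alarm went off again",
    "cocaine bust on the news",
    "great coffee this morning",
]
predicted = [1, 0, 1, 0]

rules = keyword_rule_stats(tweets, predicted, ["smoke", "cocaine", "marijuana"])
for kw, (support, confidence) in rules.items():
    print(f"{kw}: support={support:.2f}, confidence={confidence:.2f}")
```

A high confidence for a keyword means that tweets containing it are usually classified as drug-positive, which is the kind of relationship reported above for “smoke”, “cocaine”, and “marijuana”. A full association rule miner (for example, Apriori) would also consider multi-keyword itemsets and apply minimum support thresholds.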
