Unsupervised Classification of Health Content on Reddit

Online forums are easily accessible to the public and are useful for acquiring and disseminating health information; however, advanced methods have to be applied to interpret their content correctly. For this reason, we propose an unsupervised, embedding-based approach to health content classification. Specifically, we utilise word embeddings and a clustering method to create content-sensitive word clusters; we then align the health content with the clusters, classifying it into illnesses, medications, and disease agents. The results suggest that a cosine similarity of 0.70 is preferred for creating informative clusters as well as for automatically generating synonyms, acronyms, abbreviations, and common misspellings. Our approach demonstrates the potential of discussion forums, in particular Reddit, not only for unsupervised content classification but also for dictionary building from informal health content.
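
The pipeline outlined above (embed words, group them by cosine similarity, then align a post with the labelled groups) can be illustrated with a minimal sketch. The abstract does not name the clustering algorithm, so the greedy seed-based grouping below is an assumption chosen for brevity, not the authors' method. The three-dimensional vectors, the word list, the hand-assigned cluster labels, and the function names `cosine`, `cluster_words`, and `classify` are likewise hypothetical; only the 0.70 threshold comes from the text.

```python
import math
from collections import Counter

# Toy 3-d vectors standing in for trained word embeddings (made-up values).
EMBEDDINGS = {
    "ibuprofen": (0.90, 0.10, 0.00),
    "ibuprofin": (0.88, 0.15, 0.02),  # a common misspelling
    "advil": (0.80, 0.20, 0.10),
    "migraine": (0.10, 0.90, 0.10),
    "headache": (0.15, 0.85, 0.05),
    "virus": (0.00, 0.10, 0.95),
    "bacteria": (0.05, 0.15, 0.90),
}

# Hypothetical labels, assigned by hand to each cluster's seed word.
LABELS = {"ibuprofen": "medication", "migraine": "illness", "virus": "disease agent"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_words(embeddings, threshold=0.70):
    """Greedy grouping (an assumption): a word joins the first cluster whose
    seed it matches at or above the threshold, otherwise it seeds a new one."""
    clusters = []  # list of (seed_word, member_words)
    for word, vec in embeddings.items():
        for seed, members in clusters:
            if cosine(vec, embeddings[seed]) >= threshold:
                members.append(word)
                break
        else:
            clusters.append((word, [word]))
    return clusters


def classify(text, clusters, labels):
    """Return the label whose cluster words occur most often in the text,
    or None if no token belongs to a labelled cluster."""
    word_to_label = {
        word: labels[seed]
        for seed, members in clusters
        if seed in labels
        for word in members
    }
    counts = Counter(
        word_to_label[tok] for tok in text.lower().split() if tok in word_to_label
    )
    return counts.most_common(1)[0][0] if counts else None


clusters = cluster_words(EMBEDDINGS)
label = classify("took advil and ibuprofin for my migraine", clusters, LABELS)
```

In this toy setting the misspelling "ibuprofin" lands in the same cluster as "ibuprofen", which mirrors the dictionary-building use described above. With real embeddings the choice of threshold would trade cluster purity against coverage.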
