Textual and Informational Characteristics of Health-Related Social Media Content: A Study of Drug Review Forums

Health-related social media sites, where people post information about their diseases and treatments, are proliferating. These sites can be mined for information about users' experiences with these diseases and treatments. This paper reports an initial study of the informational content and linguistic characteristics of user postings on drug review discussion forums, with an emphasis on the opinions and sentiments expressed. We investigate the knowledge these postings contain and the information that can be extracted from them. We harvested postings from three websites carrying different kinds of user-generated reviews. We analyzed the corpus to identify the most-reviewed drugs, the vocabulary used (focusing on opinion words), and textual characteristics such as posting length, sentence length, and the proportions of the various parts of speech. We performed semantic tagging with concepts from the UMLS Metathesaurus and analyzed the distribution of medical concepts in the corpus. Our results indicate that the corpus covers a large variety of drugs. Drugs related to depression, anxiety, weight loss, and pain relief are the most frequently reviewed. Although the linguistic quality of the text is lower than in scientific writing, the medical content is very rich. Because the corpus contains many opinion terms, it is suitable for opinion mining.
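
To make the textual measures mentioned above concrete, the following is a minimal sketch of how posting length, sentence length, and the share of opinion words could be computed for a single posting. It is not the procedure used in this study: the function names, the naive regular-expression tokenizer and sentence splitter, the tiny `OPINION_WORDS` lexicon, and the example posting are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real analysis
# would rely on a full, published opinion or sentiment lexicon.
OPINION_WORDS = {"good", "bad", "great", "terrible", "better", "worse", "awful", "helpful"}


def tokenize(text):
    """Return lowercase word tokens, keeping apostrophes inside words."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def split_sentences(text):
    """Naively split on '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def posting_stats(text):
    """Compute simple length and opinion-word statistics for one posting."""
    tokens = tokenize(text)
    sentences = split_sentences(text)
    opinion = [t for t in tokens if t in OPINION_WORDS]
    return {
        "n_tokens": len(tokens),
        "n_sentences": len(sentences),
        "mean_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "opinion_ratio": len(opinion) / len(tokens) if tokens else 0.0,
        "opinion_counts": Counter(opinion),
    }


# Invented example posting, not taken from the corpus.
example = "This drug was great for my anxiety. The side effects were terrible at first!"
stats = posting_stats(example)
print(stats)
```

Averaging these per-posting values over a corpus would give corpus-level figures of the kind described above. Part-of-speech proportions and UMLS concept tagging need dedicated tools and are deliberately left out of this sketch.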