Mining Health-Related Issues in Consumer Product Reviews by Using Scalable Text Analytics

In an era when most of our life activities are digitized and recorded, opportunities abound to gain insights about population health. Online product reviews present a unique data source that is currently underexplored. Health-related information, although scarce, can be systematically mined in online product reviews. Leveraging natural language processing and machine learning tools, we were able to mine 1.3 million grocery product reviews for health-related information. The objectives of the study were as follows: (1) conduct quantitative and qualitative analysis on the types of health issues found in consumer product reviews; (2) develop a machine learning classifier to detect reviews that contain health-related issues; and (3) gain insights about the task characteristics and challenges for text analytics to guide future research.

[1]  Karin M. Verspoor,et al.  Towards Early Discovery of Salient Health Threats: A Social Media Emotion Classification Technique , 2016, PSB.

[2]  Manabu Torii,et al.  Risk factor detection for heart disease by applying text analytics in electronic medical records , 2015, J. Biomed. Informatics.

[3]  Sanjay Ghemawat,et al.  MapReduce: Simplified Data Processing on Large Clusters , 2004, OSDI.

[4]  James Pustejovsky,et al.  Are You Sure That This Happened? Assessing the Factuality Degree of Events in Text , 2012, CL.

[5]  S. Schneeweiss Learning from big health care data. , 2014, The New England journal of medicine.

[6]  Graciela Gonzalez-Hernandez,et al.  Utilizing social media data for pharmacovigilance: A review , 2015, J. Biomed. Informatics.

[7]  Viju Raghupathi,et al.  Big data analytics in healthcare: promise and potential , 2014, Health Information Science and Systems.

[8]  Tong Zhang,et al.  Text Categorization Based on Regularized Linear Classification Methods , 2001, Information Retrieval.

[9]  Olivier Bodenreider,et al.  Aggregating UMLS Semantic Types for Reducing Conceptual Complexity , 2001, MedInfo.

[10]  Armin R. Mikler,et al.  Text and Structural Data Mining of Influenza Mentions in Web and Social Media , 2010, International journal of environmental research and public health.

[11]  Prakash M. Nadkarni,et al.  Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions , 2011, J. Am. Medical Informatics Assoc..

[12]  Abeed Sarker,et al.  Pharmacovigilance from social media: mining adverse drug reaction mentions using sequence labeling with word embedding cluster features , 2015, J. Am. Medical Informatics Assoc..

[13]  Guy Lapalme,et al.  Query-Based Summarization of Customer Reviews , 2007, Canadian Conference on AI.

[14]  Christopher D. Manning,et al.  Introduction to Information Retrieval , 2010, J. Assoc. Inf. Sci. Technol..

[15]  David A. Ferrucci,et al.  UIMA: an architectural approach to unstructured information processing in the corporate research environment , 2004, Natural Language Engineering.

[16]  Abeed Sarker,et al.  Finding Potentially Unsafe Nutritional Supplements from User Reviews with Topic Modeling , 2016, PSB.

[17]  Elizabeth Sklar,et al.  Longitudinal analysis of discussion topics in an online breast cancer community using convolutional neural networks , 2016, J. Biomed. Informatics.

[18]  Yaoyun Zhang,et al.  Clinical Abbreviation Disambiguation Using Neural Word Embeddings , 2015, BioNLP@IJCNLP.

[19]  Shuying Shen,et al.  2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text , 2011, J. Am. Medical Informatics Assoc..

[20]  John F. Hurdle,et al.  Extracting Information from Textual Documents in the Electronic Health Record: A Review of Recent Research , 2008, Yearbook of Medical Informatics.

[21]  Songbo Tan,et al.  A survey on sentiment detection of reviews , 2009, Expert Syst. Appl..

[22]  Richard Bonneau,et al.  Text Classification for Automatic Detection of E-Cigarette Use and Use for Smoking Cessation from Twitter: A Feasibility Pilot , 2016, PSB.

[23]  Elena García Barriocanal,et al.  Evaluating content quality and helpfulness of online product reviews: The interplay of review helpfulness vs. review content , 2012, Electron. Commer. Res. Appl..

[24]  Michael J. Franklin,et al.  Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing , 2012, NSDI.

[25]  Jure Leskovec,et al.  Inferring Networks of Substitutable and Complementary Products , 2015, KDD.

[26]  Giannis Tzimas,et al.  Large Scale Sentiment Analysis on Twitter with Spark , 2016, EDBT/ICDT Workshops.