Research and applications: Induced lexico-syntactic patterns improve information extraction from online medical forums

OBJECTIVE To reliably extract two entity types, symptoms and conditions (SCs), and drugs and treatments (DTs), from patient-authored text (PAT) by learning lexico-syntactic patterns from data annotated with seed dictionaries.

BACKGROUND AND SIGNIFICANCE Despite the increasing quantity of PAT (eg, online discussion threads), tools for identifying medical entities in PAT are limited. When applied to PAT, existing tools either fail to identify specific entity types or perform poorly. Identification of SC and DT terms in PAT would enable exploration of efficacy and side effects for not only pharmaceutical drugs, but also for home remedies and components of daily care.

MATERIALS AND METHODS We use SC and DT term dictionaries compiled from online sources to label several discussion forums from MedHelp (http://www.medhelp.org). We then iteratively induce lexico-syntactic patterns corresponding strongly to each entity type to extract new SC and DT terms.

RESULTS Our system is able to extract symptom descriptions and treatments absent from our original dictionaries, such as 'LADA', 'stabbing pain', and 'cinnamon pills'. Our system extracts DT terms with 58-70% F1 score and SC terms with 66-76% F1 score on two forums from MedHelp. We show improvements over MetaMap, OBA, a conditional random field-based classifier, and a previous pattern learning approach.

CONCLUSIONS Our entity extractor based on lexico-syntactic patterns is a successful and preferable technique for identifying specific entity types in PAT. To the best of our knowledge, this is the first paper to extract SC and DT entities from PAT. We exhibit learning of informal terms often used in PAT but missing from typical dictionaries.
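The loop described under MATERIALS AND METHODS (label text with a seed dictionary, induce patterns that correlate strongly with the labeled terms, then use those patterns to extract new terms) can be illustrated with a toy sketch. This is not the authors' system: the pattern form (one word of context on each side of a single token), the scoring rule (a simple precision threshold), the function names, and the example sentences are all assumptions made for illustration. The actual patterns and scoring used in the paper are not given in the abstract.

```python
from collections import Counter

def contexts(tokens):
    """Yield (left word, token, right word) for each token, with sentence padding."""
    padded = ["<s>"] + tokens + ["</s>"]
    for i in range(1, len(padded) - 1):
        yield padded[i - 1], padded[i], padded[i + 1]

def induce_patterns(sentences, known, min_precision=0.5, min_support=2):
    """Keep contexts that surround known terms often enough and precisely enough.

    Hypothetical scoring: precision = (matches on known terms) / (all matches).
    """
    positive, total = Counter(), Counter()
    for tokens in sentences:
        for left, token, right in contexts(tokens):
            total[(left, right)] += 1
            if token in known:
                positive[(left, right)] += 1
    return {
        ctx for ctx in total
        if positive[ctx] >= min_support
        and positive[ctx] / total[ctx] >= min_precision
    }

def extract(sentences, patterns, known):
    """Return tokens matched by a learned pattern that are not already known."""
    return {
        token
        for tokens in sentences
        for left, token, right in contexts(tokens)
        if (left, right) in patterns and token not in known
    }

def bootstrap(sentences, seeds, iterations=2):
    """Alternate pattern induction and term extraction, starting from seeds."""
    learned = set(seeds)
    for _ in range(iterations):
        patterns = induce_patterns(sentences, learned)
        new_terms = extract(sentences, patterns, learned)
        if not new_terms:
            break
        learned |= new_terms
    return learned - set(seeds)

# Toy data (invented for this sketch, not taken from MedHelp).
SENTENCES = [
    "i take metformin for diabetes".split(),
    "i take insulin for diabetes".split(),
    "i take cinnamon for diabetes".split(),
    "i like walks for fun".split(),
]
SEEDS = {"metformin", "insulin"}

print(bootstrap(SENTENCES, SEEDS))
```

Here the context "take _ for" is induced from the two seed drugs and then picks up "cinnamon", a term absent from the seed dictionary. A realistic system would also need multi-word terms (such as 'stabbing pain'), richer pattern features, and safeguards against semantic drift across iterations.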
