Medical concept normalization in social media posts with recurrent neural networks

Text mining of scientific libraries and social media has already proven to be a reliable tool for drug repurposing and hypothesis generation. Medical concept normalization is the task of mapping a disease mention to a concept in a controlled vocabulary, typically the standard thesaurus in the Unified Medical Language System (UMLS). The task is challenging because health care professionals and the lay public writing on social media use medical terminology differently. To bridge this gap, we use sequence learning with recurrent neural networks and semantic representations of one- or multi-word expressions: we develop end-to-end architectures directly tailored to the task, including bidirectional Long Short-Term Memory networks, Gated Recurrent Units with an attention mechanism, and additional semantic similarity features based on UMLS. Our evaluation on a standard benchmark shows that recurrent neural networks improve on an effective classification baseline based on convolutional neural networks. Both a qualitative examination of mentions discovered in a dataset of user reviews collected from popular online health information platforms and a quantitative evaluation show improvements in the semantic representation of health-related expressions in social media.
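As a rough illustration of how the components named above could fit together, the following NumPy sketch pools a sequence of recurrent states with attention, appends cosine-similarity features between the mention and candidate concepts, and applies a softmax over concepts. It is a minimal sketch under stated assumptions, not the authors' implementation: the random `states` array merely stands in for the outputs of a bidirectional LSTM or GRU, and all names (`attention_pool`, `cosine_features`, `classify`), dimensions, and parameters are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def attention_pool(states, w):
    """Weight each time step by a learned score and sum.

    states: (T, H) array standing in for recurrent-network outputs.
    w:      (H,) hypothetical attention parameter vector.
    """
    weights = softmax(states @ w)      # (T,), sums to 1
    return weights @ states, weights   # pooled vector (H,), weights (T,)


def cosine_features(mention_vec, concept_vecs):
    """Cosine similarity between one mention vector and each concept vector."""
    num = concept_vecs @ mention_vec
    den = np.linalg.norm(concept_vecs, axis=1) * np.linalg.norm(mention_vec)
    return num / (den + 1e-12)


def classify(states, w, mention_vec, concept_vecs, W_out, b_out):
    """Concatenate pooled states with similarity features, then softmax."""
    pooled, _ = attention_pool(states, w)
    sims = cosine_features(mention_vec, concept_vecs)
    features = np.concatenate([pooled, sims])
    return softmax(W_out @ features + b_out)


# Toy dimensions and random parameters (assumptions for illustration only).
rng = np.random.default_rng(0)
T, H, D, C = 5, 8, 6, 3                  # tokens, state size, embedding size, concepts
states = rng.normal(size=(T, H))         # placeholder for BiLSTM/GRU outputs
w = rng.normal(size=H)
mention_vec = rng.normal(size=D)         # e.g. an averaged word embedding
concept_vecs = rng.normal(size=(C, D))   # e.g. embeddings of concept names
W_out = rng.normal(size=(C, H + C))
b_out = np.zeros(C)

pooled, weights = attention_pool(states, w)
probs = classify(states, w, mention_vec, concept_vecs, W_out, b_out)
predicted_concept = int(np.argmax(probs))
```

In a trained model the parameters would be learned jointly with the recurrent encoder; here they are random, so the predicted concept index carries no meaning beyond showing the data flow.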
