MedType: Improving Medical Entity Linking with Semantic Type Prediction

Medical entity linking is the task of identifying and standardizing medical concepts mentioned in unstructured text. Most existing methods adopt a three-step approach: (1) detecting mentions, (2) generating a list of candidate concepts, and (3) picking the best concept among them. In this paper, we address the overgeneration of candidate concepts in the candidate generation module, the most under-studied component of medical entity linking. To this end, we present MedType, a fully modular system that prunes irrelevant candidate concepts based on the predicted semantic type of an entity mention. We incorporate MedType into five off-the-shelf toolkits for medical entity linking and demonstrate that it consistently improves entity linking performance across several benchmark datasets. To address the dearth of annotated training data for medical entity linking, we present WikiMed and PubMedDS, two large-scale medical entity linking datasets, and demonstrate that pre-training MedType on these datasets further improves entity linking performance. We make our source code and datasets publicly available for medical entity linking research.
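
The pruning idea described above can be illustrated with a minimal sketch. This is not the MedType implementation: the `Candidate` class, the `prune_candidates` function, the placeholder concept identifiers, and the fallback to the unpruned list are all assumptions made for illustration. It only shows how a predicted semantic type might be used to filter the output of a candidate generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical candidate concept returned by a candidate generator."""
    cui: str                   # placeholder identifier, not a real concept ID
    name: str
    semantic_types: frozenset  # semantic types attached to the concept


def prune_candidates(candidates, predicted_types):
    """Keep candidates that share at least one semantic type with the prediction."""
    predicted = set(predicted_types)
    kept = [c for c in candidates if c.semantic_types & predicted]
    # Assumption of this sketch: if pruning would remove every candidate,
    # fall back to the original list rather than returning nothing.
    return kept if kept else list(candidates)


# Toy candidates for the ambiguous mention "cold" (illustrative values only).
candidates = [
    Candidate("X001", "Common cold", frozenset({"Disease or Syndrome"})),
    Candidate("X002", "Cold temperature", frozenset({"Natural Phenomenon or Process"})),
]

# Suppose a type predictor labels the mention as a disease.
pruned = prune_candidates(candidates, {"Disease or Syndrome"})
print([c.name for c in pruned])
```

Because the filter sits between candidate generation and disambiguation, a step like this could be added to an existing linker without changing its other components, which is consistent with the modular design the abstract describes.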
