Linking entities through an ontology using word embeddings and syntactic re-ranking

Background: Although the biomedical domain has an enormous number of textual resources, manually curated resources currently cover only a small part of the existing knowledge. The vast majority of this information is in unstructured form and contains nonstandard naming conventions. Named entity recognition, the identification of entity names in text, is not adequate without a standardization step. Linking each identified entity mention to an ontology/dictionary concept is essential for making sense of the identified entities. This paper presents an unsupervised approach for linking named entities to concepts in an ontology/dictionary. The approach normalizes biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces and a syntactic parser to give higher weight to the most informative word in each named entity mention.

Results: We applied the proposed method to two normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data, and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data.

Conclusions: The core contribution of this paper is a syntax-based way of combining individual word vectors to form vectors for named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.
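The idea of combining word vectors with extra weight on the most informative word can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-dimensional embeddings, the concept identifiers, the `head_weight` value of 2.0, and the function names are all hypothetical, and the index of the most informative word is passed in by hand rather than obtained from a syntactic parser. The sketch assumes a weighted average of word vectors and cosine similarity as the comparison measure.

```python
import math

def phrase_vector(tokens, head_index, embeddings, head_weight=2.0):
    """Weighted average of word vectors; the head word counts head_weight times."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for i, token in enumerate(tokens):
        vec = embeddings.get(token)
        if vec is None:
            continue  # skip out-of-vocabulary words
        w = head_weight if i == head_index else 1.0
        for d in range(dim):
            total[d] += w * vec[d]
        weight_sum += w
    if weight_sum == 0.0:
        return total  # all-zero vector when no token is known
    return [x / weight_sum for x in total]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def link(mention, concepts, embeddings, head_weight=2.0):
    """Return the id of the concept whose vector is closest to the mention.

    mention is (tokens, head_index); concepts maps id -> (tokens, head_index).
    """
    tokens, head = mention
    mention_vec = phrase_vector(tokens, head, embeddings, head_weight)

    def score(concept_id):
        c_tokens, c_head = concepts[concept_id]
        concept_vec = phrase_vector(c_tokens, c_head, embeddings, head_weight)
        return cosine(mention_vec, concept_vec)

    return max(concepts, key=score)

# Toy data, invented for illustration only.
EMBEDDINGS = {
    "human": [1.0, 0.0, 0.0],
    "gut": [0.0, 1.0, 0.0],
    "intestine": [0.0, 0.9, 0.1],
}
CONCEPTS = {
    "C_INTESTINE": (["intestine"], 0),
    "C_HUMAN": (["human"], 0),
}
MENTION = (["human", "gut"], 1)  # "gut" is taken to be the head word

weighted = link(MENTION, CONCEPTS, EMBEDDINGS, head_weight=2.0)
unweighted = link(MENTION, CONCEPTS, EMBEDDINGS, head_weight=1.0)
print(weighted, unweighted)
```

In this toy setup, a plain average lets the modifier "human" pull the mention toward the wrong concept, while the extra weight on the head word "gut" moves it toward the intestine concept. In practice the head index would come from a parser and the vectors from embeddings trained on biomedical text.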
