Augmenting the Medical Subject Headings vocabulary with semantically rich variants to improve disease mention normalisation

We extended our existing entity normalisation methods as part of our contribution to the Disease Named Entity Recognition and Normalisation subtask of the Chemical-Disease Relation (CDR) track of BioCreative V. Our approach incorporates semantics in two ways: (1) by adding corpus-derived semantic variants to the Medical Subject Headings (MeSH) vocabulary, and (2) by automatically translating medical root words and affixes into potential variants. The official evaluation shows that combining both forms of semantic enrichment gives our best performance on the disease name normalisation task: an F-score of 85.56%, with precision of 89.51% and recall of 81.94%. We have made our methods available as a BioC-compliant Web service.
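As an illustration of the second form of enrichment, the following minimal Python sketch shows how translating root words and affixes could generate lay variants of a vocabulary term, which are then added to a lookup table used for normalisation. It is not the implementation evaluated above: the translation tables, the function names (`affix_variants`, `build_lexicon`, `normalise`) and the concept identifier `X001` are all hypothetical and chosen only for the example.

```python
# Illustrative sketch only: tables, names and IDs below are assumptions,
# not the resources or code used in the evaluated system.

# Hypothetical translations of medical roots (stored without linking vowel).
ROOT_GLOSSES = {
    "nephr": ["kidney", "renal"],
    "hepat": ["liver", "hepatic"],
    "cardi": ["heart", "cardiac"],
}

# Hypothetical translations of medical suffixes.
SUFFIX_GLOSSES = {
    "pathy": ["disease", "disorder"],
    "itis": ["inflammation"],
}


def affix_variants(term):
    """Return lay variants of `term` built by translating its root and suffix."""
    term = term.lower()
    variants = set()
    for root, root_glosses in ROOT_GLOSSES.items():
        if not term.startswith(root):
            continue
        rest = term[len(root):]
        # Drop the linking vowel "o" (as in nephr-o-pathy) if the remainder
        # is not itself a known suffix.
        if rest not in SUFFIX_GLOSSES and rest.startswith("o"):
            rest = rest[1:]
        for suffix_gloss in SUFFIX_GLOSSES.get(rest, []):
            for root_gloss in root_glosses:
                variants.add(f"{root_gloss} {suffix_gloss}")
    return variants


def build_lexicon(vocabulary):
    """Map every name and generated variant to its concept identifier."""
    lexicon = {}
    for concept_id, name in vocabulary.items():
        lexicon[name.lower()] = concept_id
        for variant in affix_variants(name):
            # Keep the first concept seen for a variant; a real system
            # would need a disambiguation strategy here.
            lexicon.setdefault(variant, concept_id)
    return lexicon


def normalise(mention, lexicon):
    """Look up a disease mention; returns None when no entry matches."""
    return lexicon.get(mention.lower())


if __name__ == "__main__":
    lexicon = build_lexicon({"X001": "Nephropathy"})  # placeholder ID
    print(normalise("kidney disease", lexicon))
```

With the variants added, a mention such as "renal disorder" resolves to the same concept as "nephropathy" even though the two strings share no tokens. Exact dictionary lookup is used here purely for brevity; approximate string matching could be substituted without changing the variant generation step.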
