pyMeSHSim: an integrative Python package to realize biomedical named entity recognition, normalization and comparison

Motivation: A growing number of disease-causal genes have been identified through different methods, yet there is still no uniform biomedical named entity (bio-NE) annotation of disease phenotypes. Furthermore, semantic similarity comparison between two bio-NE annotations, such as disease descriptions, has become important for data integration and systems genetics analysis.

Methods: The package pyMeSHSim performs bio-NE recognition using MetaMap, a natural language processing tool that maps text to Unified Medical Language System (UMLS) concepts. To map the UMLS concepts to MeSH, pyMeSHSim embeds an in-house dataset containing the Medical Subject Headings (MeSH) main headings (MHs), supplementary concept records (SCRs) and the relations between them. Based on this dataset, pyMeSHSim implements four information content (IC) based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms.

Results: To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. The inclusion of SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in recognizing OMIM phenotypes. In the curation of GWAS phenotypes, pyMeSHSim and previous manual work recognized the same MeSH terms for 276 of 461 GWAS phenotypes, and the correlation between the semantic similarities calculated by pyMeSHSim and by another semantic analysis tool, meshes, ranged from 0.53 to 0.97.

Conclusion: With an embedded dataset that includes both MeSH MHs and SCRs, the integrative MeSH tool pyMeSHSim performs disease recognition, normalization and comparison for biomedical text mining.

Availability: The package's source code and test datasets are available under the GPLv3 license at https://github.com/luozhhub/pyMeSHSim
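To make the idea of an IC-based similarity concrete, the following is a minimal sketch of two widely used measures, Resnik and Lin, on a toy hierarchy. The abstract does not name the four IC-based algorithms that pyMeSHSim implements, and this code does not use the pyMeSHSim API; the term names, hierarchy, counts and function names are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical toy hierarchy (term -> list of parents); not real MeSH data.
PARENTS = {
    "Diseases": [],
    "Neoplasms": ["Diseases"],
    "Carcinoma": ["Neoplasms"],
    "Sarcoma": ["Neoplasms"],
    "Infections": ["Diseases"],
}

# Made-up annotation counts attached directly to each term.
COUNTS = {"Diseases": 0, "Neoplasms": 2, "Carcinoma": 5, "Sarcoma": 3, "Infections": 10}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC(t) = -log p(t), where p(t) covers the term and all its descendants."""
    total = sum(COUNTS.values())
    freq = sum(c for t, c in COUNTS.items() if term in ancestors(t))
    return math.log(total / freq)  # equals -log(freq / total)


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def lin(a, b):
    """Lin similarity: Resnik score normalized by the ICs of both terms."""
    denom = information_content(a) + information_content(b)
    return 2 * resnik(a, b) / denom if denom else 0.0


if __name__ == "__main__":
    print(resnik("Carcinoma", "Sarcoma"))
    print(lin("Carcinoma", "Sarcoma"))
    print(lin("Carcinoma", "Infections"))
```

In this toy setup, two terms that share only the root have a similarity of zero, because the root covers every annotation and therefore carries no information. Terms under a more specific common ancestor score higher.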
