Identifying named entities from PubMed® for enriching semantic categories

Background: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections is rarely used in biomedical literature; for example, only 13% of UMLS terms appear in MEDLINE®.

Results: We propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The approach uses simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier then filters out noisy phrases. In our experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords “gene”, “protein”, “disease”, “cell” and “cells”.

Conclusions: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation. However, a simple, automated solution with high precision provides a convenient way to enrich semantic categories by incorporating terms obtained from the literature.
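As a rough illustration of the headword-based candidate selection mentioned in the abstract, the following Python sketch groups noun phrases by their final word. It is only a sketch under stated assumptions: the abstract does not specify the three NLP rules or the classifier, so the function name `select_candidates`, the "final token is the headword" rule, and the requirement of at least two tokens are all hypothetical choices made for this example, and the classifier-based filtering step is not shown.

```python
from collections import defaultdict

# Headwords listed in the abstract.
HEADWORDS = {"gene", "protein", "disease", "cell", "cells"}


def select_candidates(noun_phrases, headwords=HEADWORDS):
    """Group multi-word noun phrases by headword.

    Assumption (not from the paper): the headword is the final
    whitespace-separated token, and bare headwords are skipped because
    they add no new term to a category.
    """
    grouped = defaultdict(list)
    for phrase in noun_phrases:
        tokens = phrase.strip(" .,;:").split()
        if len(tokens) < 2:
            continue  # a bare headword is not a useful candidate
        head = tokens[-1].lower()
        if head in headwords:
            grouped[head].append(phrase)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical input: noun phrases already produced by a chunker.
    phrases = [
        "tumor suppressor gene",
        "heat shock protein",
        "protein",
        "gene expression",
        "T cells",
    ]
    for head, found in select_candidates(phrases).items():
        print(head, found)
```

In a pipeline like the one the abstract outlines, the phrases returned here would only be candidates; a trained classifier would still be needed to discard noisy ones before they are added to a semantic category.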
