Term identification in the biomedical literature

Sophisticated information technologies are needed for effective data acquisition and integration from the growing body of biomedical literature. Successful term identification is key to accessing the information stored in the literature, because it is the terms (and their relationships) that convey knowledge across scientific articles. Owing to the complexity of a dynamically changing biomedical terminology, term identification has been recognized as the current bottleneck in text mining and, as a consequence, has become an important research topic in both the natural language processing and biomedical communities. This article gives an overview of state-of-the-art approaches to term identification. The process of identifying terms is analysed in three steps: term recognition, term classification, and term mapping. For each step, the main approaches and general trends, along with the major problems, are discussed. By assessing previous work in the context of the overall term identification process, the review also tries to delineate needs for future work in the field.
