Using MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles

Abbreviations and acronyms are widely used in biomedical literature. Because many of them carry important content, information retrieval and extraction benefit from identifying their meanings. Many abbreviations and acronyms are ambiguous, however, so it is important to map them to their full forms, which ultimately represent their meanings. In this study, we present a semi-supervised method that uses MEDLINE as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. We first automatically generated a dictionary of abbreviation-full form pairs from MEDLINE abstracts, using a rule-based system that maps abbreviations to full forms when the full forms are defined in the abstracts. We then trained supervised machine-learning algorithms on the MEDLINE abstracts and applied them in a semi-supervised fashion to predict the full forms of abbreviations in full-text journal articles. We report up to 92% prediction precision and up to 91% coverage.
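
The two-step idea can be illustrated with a minimal sketch. It is not the system evaluated in the study. It assumes that definitions take the form "full form (ABBR)", that a full form is accepted only when the initials of the preceding words spell the abbreviation, and that a small naive Bayes model over abstract words stands in for the supervised learner. The names `find_definitions` and `SenseClassifier` and the example abstracts are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Assumed pattern: a short form in parentheses, e.g. "(AD)".
PAREN = re.compile(r"\(([A-Za-z][A-Za-z0-9-]{1,9})\)")


def find_definitions(text):
    """Return (abbreviation, full form) pairs written as 'full form (ABBR)'.

    Simplifying assumption: the full form is the run of words directly
    before the parenthesis whose initials spell the abbreviation.
    """
    pairs = []
    for match in PAREN.finditer(text):
        abbr = match.group(1)
        words = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        candidate = words[-len(abbr):]
        initials = "".join(w[0] for w in candidate).lower()
        if initials == abbr.lower():
            pairs.append((abbr, " ".join(candidate)))
    return pairs


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class SenseClassifier:
    """Naive Bayes over abstract words; labels come from find_definitions,
    so no manual annotation is needed."""

    def __init__(self):
        self.sense_counts = defaultdict(Counter)  # abbr -> full form -> abstracts
        self.word_counts = defaultdict(Counter)   # (abbr, full form) -> word counts
        self.vocab = set()

    def train(self, abstracts):
        for text in abstracts:
            tokens = tokenize(text)
            self.vocab.update(tokens)
            for abbr, full in find_definitions(text):
                full = full.lower()
                self.sense_counts[abbr][full] += 1
                self.word_counts[(abbr, full)].update(tokens)

    def predict(self, abbr, context):
        """Return the most probable full form, or None if abbr is not covered."""
        senses = self.sense_counts.get(abbr)
        if not senses:
            return None
        total = sum(senses.values())
        tokens = tokenize(context)
        best, best_score = None, -math.inf
        for full, n in senses.items():
            counts = self.word_counts[(abbr, full)]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)
            for token in tokens:
                # Add-one smoothing so unseen words do not zero out a sense.
                score += math.log((counts[token] + 1) / denom)
            if score > best_score:
                best, best_score = full, score
        return best


# Toy training abstracts (invented for illustration only).
ABSTRACTS = [
    "Patients with Alzheimer disease (AD) showed memory loss and dementia.",
    "Children with atopic dermatitis (AD) had itchy skin and eczema.",
]

clf = SenseClassifier()
clf.train(ABSTRACTS)
print(clf.predict("AD", "AD patients had progressive memory loss"))
print(clf.predict("AD", "AD with itchy skin"))
```

In this toy setting, an abbreviation with no entry in the generated dictionary is returned as `None`, which loosely corresponds to an abbreviation falling outside the method's coverage.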
