Abbreviation definition identification based on automatic precision estimates

BackgroundThe rapid growth of biomedical literature presents challenges for automatic text processing, and one of the challenges is abbreviation identification. The presence of unrecognized abbreviations in text hinders indexing algorithms and adversely affects information retrieval and extraction. Automatic abbreviation definition identification can help resolve these issues. However, abbreviations and their definitions identified by an automatic process are of uncertain validity. Due to the size of databases such as MEDLINE only a small fraction of abbreviation-definition pairs can be examined manually. An automatic way to estimate the accuracy of abbreviation-definition pairs extracted from text is needed. In this paper we propose an abbreviation definition identification algorithm that employs a variety of strategies to identify the most probable abbreviation definition. In addition our algorithm produces an accuracy estimate, pseudo-precision, for each strategy without using a human-judged gold standard. The pseudo-precisions determine the order in which the algorithm applies the strategies in seeking to identify the definition of an abbreviation.ResultsOn the Medstract corpus our algorithm produced 97% precision and 85% recall which is higher than previously reported results. We also annotated 1250 randomly selected MEDLINE records as a gold standard. On this set we achieved 96.5% precision and 83.2% recall. This compares favourably with the well known Schwartz and Hearst algorithm.ConclusionWe developed an algorithm for abbreviation identification that uses a variety of strategies to identify the most probable definition for an abbreviation and also produces an estimated accuracy of the result. This process is purely automatic.

[1]  Alan R. Aronson,et al.  Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program , 2001, AMIA.

[2]  Neil R. Smalheiser,et al.  ADAM: another database of abbreviations in MEDLINE , 2006, Bioinform..

[3]  James Pustejovsky,et al.  Automatic Extraction of Acronym-meaning Pairs from MEDLINE Databases , 2001, MedInfo.

[4]  Russ B. Altman,et al.  Research Paper: Creating an Online Dictionary of Abbreviations from MEDLINE , 2002, J. Am. Medical Informatics Assoc..

[5]  George Hripcsak,et al.  Mapping abbreviations to full forms in biomedical articles. , 2002, Journal of the American Medical Informatics Association : JAMIA.

[6]  Youngja Park,et al.  Hybrid Text Mining for Finding Abbreviations and their Definitions , 2001, EMNLP.

[7]  Carol Friedman,et al.  A broad-coverage natural language processing system , 2000, AMIA.

[8]  Thomas C. Rindflesch,et al.  MedPost: a part-of-speech tagger for bioMedical text , 2004, Bioinform..

[9]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[10]  Philipp Koehn,et al.  Proceedings of the 2001 Conference on Empirical Methods in Natural Language Processing , 2001, EMNLP 2001.

[11]  C. Federiuk The effect of abbreviations on MEDLINE searching. , 1999, Academic emergency medicine : official journal of the Society for Academic Emergency Medicine.

[12]  T O Cheng,et al.  Acronyms of clinical trials in cardiology--1998. , 1999, American heart journal.

[13]  Hongfang Liu,et al.  Mining Terminological Knowledge in Large Biomedical Corpora , 2003, Pacific Symposium on Biocomputing.

[14]  T. Takagi,et al.  Toward information extraction: identifying protein names from biological papers. , 1998, Pacific Symposium on Biocomputing. Pacific Symposium on Biocomputing.

[15]  Toshihisa Takagi,et al.  PNAD-CSS: a workbench for constructing a protein name abbreviation dictionary , 2000, Bioinform..