A Comparative Study of Supervised Learning as Applied to Acronym Expansion in Clinical Reports

Electronic medical records (EMRs) are a valuable source of patient-specific information and are increasingly used in clinical practice and research. Acronyms make it harder to retrieve information from EMRs because many of them are ambiguous with respect to their full form. In this paper we present a comparative study of supervised acronym disambiguation in a corpus of clinical notes, using three machine learning algorithms: the naïve Bayes classifier, decision trees, and support vector machines (SVMs). Our training features are part-of-speech tags, unigrams, and bigrams drawn from the context of the ambiguous acronym. We find that combining these feature types yields consistently higher accuracy than using any one of them alone, regardless of the learning algorithm. With all features combined, the accuracy of all three methods consistently approaches or exceeds 90%, even when the accuracy of the majority-class baseline is below 50%.
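As a rough illustration of the unigram and bigram context features and the majority-class baseline mentioned above, the following Python sketch may help. It is not the authors' implementation: the function names, the window size of 2, the `<ACR>` placeholder, and the example sentence and sense labels are all assumptions made for illustration, and part-of-speech features are left out because they would require a tagger.

```python
from collections import Counter


def context_features(tokens, index, window=2):
    """Return unigram and bigram features from a window around tokens[index].

    Hypothetical helper: the window size and feature naming are assumptions,
    not details taken from the paper.
    """
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    # Replace the acronym itself with a placeholder so that only its
    # context, not its surface form, contributes features.
    left = [t.lower() for t in tokens[lo:index]]
    right = [t.lower() for t in tokens[index + 1:hi]]
    ctx = left + ["<ACR>"] + right
    # Unigrams: every context word except the placeholder.
    feats = {f"uni={t}" for t in ctx if t != "<ACR>"}
    # Bigrams: adjacent pairs in the window, including those that
    # touch the acronym position.
    feats |= {f"bi={a}_{b}" for a, b in zip(ctx, ctx[1:])}
    return feats


def majority_baseline(labels):
    """Return the most frequent sense and the accuracy of always guessing it."""
    sense, count = Counter(labels).most_common(1)[0]
    return sense, count / len(labels)


# Illustrative data only; not drawn from the paper's corpus.
tokens = "patient has a history of RA and joint pain".split()
features = context_features(tokens, tokens.index("RA"))
senses = ["rheumatoid arthritis", "right atrium", "rheumatoid arthritis"]
baseline_sense, baseline_accuracy = majority_baseline(senses)
```

Feature sets of this kind could then be passed to any of the three classifiers; the majority baseline gives the accuracy a classifier would reach by always predicting the most common expansion.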
