A Language Modeling Approach for Acronym Expansion Disambiguation

Nonstandard words such as proper nouns, abbreviations, and acronyms are a major obstacle in natural language text processing and information retrieval. Acronyms, in particular, are difficult to read and process because they are often domain-specific with high degree of polysemy. In this paper, we propose a language modeling approach for the automatic disambiguation of acronym senses using context information. First, a dictionary of all possible expansions of acronyms is generated automatically. The dictionary is used to search for all possible expansions or senses to expand a given acronym. The extracted dictionary consists of about 17 thousands acronym-expansion pairs defining 1,829 expansions from different fields where the average number of expansions per acronym was 9.47. Training data is automatically collected from downloaded documents identified from the results of search engine queries. The collected data is used to build a unigram language model that models the context of each candidate expansion. At the in-context expansion prediction phase, the relevance of acronym expansion candidates is calculated based on the similarity between the context of each specific acronym occurrence and the language model of each candidate expansion. Unlike other work in the literature, our approach has the option to reject to expand an acronym if it is not confident on disambiguation. We have evaluated the performance of our language modeling approach and compared it with tf-idf discriminative approach.

[1]  Waleed Ammar,et al.  ICE-TEA: In-Context Expansion and Translation of English Abbreviations , 2011, CICLing.

[2]  Takenobu Tokunaga,et al.  Automatic expansion of abbreviations by using context and character information , 2004, Inf. Process. Manag..

[3]  Manuel Zahariev Automatic sense disambiguation for acronyms , 2004, SIGIR '04.

[4]  W. Bruce Croft,et al.  A Language Modeling Approach to Information Retrieval , 1998, SIGIR Forum.

[5]  Marti A. Hearst,et al.  A Simple Algorithm for Identifying Abbreviation Definitions in Biomedical Text , 2002, Pacific Symposium on Biocomputing.

[6]  Mark Stevenson,et al.  Disambiguation of Biomedical Abbreviations , 2009, BioNLP@HLT-NAACL.

[7]  Xuedong Huang,et al.  Improved topic-dependent language modeling using information retrieval techniques , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[8]  Roberto Navigli,et al.  Word sense disambiguation: A survey , 2009, CSUR.

[9]  Hong Yu,et al.  A large scale, corpus-based approach for automatically disambiguating biomedical abbreviations , 2006, TOIS.

[10]  James C. Bezdek,et al.  An Integrated Framework for Generalized Nearest Prototype Classifier Design , 1998, Int. J. Uncertain. Fuzziness Knowl. Based Syst..

[11]  Silviu Cucerzan,et al.  Acronym-Expansion Recognition and Ranking on the Web , 2007, 2007 IEEE International Conference on Information Reuse and Integration.

[12]  Kazem Taghva,et al.  Recognizing acronyms and their definitions , 1999, International Journal on Document Analysis and Recognition.

[13]  Dietrich Rebholz-Schuhmann,et al.  BIOINFORMATICS ORIGINAL PAPER Data and text mining Resolving abbreviations to their senses in Medline , 2005 .