Knowledge-based and knowledge-lean methods combined in unsupervised word sense disambiguation

Word sense disambiguation (WSD) is an intermediate task within information retrieval and information extraction that attempts to select the proper sense of an ambiguous word. For instance, the word "cold" could refer either to low temperature or to a viral infection. Because training data are scarce, knowledge-based and knowledge-lean methods have received attention as disambiguation methods. Knowledge-based methods compare the context of the ambiguous word to the information available in a terminological resource, although the main purpose of such resources is not word sense disambiguation. Knowledge-lean unsupervised methods rely on term distributions instead of a resource enumerating the possible senses, but they might be inappropriate when there is a requirement to commit to a terminological resource as a catalog of candidate senses. We present preliminary results for a combination of knowledge-based and knowledge-lean unsupervised methods, which improves the performance of knowledge-based methods by between 3% and 8%. The evaluation is done on a new word sense disambiguation data set, which is available to the community.
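
To make the knowledge-based idea concrete, the following is a minimal sketch of comparing the context of an ambiguous word with the descriptions of its candidate senses, in the spirit of gloss-overlap approaches. It is not the method evaluated here: the sense inventory `SENSES`, the stopword list, and the function names are hypothetical, and a real system would take its sense descriptions from a terminological resource instead of hand-written glosses.

```python
import re

# Hypothetical toy sense inventory for the word "cold"; a real system would
# read these descriptions from a terminological resource.
SENSES = {
    "low temperature": "condition of low temperature weather or environment, absence of heat",
    "viral infection": "common viral infection of the upper respiratory tract with cough, sneezing and runny nose",
}

# Small illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "the", "of", "or", "with", "in", "to", "is", "was", "for", "from", "had"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword word tokens."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def disambiguate(context, senses=SENSES):
    """Score each candidate sense by word overlap between context and gloss.

    Returns the best-scoring sense and the full score table. On a tie,
    the sense listed first in `senses` wins.
    """
    ctx = tokens(context)
    scores = {sense: len(ctx & tokens(gloss)) for sense, gloss in senses.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    for sentence in (
        "The patient had a cold with cough and runny nose",
        "The weather was cold and the temperature stayed low",
    ):
        sense, scores = disambiguate(sentence)
        print(sentence, "->", sense, scores)
```

Because the candidate senses come from a fixed inventory, this kind of method always commits to a sense in the catalog. A knowledge-lean method would instead group occurrences of the word by their term distributions, without reference to such an inventory.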
