Raw Corpus Word Sense Disambiguation

A wide range of approaches has been applied to word sense disambiguation. However, most require manually crafted knowledge such as annotated text, machine-readable dictionaries or thesauri, semantic networks, or aligned bilingual corpora. Reliance on these knowledge sources limits portability, since they generally exist only for selected domains and languages. This poster presents a corpus-based approach in which multiple usages of an ambiguous word are divided into a specified number of sense groups based strictly on features that are automatically obtained from the immediately surrounding raw text.

We are given N sentences, each of which contains a usage of a particular ambiguous word. Each sentence is converted into a feature vector (F1, F2, ..., Fn, S), where (F1, ..., Fn) represent the observed contextual properties of the sentence and S represents the unobserved sense of the ambiguous word. A probabilistic model is built from this data. First, a parametric form that describes the interactions among the observed contextual features and the unknown sense is specified. We use the form commonly known as Naive Bayes because of its favorable performance in previous studies of supervised disambiguation (e.g., Gale et al., 1992; Mooney, 1996; Ng, 1997). The Naive Bayes model, when applied to disambiguation, implies that all contextual features are conditionally independent given the sense of the ambiguous word:

$$p(F_1, F_2, \ldots, F_n, S) = p(S) \prod_{i=1}^{n} p(F_i \mid S)$$
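
The conditional independence assumption can be made concrete with a short Python sketch. It only evaluates the Naive Bayes joint probability and the resulting distribution over senses for one feature vector. The function names, the sense labels "s1" and "s2", the two features, and every probability value are hypothetical and chosen for illustration. The sketch assumes the parameters are already available and does not show how they would be estimated when the sense is unobserved.

```python
from math import prod

def joint_probability(features, sense, prior, conditional):
    """Naive Bayes joint: p(F1..Fn, S) = p(S) * product over i of p(Fi | S)."""
    return prior[sense] * prod(
        conditional[sense][i][value] for i, value in enumerate(features)
    )

def sense_posterior(features, prior, conditional):
    """Normalize the joint over all senses to obtain p(S | F1..Fn)."""
    joint = {s: joint_probability(features, s, prior, conditional) for s in prior}
    total = sum(joint.values())
    return {s: p / total for s, p in joint.items()}

# Hypothetical parameters (not taken from the source): two senses and two
# contextual features, a nearby content word and the part of speech of the
# following word.
prior = {"s1": 0.6, "s2": 0.4}
conditional = {
    "s1": [{"money": 0.7, "river": 0.3}, {"noun": 0.5, "verb": 0.5}],
    "s2": [{"money": 0.2, "river": 0.8}, {"noun": 0.9, "verb": 0.1}],
}

posterior = sense_posterior(("river", "noun"), prior, conditional)
best_sense = max(posterior, key=posterior.get)
```

Because each feature contributes its own factor p(Fi | S), the number of parameters grows linearly with the number of features, which is one practical consequence of the independence assumption.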