Finding Predominant Word Senses in Untagged Text

In word sense disambiguation (WSD), the heuristic of choosing the most common sense is extremely powerful because the distribution of the senses of a word is often skewed. The problem with the predominant (or first-sense) heuristic, aside from the fact that it ignores the surrounding context, is that it assumes some quantity of hand-tagged data. Whilst a few hand-tagged corpora are available for some languages, one would expect the frequency distribution of the senses of words, particularly topical words, to depend on the genre and domain of the text under consideration. We present work on the use of a thesaurus acquired from raw textual corpora, together with the WordNet similarity package, to find predominant noun senses automatically. The acquired predominant senses give a precision of 64% on the nouns of the SENSEVAL-2 English all-words task. This is a very promising result given that our method does not require any hand-tagged text, such as SemCor. Furthermore, we demonstrate that our method discovers appropriate predominant senses for words from two domain-specific corpora.
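To make the idea concrete, here is a minimal sketch of how distributional neighbours from a thesaurus might be combined with a sense-level similarity measure to rank the senses of a word. The abstract does not give the scoring formula, so the weighting scheme below is an assumption, and the function name `rank_senses`, the sense labels, and all the numbers are hypothetical toy values rather than results from the paper.

```python
from collections import defaultdict


def rank_senses(neighbours, sense_sim, senses):
    """Rank candidate senses of a target word (illustrative assumption).

    neighbours: list of (neighbour_word, distributional_similarity) pairs,
                standing in for entries of an automatically acquired thesaurus
    sense_sim:  dict mapping (sense, neighbour_word) -> similarity score,
                standing in for a WordNet-based similarity measure
    senses:     list of candidate sense labels for the target word
    """
    scores = defaultdict(float)
    for neighbour, dist_sim in neighbours:
        # Each neighbour shares its distributional weight among the senses,
        # in proportion to how similar each sense is to that neighbour.
        total = sum(sense_sim.get((s, neighbour), 0.0) for s in senses)
        if total == 0.0:
            continue  # this neighbour says nothing about any sense
        for s in senses:
            scores[s] += dist_sim * sense_sim.get((s, neighbour), 0.0) / total
    ranking = sorted(senses, key=lambda s: scores[s], reverse=True)
    return ranking, dict(scores)


# Hypothetical toy data for the noun "star" (not taken from the paper).
senses = ["star/celestial", "star/celebrity"]
neighbours = [("galaxy", 0.5), ("planet", 0.4), ("actor", 0.2)]
sense_sim = {
    ("star/celestial", "galaxy"): 0.8, ("star/celebrity", "galaxy"): 0.2,
    ("star/celestial", "planet"): 0.6, ("star/celebrity", "planet"): 0.2,
    ("star/celestial", "actor"): 0.1, ("star/celebrity", "actor"): 0.9,
}

ranking, scores = rank_senses(neighbours, sense_sim, senses)
print(ranking[0])  # the sense this sketch treats as predominant
```

Because the neighbours come from the corpus at hand, running the same ranking over a domain-specific corpus would be expected to shift the weights, which is the intuition behind the domain experiments mentioned above.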
