Automatic Association of Web Directories with Word Senses

We describe an algorithm that combines lexical information (from WordNet 1.7) with Web directories (from the Open Directory Project) to associate word senses with such directories. Such associations can be used as rich characterizations to acquire sense-tagged corpora automatically, cluster topically related senses, and detect sense specializations. The algorithm is evaluated for the 29 nouns (147 senses) used in the Senseval 2 competition, obtaining 148 (word sense, Web directory) associations covering 88% of the domain-specific word senses in the test data with 86% accuracy. The richness of Web directories as sense characterizations is evaluated in a supervised word sense disambiguation task using the Senseval 2 test suite. The results indicate that, when the directory/word sense association is correct, the samples automatically acquired from the Web directories are nearly as valid for training as the original Senseval 2 training instances. The results support our hypothesis that Web directories are a rich source of lexical information: cleaner, more reliable, and more structured than the full Web as a corpus.
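The abstract does not spell out how senses and directories are matched, so the following is only a toy illustration of the general idea, not the authors' algorithm. It assumes each sense comes with a small set of clue words (a stand-in for WordNet synonyms or hypernyms) and that a directory is a slash-separated path of labels. The function `associate`, the sense identifiers, and the directory paths are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def path_labels(directory: str) -> Set[str]:
    """Split a directory path such as 'Top/Arts/Music' into lowercase labels."""
    return {label.lower() for label in directory.split("/") if label}


def associate(
    word: str,
    sense_clues: Dict[str, Set[str]],
    directories: List[str],
) -> List[Tuple[str, str]]:
    """Pair each directory whose path mentions `word` with its best-matching sense.

    Assumption: a sense matches a directory when its clue words overlap the
    path labels. Directories with no overlap, or with a tie between senses,
    are left unassociated.
    """
    pairs = []
    for directory in directories:
        labels = path_labels(directory)
        if word.lower() not in labels:
            continue  # the directory path never mentions the target word
        scores = {sense: len(labels & clues) for sense, clues in sense_clues.items()}
        best = max(scores.values(), default=0)
        winners = [sense for sense, score in scores.items() if score == best]
        if best > 0 and len(winners) == 1:
            pairs.append((winners[0], directory))
    return pairs


# Hypothetical clue words and directory paths, for illustration only.
sense_clues = {
    "circuit#electrical": {"electronics", "electrical", "engineering"},
    "circuit#racing": {"racing", "sports", "motorsport"},
}
directories = [
    "Top/Business/Electronics/Circuit",
    "Top/Sports/Motorsport/Circuit",
    "Top/Arts/Music",
    "Top/Shopping/Circuit",
]
pairs = associate("circuit", sense_clues, directories)
```

In this toy run, the first two directories are each paired with one sense, while the last two are skipped: one never mentions the word and the other offers no evidence for either sense. Leaving ambiguous directories unassociated trades coverage for accuracy, which are the two quantities the abstract reports.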
