An Unsupervised Method for Word Sense Tagging using Parallel Corpora

We present an unsupervised method for word sense disambiguation that exploits translation correspondences in parallel corpora. The technique relies on the observation that cross-language lexicalizations of the same concept tend to be consistent, preserving some core element of its semantics, yet also variable, reflecting differing translator preferences and the influence of context. Working with parallel corpora complicates evaluation, since it is difficult to find a corpus that is both sense-tagged and parallel with another language; we therefore use pseudo-translations, created by machine translation systems, so that the approach can be evaluated against a standard test set. The results demonstrate that word-level translation correspondences are a valuable source of information for sense disambiguation.
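
The intuition in the abstract can be illustrated with a toy sketch. This is not the authors' algorithm. It assumes a hypothetical list of word-aligned (source, target) pairs, a hypothetical sense inventory `SENSES` in which each sense is described by a small feature set, and Jaccard overlap as a stand-in similarity measure. Source words that share a translation are grouped, and each word receives the sense that is best supported by the other words in its group.

```python
from collections import defaultdict

# Hypothetical toy sense inventory (an assumption for illustration only):
# each sense label maps to a set of coarse semantic features.
SENSES = {
    "bank": {
        "bank#river": {"land", "water", "edge"},
        "bank#finance": {"institution", "money"},
    },
    "shore": {
        "shore#coast": {"land", "water", "edge"},
        "shore#support": {"beam", "support"},
    },
    "coast": {
        "coast#seaside": {"land", "water", "edge", "sea"},
    },
}


def jaccard(a, b):
    """Stand-in similarity: overlap of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_translation(aligned_pairs):
    """Collect the source words aligned to each target-language word."""
    groups = defaultdict(set)
    for source_word, target_word in aligned_pairs:
        groups[target_word].add(source_word)
    return dict(groups)


def tag_group(words, senses=SENSES):
    """Pick, for each word, the sense best supported by the rest of the group."""
    tags = {}
    for word in words:
        others = [w for w in words if w != word and w in senses]
        if word not in senses or not others:
            continue  # no evidence from other words: leave untagged

        def support(sense):
            features = senses[word][sense]
            # Sum, over the other words, of the best match to any of their senses.
            return sum(
                max(jaccard(features, f) for f in senses[other].values())
                for other in others
            )

        # sorted() makes tie-breaking deterministic.
        tags[word] = max(sorted(senses[word]), key=support)
    return tags


# Hypothetical word alignments (English source word, French target word).
aligned = [
    ("bank", "rive"),
    ("shore", "rive"),
    ("coast", "rive"),
    ("bank", "banque"),
]
groups = group_by_translation(aligned)
rive_tags = tag_group(groups["rive"])
banque_tags = tag_group(groups["banque"])
```

In this toy data, the words translated as "rive" reinforce the river-side sense of "bank", while the group for "banque" contains a single word and so offers no evidence; that word is left untagged. A realistic system would need a real sense inventory, a principled similarity measure, and alignments produced by a statistical alignment model.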
