Paper information - Building a Bilingual Representation of the Roget Thesaurus for French to English Machine Translation

Building a Bilingual Representation of the Roget Thesaurus for French to English Machine Translation

This paper describes a solution to lexical transfer that sits between a dictionary and an ontology, and shows how it is coupled with a translation tool based on morpho-syntactic parsing of the source language. The solution relies on the English Roget Thesaurus and its French equivalent, the Larousse Thesaurus, within a computational framework. Both thesauri are transformed into vector spaces, and all monolingual entries are represented as vectors, with 1000 components for English and 873 for French. The indexing concepts of each thesaurus form the generating family of the corresponding vector space. A bilingual data structure transforms French entries into vectors in the English space by using the representations of their English equivalents. Word sense disambiguation consists of choosing the appropriate vector among these 'bilingual' vectors: the contextualized vector of a given word is computed from its source sentence and mapped into the English vector space, and the closest entry is then selected among the entries in the bilingual data structure that begin with the same source string (i.e. the same French word). The process was tested on a 20,000-word extract of a French novel, Le Petit Prince, and the lexical transfer results were found quite encouraging, with a recall of 86% and a precision of 71%.
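The selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which distance measure is used or how the contextualized vector is built, so the sketch assumes cosine similarity and a plain sum of neighbouring word vectors. The 4-component toy space, the `BILINGUAL` table, and the function names are all hypothetical stand-ins for the 1000-component English space and the bilingual data structure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (assumed closeness measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def context_vector(vectors):
    """Sum the vectors of the surrounding words (assumed contextualization)."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) for i in range(dim)]

def choose_translation(french_word, context, bilingual):
    """Pick the English equivalent whose vector is closest to the context."""
    candidates = bilingual[french_word]
    return max(candidates, key=lambda c: cosine(context, c[1]))[0]

# Hypothetical toy English space with 4 concepts: [law, food, plants, justice].
# Each French entry lists its candidate English equivalents with their vectors.
BILINGUAL = {
    "avocat": [
        ("lawyer", [1.0, 0.0, 0.0, 1.0]),
        ("avocado", [0.0, 1.0, 1.0, 0.0]),
    ],
}

# Context words are assumed to be already expressed in the English space.
legal_context = context_vector([[1.0, 0.0, 0.0, 0.5], [0.5, 0.0, 0.0, 1.0]])
food_context = context_vector([[0.0, 1.0, 0.2, 0.0], [0.0, 0.5, 1.0, 0.0]])

print(choose_translation("avocat", legal_context, BILINGUAL))
print(choose_translation("avocat", food_context, BILINGUAL))
```

In this toy setting the ambiguous French word "avocat" resolves to a different English entry depending on the concepts activated by its sentence, which is the behaviour the abstract attributes to the bilingual vectors.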

Violaine Prince | Jacques Chauché
