Low-Cost Enrichment of Spanish WordNet with Automatically Translated Glosses: Combining General and Specialized Models

This paper studies the enrichment of Spanish WordNet with synset glosses obtained automatically from the English WordNet glosses using a phrase-based Statistical Machine Translation system. We build the English-Spanish translation system from a parallel corpus of European Parliament proceedings, and study how to adapt its statistical models to the domain of dictionary definitions. We build specialized language and translation models from a small set of parallel definitions and experiment with robust ways of combining them with the general models. The combination yields a statistically significant increase in performance. The best system is then used to generate a definition for every Spanish synset; these definitions are now ready for manual revision. As a complementary issue, we analyze how much in-domain data is needed to improve a system trained entirely on out-of-domain data.
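The abstract does not say how the general and specialized models are combined. As an illustration only, the sketch below assumes linear interpolation of two language models, one common way to mix an out-of-domain and an in-domain model. The function names, the toy probabilities, and the grid of candidate weights are all hypothetical and are not taken from the paper.

```python
import math


def interpolate(p_general, p_specialized, weight):
    """Linearly interpolate two probabilities assigned to the same token.

    `weight` is the share given to the specialized (in-domain) model.
    Assumption: linear interpolation stands in for whatever combination
    scheme the paper actually uses.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_specialized + (1.0 - weight) * p_general


def perplexity(token_probs):
    """Perplexity of a token sequence given its per-token probabilities."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def best_weight(general_probs, specialized_probs, candidates):
    """Pick the candidate weight that minimizes held-out perplexity."""
    def held_out_perplexity(weight):
        mixed = [
            interpolate(g, s, weight)
            for g, s in zip(general_probs, specialized_probs)
        ]
        return perplexity(mixed)

    return min(candidates, key=held_out_perplexity)


# Toy held-out data (hypothetical): each model is confident on a
# different token, so an even mixture beats either model alone.
general = [0.9, 0.01]
specialized = [0.01, 0.9]
chosen = best_weight(general, specialized, [0.0, 0.5, 1.0])
```

Tuning the weight on held-out definitions rather than on the training data keeps the small in-domain set from dominating the mixture by overfitting.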
