Continuous space language models for the IWSLT 2006 task

The language model of the target language plays an important role in statistical machine translation systems. In this work, we propose to use a new statistical language model that is based on a continuous representation of the words in the vocabulary. A neural network is used to perform both the projection and the probability estimation. This kind of approach is particularly promising for tasks where only a very limited amount of resources is available, such as the BTEC corpus of tourism-related questions. The language model is used in two state-of-the-art statistical machine translation systems that were developed by UPC for the 2006 IWSLT evaluation campaign: a phrase-based and an n-gram-based approach. An experimental evaluation is provided for four language pairs (translation of Mandarin, Japanese, Arabic and Italian to English). The proposed method achieved improvements in the BLEU score of up to 3 points on the development data and of almost 2 points on the official test data.
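
The idea of letting one neural network perform both the projection and the probability estimation can be illustrated with a minimal sketch. The code below is not the system evaluated here: it is an untrained toy model with random weights, and the class name, layer sizes, trigram order and example vocabulary are all assumptions chosen for illustration. It shows only the forward pass: each context word is mapped to a continuous vector, the vectors are concatenated, and a hidden layer followed by a softmax yields a probability for every word in the vocabulary.

```python
import math
import random


def make_matrix(rows, cols, rng, scale=0.1):
    """Return a rows x cols matrix of small random weights (illustrative only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyContinuousSpaceLM:
    """Hypothetical toy model: projection -> tanh hidden layer -> softmax."""

    def __init__(self, vocab, order=3, embed_dim=4, hidden_dim=8, seed=0):
        rng = random.Random(seed)
        self.vocab = list(vocab)
        self.index = {word: i for i, word in enumerate(self.vocab)}
        self.context_size = order - 1
        # One continuous vector per vocabulary word (the shared projection).
        self.embeddings = make_matrix(len(self.vocab), embed_dim, rng)
        self.hidden = make_matrix(hidden_dim, self.context_size * embed_dim, rng)
        self.output = make_matrix(len(self.vocab), hidden_dim, rng)

    def next_word_distribution(self, context):
        """Return P(w | context) for every word w in the vocabulary."""
        if len(context) != self.context_size:
            raise ValueError("context must contain exactly order - 1 words")
        # Projection: look up and concatenate the context word vectors.
        projected = []
        for word in context:
            projected.extend(self.embeddings[self.index[word]])
        # Probability estimation: hidden layer, then softmax over the vocabulary.
        hidden = [math.tanh(a) for a in matvec(self.hidden, projected)]
        probs = softmax(matvec(self.output, hidden))
        return dict(zip(self.vocab, probs))


# Assumed toy vocabulary in the spirit of tourism-related questions.
vocab = ["where", "is", "the", "station", "hotel"]
model = TinyContinuousSpaceLM(vocab)
dist = model.next_word_distribution(["is", "the"])
```

Because the weights are random, the resulting distribution is close to uniform and carries no linguistic information. In a real setting the projection and the other layers would be trained jointly on the target-language text, and the resulting probabilities would be used as a language model score inside the translation system.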
