Exploiting alignment techniques in MATREX: the DCU machine translation system for IWSLT 2008

In this paper, we describe the machine translation (MT) system developed at DCU that was used for our third participation in the evaluation campaign of the International Workshop on Spoken Language Translation (IWSLT 2008). In this participation, we focus on various techniques for word and phrase alignment to improve system quality. Specifically, we apply our word packing and syntax-enhanced word alignment techniques to the Chinese-English task and to the English-Chinese task for the first time. For all translation tasks except Arabic-English, we exploit linguistically motivated bilingual phrase pairs extracted from parallel treebanks. We smooth our translation tables with out-of-domain word translations for the Arabic-English and Chinese-English tasks in order to reduce the high number of out-of-vocabulary items. We also carry out experiments combining in-domain and out-of-domain data to improve system performance and, finally, we deploy a majority voting procedure combining a language model-based method and a translation-based method for case and punctuation restoration. We participated in all the translation tasks and translated both the single-best ASR hypotheses and the correct recognition results. The translation results confirm that our new word and phrase alignment techniques are often helpful in improving translation quality, and that the data combination method we propose can significantly improve system performance.
