Low Cost Portability for Statistical Machine Translation based on N-gram Coverage

Statistical machine translation relies heavily on the available training data. In some cases, however, the amount of training data that can be created for a system, or actually used by it, must be limited. To address this problem, we introduce a weighting scheme that selects the more informative sentences first. The selection is based on the previously unseen n-grams each sentence contains, which allows us to sort the sentences by their estimated importance. After sorting, we can construct smaller training corpora, and we demonstrate that systems trained on much less data achieve performance that is very competitive with baseline systems trained on all available data.
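The selection idea can be illustrated with a minimal sketch. The abstract does not give the actual weighting formula, so the score used here (number of previously unseen n-grams divided by sentence length), the greedy re-scoring loop, whitespace tokenization, and the names `ngrams` and `rank_by_unseen_ngrams` are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Set, Tuple


def ngrams(tokens: List[str], max_n: int) -> Set[Tuple[str, ...]]:
    """Return the set of all n-grams of order 1..max_n in a token list."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def rank_by_unseen_ngrams(sentences: List[str], max_n: int = 2) -> List[int]:
    """Greedily order sentence indices, most informative first.

    ASSUMPTION: the score is (number of n-grams not yet covered) divided by
    (sentence length in tokens). This is a hypothetical weighting chosen for
    the sketch; the paper's own weights may differ.
    """
    seen: Set[Tuple[str, ...]] = set()
    remaining = [(i, s.split()) for i, s in enumerate(sentences)]
    order: List[int] = []

    def score(item: Tuple[int, List[str]]) -> float:
        _, tokens = item
        if not tokens:
            return 0.0
        return len(ngrams(tokens, max_n) - seen) / len(tokens)

    while remaining:
        # Re-score every remaining sentence against the current coverage.
        best = max(remaining, key=score)  # ties go to the earliest sentence
        remaining.remove(best)
        seen |= ngrams(best[1], max_n)
        order.append(best[0])
    return order


corpus = ["a b", "a b", "c d e"]
ranking = rank_by_unseen_ngrams(corpus)
# Keep only the top-ranked sentences to form a smaller training corpus.
reduced_corpus = [corpus[i] for i in ranking[:2]]
```

In this sketch a duplicate sentence drops to the end of the ranking because it contributes no new n-grams once its twin has been selected. The loop re-scores all remaining sentences at every step, so it is quadratic in corpus size; a real corpus would likely need a more efficient update strategy.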
