Statistical Machine Translation with Word- and Sentence-Aligned Parallel Corpora

The parameters of statistical translation models are typically estimated from sentence-aligned parallel corpora. We show that significant improvements in the alignment and translation quality of such models can be achieved by additionally including word-aligned data during training. Incorporating word-level alignments into the parameter estimation of the IBM models reduces alignment error rate and increases the Bleu score when compared to training the same models only on sentence-aligned data. On the Verbmobil data set, we attain a 38% reduction in the alignment error rate and a higher Bleu score with half as many training examples. We discuss how varying the ratio of word-aligned to sentence-aligned data affects the expected performance gain.
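The idea of estimating translation parameters from both kinds of data can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes an IBM Model 1 style lexical table t(f|e), takes link counts from word-aligned pairs as observed, collects expected counts from sentence-aligned pairs with EM, and pools the two with a hypothetical interpolation parameter `weight`. The function name `train_mixed_model1`, the count-pooling scheme, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def train_mixed_model1(sent_pairs, aligned_pairs, iterations=5, weight=0.5):
    """Estimate a lexical table t[(f, e)] ~ p(f | e) from two kinds of data.

    sent_pairs:    list of (e_tokens, f_tokens), sentence-aligned only
    aligned_pairs: list of (e_tokens, f_tokens, links), where links is a set
                   of (i, j) meaning e_tokens[i] is aligned to f_tokens[j]
    weight:        hypothetical share given to counts from word-aligned data
                   (an assumption of this sketch, not a value from the paper)
    """
    # Observed link counts from the word-aligned data; fixed across iterations.
    fixed = defaultdict(float)
    for e_tokens, f_tokens, links in aligned_pairs:
        for i, j in links:
            fixed[(f_tokens[j], e_tokens[i])] += 1.0

    # Constant initialisation over co-occurring pairs; the E-step normalises
    # within each sentence, so any positive constant acts as a uniform start.
    t = {(f, e): 1.0 for e_tokens, f_tokens in sent_pairs
         for f in f_tokens for e in e_tokens}

    for _ in range(iterations):
        # E-step: expected link counts on the sentence-aligned data.
        expected = defaultdict(float)
        for e_tokens, f_tokens in sent_pairs:
            for f in f_tokens:
                norm = sum(t.get((f, e), 0.0) for e in e_tokens)
                if norm == 0.0:
                    continue  # no usable probability mass for this word
                for e in e_tokens:
                    expected[(f, e)] += t.get((f, e), 0.0) / norm

        # M-step: pool weighted counts, then normalise per conditioning word e.
        combined = defaultdict(float)
        for key, count in fixed.items():
            combined[key] += weight * count
        for key, count in expected.items():
            combined[key] += (1.0 - weight) * count
        totals = defaultdict(float)
        for (f, e), count in combined.items():
            totals[e] += count
        t = {(f, e): count / totals[e]
             for (f, e), count in combined.items() if count > 0.0}
    return t


# Toy data, invented purely for illustration.
sent_pairs = [
    (["the", "house"], ["das", "haus"]),
    (["the", "book"], ["das", "buch"]),
]
aligned_pairs = [
    (["a", "book"], ["ein", "buch"], {(0, 0), (1, 1)}),
]
table = train_mixed_model1(sent_pairs, aligned_pairs)
```

Setting `weight` to 0 in this sketch reduces it to EM on the sentence-aligned pairs alone, while larger values let the word-aligned links dominate the estimate. Pooling weighted counts before normalising is only one simple way to combine the two sources; it is chosen here because it always yields a properly normalised table.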