A Relationship: Word Alignment, Phrase Table, and Translation Quality

In recent years, several studies have evaluated machine translation quality in terms of the relationship between word alignments and the phrase table. However, existing methods usually rely on ad hoc heuristics without theoretical support, and no work so far has offered a formula that describes the relationship among word alignments, the phrase table, and machine translation performance. In this paper, we first formulate such a relationship in order to estimate the number of phrase pairs extracted given one or more word alignment points. Second, we propose a corpus-motivated pruning technique to reduce the large default phrase table. Experiments show that the derived formula is feasible: it can be used to predict the size of the phrase table, and it also serves as a useful reference for investigating how translation performance relates to phrase tables built from different sets of word alignment links. The corpus-motivated pruning results show that nearly 98% of the phrase pairs can be removed without any significant loss in translation quality.
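
To make the dependence of phrase table size on alignment points concrete, the following is a minimal sketch of the standard consistency-based phrase extraction heuristic used in phrase-based SMT. It is not the formula derived in this paper; the function name `extract_phrase_pairs`, the `max_len` limit, and the toy alignments are assumptions made only for illustration.

```python
from typing import Set, Tuple

Link = Tuple[int, int]   # (source index, target index)
Span = Tuple[int, int]   # inclusive (start, end)


def extract_phrase_pairs(src_len: int, tgt_len: int,
                         alignment: Set[Link],
                         max_len: int = 7) -> Set[Tuple[Span, Span]]:
    """Enumerate phrase pairs consistent with a word alignment (illustrative sketch)."""
    pairs: Set[Tuple[Span, Span]] = set()
    tgt_aligned = {j for _, j in alignment}
    for s1 in range(src_len):
        for s2 in range(s1, min(s1 + max_len, src_len)):
            # Target positions linked to the source span [s1, s2].
            linked = [j for i, j in alignment if s1 <= i <= s2]
            if not linked:
                continue  # a phrase pair needs at least one alignment point
            t1, t2 = min(linked), max(linked)
            if t2 - t1 + 1 > max_len:
                continue
            # Consistency: no target word in [t1, t2] may link outside [s1, s2].
            if any(t1 <= j <= t2 and not (s1 <= i <= s2) for i, j in alignment):
                continue
            # Grow the target span over neighbouring unaligned words.
            lo = t1
            while True:
                hi = t2
                while True:
                    if hi - lo + 1 <= max_len:
                        pairs.add(((s1, s2), (lo, hi)))
                    hi += 1
                    if hi >= tgt_len or hi in tgt_aligned:
                        break
                lo -= 1
                if lo < 0 or lo in tgt_aligned:
                    break
    return pairs


# Toy 3-word sentence pair: a full diagonal alignment versus a single link.
diagonal = {(0, 0), (1, 1), (2, 2)}
single = {(1, 1)}
print(len(extract_phrase_pairs(3, 3, diagonal)))  # 6
print(len(extract_phrase_pairs(3, 3, single)))    # 16
```

In this toy case the single-link alignment yields more phrase pairs than the fully aligned one, because unaligned words can be attached freely to neighbouring phrases. Under this heuristic, that is one way in which the number and placement of alignment points drive phrase table size.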
