Bilingually Motivated Word Segmentation for Statistical Machine Translation

We introduce a bilingually motivated word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Our approach is motivated from the insight that PB-SMT systems can be improved by optimizing the input representation to reduce the predictive power of translation models. We firstly present an approach to optimize the existing segmentation of both source and target languages for PB-SMT and demonstrate the effectiveness of this approach using a Chinese--English MT task, that is, to measure the influence of the segmentation on the performance of PB-SMT systems. We report a 5.44% relative increase in Bleu score and a consistent increase according to other metrics. We then generalize this method for Chinese word segmentation without relying on any segmenters and show that using our segmentation PB-SMT can achieve more consistent state-of-the-art performance across two domains. There are two main advantages of our approach. First of all, it is adapted to the specific translation task at hand by taking the corresponding source (target) language into account. Second, this approach does not rely on manually segmented training data so that it can be automatically adapted for different domains.

[1]  Yanjun Ma,et al.  Bootstrapping Word Alignment via Word Packing , 2007, ACL.

[2]  Michael Paul,et al.  Overview of the IWSLT06 evaluation campaign , 2006, IWSLT.

[3]  I. Dan Melamed Automatic Discovery of Non-Compositional Compounds in Parallel Data , 1997, EMNLP.

[4]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[5]  Yanjun Ma,et al.  Bilingually Motivated Word Segmentation for SMT , 2010 .

[6]  Hermann Ney,et al.  Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation , 2008, COLING.

[7]  Smaranda Muresan,et al.  Generalizing Word Lattice Translation , 2008, ACL.

[8]  Daniel Jurafsky,et al.  A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005 , 2005, IJCNLP.

[9]  Dekai Wu,et al.  Stochastic Inversion Transduction Grammars and Bilingual Parsing of Parallel Corpora , 1997, CL.

[10]  Alexander M. Fraser,et al.  Getting the Structure Right for Word Alignment: LEAF , 2007, EMNLP-CoNLL.

[11]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[12]  Philipp Koehn,et al.  Constraining the Phrase-Based, Joint Probability Statistical Translation Model , 2006, WMT@HLT-NAACL.

[13]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[14]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[15]  Qun Liu,et al.  HHMM-based Chinese Lexical Analyzer ICTCLAS , 2003, SIGHAN.

[16]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[17]  Scott M. Smith,et al.  Computer Intensive Methods for Testing Hypotheses: An Introduction , 1989 .

[18]  Yanjun Ma,et al.  Improving Word Alignment Using Syntactic Dependencies , 2008, SSST@ACL.

[19]  Jörg Tiedemann,et al.  Combining Clues for Word Alignment , 2003, EACL.

[20]  Hermann Ney,et al.  Do We Need Chinese Word Segmentation for Statistical Machine Translation? , 2004, SIGHAN@ACL.

[21]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[22]  Christopher D. Manning,et al.  Optimizing Chinese Word Segmentation for Machine Translation Performance , 2008, WMT@ACL.

[23]  Eiichiro Sumita,et al.  Improved Statistical Machine Translation by Multiple Chinese Word Segmentation , 2008, WMT@ACL.

[24]  Chilin Shih,et al.  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese , 1994, ACL.

[25]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[26]  Wolfgang Macherey,et al.  Lattice-based Minimum Error Rate Training for Statistical Machine Translation , 2008, EMNLP.

[27]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[28]  N. H. Beebe A Complete Bibliography of ACM Transactions on Asian Language Information Processing , 2007 .

[29]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[30]  Vasileios Hatzivassiloglou,et al.  Translating Collocations for Bilingual Lexicons: A Statistical Approach , 1996, CL.

[31]  William J. Byrne,et al.  HMM Word and Phrase Alignment for Statistical Machine Translation , 2005, HLT.

[32]  William J. Byrne,et al.  MTTK: An Alignment Toolkit for Statistical Machine Translation , 2006, NAACL.

[33]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[34]  Hermann Ney,et al.  Integrated Chinese Word Segmentation in Statistical Machine Translation , 2005, IWSLT.

[35]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[36]  Hermann Ney,et al.  Improved backing-off for M-gram language modeling , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[37]  K. J. Evans,et al.  Computer Intensive Methods for Testing Hypotheses: An Introduction , 1990 .

[38]  Eiichiro Sumita,et al.  Toward a Broad-coverage Bilingual Corpus for Speech Translation of Travel Conversations in the Real World , 2002, LREC.

[39]  I. Dan Melamed,et al.  Models of translation equivalence among words , 2000, CL.

[40]  Daniel Marcu,et al.  A Phrase-Based,Joint Probability Model for Statistical Machine Translation , 2002, EMNLP.