Bilingually Motivated Domain-Adapted Word Segmentation for Statistical Machine Translation

We introduce a word segmentation approach to languages where word boundaries are not orthographically marked, with application to Phrase-Based Statistical Machine Translation (PB-SMT). Instead of using manually segmented monolingual domain-specific corpora to train segmenters, we make use of bilingual corpora and statistical word alignment techniques. First of all, our approach is adapted for the specific translation task at hand by taking the corresponding source (target) language into account. Secondly, this approach does not rely on manually segmented training data so that it can be automatically adapted for different domains. We evaluate the performance of our segmentation approach on PB-SMT tasks from two domains and demonstrate that our approach scores consistently among the best results across different data conditions.

[1]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[2]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[3]  Yanjun Ma,et al.  Bootstrapping Word Alignment via Word Packing , 2007, ACL.

[4]  K. J. Evans,et al.  Computer Intensive Methods for Testing Hypotheses: An Introduction , 1990 .

[5]  Christopher D. Manning,et al.  Optimizing Chinese Word Segmentation for Machine Translation Performance , 2008, WMT@ACL.

[6]  Eiichiro Sumita,et al.  Improved Statistical Machine Translation by Multiple Chinese Word Segmentation , 2008, WMT@ACL.

[7]  Smaranda Muresan,et al.  Generalizing Word Lattice Translation , 2008, ACL.

[8]  Daniel Jurafsky,et al.  A Conditional Random Field Word Segmenter for Sighan Bakeoff 2005 , 2005, IJCNLP.

[9]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[10]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[11]  I. Dan Melamed,et al.  Models of translation equivalence among words , 2000, CL.

[12]  Scott M. Smith,et al.  Computer Intensive Methods for Testing Hypotheses: An Introduction , 1989 .

[13]  Chilin Shih,et al.  A Stochastic Finite-State Word-Segmentation Algorithm for Chinese , 1994, ACL.

[14]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[15]  William J. Byrne,et al.  HMM Word and Phrase Alignment for Statistical Machine Translation , 2005, HLT.

[16]  William J. Byrne,et al.  MTTK: An Alignment Toolkit for Statistical Machine Translation , 2006, NAACL.

[17]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[18]  Hermann Ney,et al.  Integrated Chinese Word Segmentation in Statistical Machine Translation , 2005, IWSLT.

[19]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[20]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[21]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[22]  Hermann Ney,et al.  Do We Need Chinese Word Segmentation for Statistical Machine Translation? , 2004, SIGHAN@ACL.

[23]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[24]  Qun Liu,et al.  HHMM-based Chinese Lexical Analyzer ICTCLAS , 2003, SIGHAN.