Unsupervised Tokenization for Machine Translation

Training a statistical machine translation starts with tokenizing a parallel corpus. Some languages such as Chinese do not incorporate spacing in their writing system, which creates a challenge for tokenization. Moreover, morphologically rich languages such as Korean present an even bigger challenge, since optimal token boundaries for machine translation in these languages are often unclear. Both rule-based solutions and statistical solutions are currently used. In this paper, we present unsupervised methods to solve tokenization problem. Our methods incorporate information available from parallel corpus to determine a good tokenization for machine translation.

[1]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[2]  Carl de Marcken Linguistic Structure as Composition and Perturbation , 1996, ACL.

[3]  I. Dan Melamed,et al.  Bitext Maps and Alignment via Pattern Recognition , 1999, CL.

[4]  Nianwen Xue,et al.  Developing Guidelines and Ensuring Consistency for Chinese Text Annotation , 2000, LREC.

[5]  John A. Goldsmith,et al.  Unsupervised Learning of the Morphology of a Natural Language , 2001, CL.

[6]  Robert C. Moore Fast and accurate sentence alignment of bilingual corpora , 2002, AMTA.

[7]  Hirofumi Yamamoto,et al.  Finding Translation Pairs from English-Japanese Untokenized Aligned Corpora , 2002, Speech-to-Speech Translation@ACL.

[8]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[9]  Matthew J. Beal Variational algorithms for approximate Bayesian inference , 2003 .

[10]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[11]  Nianwen Xu,et al.  Chinese Word Segmentation as Character Tagging , 2003, Int. J. Comput. Linguistics Chin. Lang. Process..

[12]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[13]  Thomas L. Griffiths,et al.  Contextual Dependencies in Unsupervised Word Segmentation , 2006, ACL.

[14]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[15]  Mark Johnson,et al.  Why Doesn’t EM Find Good HMM POS-Taggers? , 2007, EMNLP.

[16]  Wolfgang Macherey,et al.  Lattice-based Minimum Error Rate Training for Statistical Machine Translation , 2008, EMNLP.

[17]  Christopher D. Manning,et al.  Optimizing Chinese Word Segmentation for Machine Translation Performance , 2008, WMT@ACL.

[18]  Daniel Gildea,et al.  Bayesian Learning of Non-Compositional Phrases with Synchronous Parsing , 2008, ACL.

[19]  Hermann Ney,et al.  Bayesian Semi-Supervised Chinese Word Segmentation for Statistical Machine Translation , 2008, COLING.

[20]  Eiichiro Sumita,et al.  Improved Statistical Machine Translation by Multiple Chinese Word Segmentation , 2008, WMT@ACL.