Unsupervised Segmentation for Statistical Machine Translation

An unsupervised approach is applied to segment German-English and French-English parallel corpora for statistical machine translation. The approach requires neither language-specific nor domain-specific knowledge. Segmentation is shown to reduce the number of unknown words and singletons in the corpora, which helps improve the translation model. As a result, word error rates are lowered by 0.37% and 2.15% in the translation of German to English and French to English, respectively. The benefits of segmentation to statistical machine translation are more pronounced when the training data size is small.
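To make the two quantities named in the abstract concrete, the following is a minimal sketch of how singletons (word types seen exactly once in training) and unknown words (test word types never seen in training) might be counted before and after segmentation. The toy English sentences, the hand-made segmentation, and the helper name `corpus_stats` are illustrative assumptions only; they are not the paper's data or its segmentation method.

```python
from collections import Counter


def corpus_stats(train_sentences, test_sentences):
    """Return (singleton_types, unknown_types) for whitespace-tokenized text.

    Hypothetical helper for illustration; not taken from the paper.
    """
    train_counts = Counter(tok for s in train_sentences for tok in s.split())
    # Singletons: word types that occur exactly once in the training data.
    singletons = sum(1 for c in train_counts.values() if c == 1)
    # Unknown words: test word types that never occur in the training data.
    test_types = {tok for s in test_sentences for tok in s.split()}
    unknown = len(test_types - set(train_counts))
    return singletons, unknown


# Toy corpus (assumed data), unsegmented.
train_raw = ["he walks", "she walked", "he talks", "she talked", "they jump"]
test_raw = ["she jumped"]

# The same corpus after a hand-made, purely illustrative segmentation
# that splits off the suffixes "s" and "ed".
train_seg = ["he walk s", "she walk ed", "he talk s", "she talk ed", "they jump"]
test_seg = ["she jump ed"]

raw_stats = corpus_stats(train_raw, test_raw)
seg_stats = corpus_stats(train_seg, test_seg)
print("raw       (singletons, unknown):", raw_stats)
print("segmented (singletons, unknown):", seg_stats)
```

In this toy setting the segmented corpus has fewer singletons and no unknown test types, because inflected forms are decomposed into pieces that recur across the training sentences. The effect grows as the training set shrinks, which is consistent with the abstract's remark about small training data.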
