A prototype English-Turkish statistical machine translation system

Translating one natural language (text or speech) to another natural language automatically is known as machine translation. Machine translation is one of the major, oldest and the most active areas in natural language processing. The last decade and a half have seen the rise of the use of statistical approaches to the problem of machine translation. Statistical approaches learn translation parameters automatically from alignment text instead of relying on writing rules which is labor intensive. Although there has been quite extensive work in this area for some language pairs, there has not been research for the Turkish - English language pair. In this thesis, we present the results of our investigation and development of a state-of-theart statistical machine translation prototype from English to Turkish. Developing an English to Turkish statistical machine translation prototype is an interesting problem from a number of perspectives. The most important challenge is that English and Turkish are typologically rather distant languages. While English has very limited morphology and rather fixed Subject-Verb-Object constituent order, Turkish is an agglutinative language with very flexible (but Subject-Object-Verb dominant) constituent order and a very rich and productive derivational and inflectional morphology with word structures that can correspond to complete phrases of several words in English when translated. Our research is focused on making scientific contributions to the state-of-the-art by taking into account certain morphological properties of Turkish (and possibly similar languages) that have not been addressed sufficiently in previous research for other languages. In this thesis; we investigate how different morpheme-level representations of morphology on both the English and the Turkish sides impact statistical translation results. We experiment with local word ordering on the English side to bring the word order of specific English prepositional phrases and auxiliary verb complexes, in line with the corresponding case marked noun forms and complex verb forms, on the Turkish side to help with word alignment. We augment the training data with sentences just with content words (noun, verb, adjective, adverb) obtained from the original training data and with highly-reliable phrase-pairs obtained iteratively from an earlier phrase alignment to alleviate the dearth of the parallel data available. We use word-based language model in the reranking of the n-best lists in addition to the morpheme-based language model used for decoding, so that we can incorporate both the local morphotactic constraints and local word ordering constraints. Lastly, we present a procedure for repairing the decoder output by correcting words with incorrect morphological structure and out-of-vocabulary with respect to the training data and language model to further improve the translations. We also include fine-grained evaluation results and some oracle scores with the BLEU+ tool which is an extension of the evaluation metric BLEU. After all research and development, we improve from 19.77 BLEU points for our word-based baseline model to 27.60 BLEU points for an improvement of 7.83 points or about 40% relative improvement.

[1]  Robert L. Mercer,et al.  Aligning Sentences in Parallel Corpora , 1991, ACL.

[2]  Sharon Goldwater,et al.  Improving Statistical MT through Morphological Analysis , 2005, HLT.

[3]  Daniel Marcu,et al.  Fast Decoding and Optimal Decoding for Machine Translation , 2001, ACL.

[4]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[5]  C. Fellbaum An Electronic Lexical Database , 1998 .

[6]  Kristina Toutanova,et al.  Generating Complex Morphology for Machine Translation , 2007, ACL.

[7]  John D. Lafferty,et al.  The Candide System for Machine Translation , 1994, HLT.

[8]  Franz Josef Och,et al.  Minimum Error Rate Training in Statistical Machine Translation , 2003, ACL.

[9]  Miles Osborne,et al.  Modelling Lexical Redundancy for Machine Translation , 2006, ACL.

[10]  Hermann Ney,et al.  A Systematic Comparison of Various Statistical Alignment Models , 2003, CL.

[11]  Chao Wang,et al.  Chinese Syntactic Reordering for Statistical Machine Translation , 2007, EMNLP.

[12]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[13]  Michael Gamon,et al.  Normalizing German and English inflectional morphology to improve statistical word alignment , 2004, AMTA.

[14]  Beatrice Santorini,et al.  Building a Large Annotated Corpus of English: The Penn Treebank , 1993, CL.

[15]  Young-Suk Lee,et al.  Morphological Analysis for Statistical Machine Translation , 2004, NAACL.

[16]  Fei Xia,et al.  Improving a Statistical MT System with Automatically Learned Rewrite Patterns , 2004, COLING.

[17]  Kemal Oflazer,et al.  Building a wordnet for Turkish , 2004 .

[18]  Philipp Koehn,et al.  Europarl: A Parallel Corpus for Statistical Machine Translation , 2005, MTSUMMIT.

[19]  Kevin Knight,et al.  Decoding Complexity in Word-Replacement Translation Models , 1999, Comput. Linguistics.

[20]  George A. Miller,et al.  WordNet: A Lexical Database for English , 1995, HLT.

[21]  I. Dan Melamed,et al.  A Geometric Approach to Mapping Bitext Correspondence , 1996, EMNLP.

[22]  David Chiang,et al.  A Hierarchical Phrase-Based Model for Statistical Machine Translation , 2005, ACL.

[23]  Hermann Ney,et al.  Phrase-Based Statistical Machine Translation , 2002, KI.

[24]  Jorge Hankamer,et al.  Morphological parsing and the lexicon , 1989 .

[25]  Hermann Ney,et al.  Improved Statistical Alignment Models , 2000, ACL.

[26]  Hermann Ney,et al.  POS-based Word Reorderings for Statistical Machine Translation , 2006, LREC.

[27]  Philipp Koehn,et al.  Factored Translation Models , 2007, EMNLP.

[28]  George R. Doddington,et al.  Automatic Evaluation of Machine Translation Quality Using N-gram Co-Occurrence Statistics , 2002 .

[29]  Hermann Ney,et al.  Discriminative Training and Maximum Entropy Models for Statistical Machine Translation , 2002, ACL.

[30]  Robert C. Moore Fast and accurate sentence alignment of bilingual corpora , 2002, AMTA.

[31]  W. J. Hutchins,et al.  Machine Translation: A Brief History , 1995 .

[32]  Robert L. Mercer,et al.  But Dictionaries Are Data Too , 1993, HLT.

[33]  John D. Lafferty,et al.  Analysis, statistical transfer, and synthesis in machine translation , 1992, TMI.

[34]  Hermann Ney,et al.  An Evaluation Tool for Machine Translation: Fast Evaluation for MT Research , 2000, LREC.

[35]  Hermann Ney,et al.  Improvements in Phrase-Based Statistical Machine Translation , 2004, NAACL.

[36]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[37]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[38]  Hermann Ney,et al.  Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information , 2004, CL.

[39]  Philipp Koehn,et al.  Clause Restructuring for Statistical Machine Translation , 2005, ACL.

[40]  Hermann Ney,et al.  Accelerated DP based search for statistical translation , 1997, EUROSPEECH.

[41]  Robert L. Mercer,et al.  The Mathematics of Statistical Machine Translation: Parameter Estimation , 1993, CL.

[42]  Hermann Ney,et al.  The Alignment Template Approach to Statistical Machine Translation , 2004, CL.

[43]  Ulrich Germann,et al.  Greedy Decoding for Statistical Machine Translation in Almost Linear Time , 2003, NAACL.

[44]  Hermann Ney,et al.  HMM-Based Word Alignment in Statistical Translation , 1996, COLING.

[45]  Adam L. Berger,et al.  A Maximum Entropy Approach to Natural Language Processing , 1996, CL.

[46]  Kemal Oflazer,et al.  Two-level Description of Turkish Morphology , 1993, EACL.

[47]  Shankar Kumar,et al.  A Weighted Finite State Transducer Implementation of the Alignment Template Model for Statistical Machine Translation , 2003, NAACL.

[48]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.

[49]  Deniz Yuret,et al.  Learning Morphological Disambiguation Rules for Turkish , 2006, NAACL.

[50]  Cigdem Keyder Turhan An English to Turkish Machine Translation System Using Structural Mapping , 1997, ANLP.