English-latvian SMT: the challenge of translating into a free word order language

This paper presents a comparative study of two approaches to statistical machine translation (SMT) and their application to a task of English-to-Latvian translation, which is still an open research line in the field of automatic translation. We consider a state-of-the-art phrase-based SMT and an alternative N-gram-based SMT systems. The major differences between these two approaches lie in the distinct representations of bilingual units, which are the components of the bilingual model driving translation process and in the statistical modeling of the translation context. Latvian being a rather free word order language implies additional difficulties to the translation process. We contrast different reordering models and investigate how well they deal with the word ordering issue. Moving beyond automatic scores of translation quality that are classically presented in MT research papers, we contribute presenting a manual error analysis of MT systems output that helps to shed light on advantages and disadvantages of the SMT systems under consideration and identify the most prominent source of errors typical for both SMT systems.

[1]  Tomaz Erjavec,et al.  The JRC-Acquis: A Multilingual Aligned Parallel Corpus with 20+ Languages , 2006, LREC.

[2]  José A. R. Fonollosa,et al.  N-Gram-Based Statistical Machine Translation versus Syntax Augmented Machine Translation: Comparison and System Combination , 2009, EACL.

[3]  Francisco Casacuberta,et al.  Architectures for Speech-to-Speech Translation Using Finite-state Models , 2002, Speech-to-Speech Translation@ACL.

[4]  John Cocke,et al.  A Statistical Approach to Machine Translation , 1990, CL.

[5]  Mark Dras,et al.  Statistical Machine Translation of Australian Aboriginal Languages: Morphological Analysis with Languages of Differing Morphological Richness , 2007, ALTA.

[6]  Christoph Tillmann,et al.  A Unigram Orientation Model for Statistical Machine Translation , 2004, NAACL.

[7]  Hermann Ney,et al.  The Alignment Template Approach to Statistical Machine Translation , 2004, CL.

[8]  Hermann Ney,et al.  Discriminative Training and Maximum Entropy Models for Statistical Machine Translation , 2002, ACL.

[9]  José B. Mariño,et al.  An n-gram-based statistical machine translation decoder , 2005, INTERSPEECH.

[10]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[11]  Hermann Ney,et al.  Accelerated DP based search for statistical translation , 1997, EUROSPEECH.

[12]  Hermann Ney,et al.  Error Analysis of Statistical Machine Translation Output , 2006, LREC.

[13]  Cyril Goutte Automatic Evaluation of Machine Translation Quality , 2006 .

[14]  José B. Mariño,et al.  Improving statistical MT by coupling reordering and decoding , 2006, Machine Translation.

[15]  José A. R. Fonollosa,et al.  Ngram-based versus Phrase-based Statistical Machine Translation , 2005, IWSLT.

[16]  Philipp Koehn,et al.  Statistical Significance Tests for Machine Translation Evaluation , 2004, EMNLP.

[17]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[18]  Beryl Hoffman,et al.  Translating into Free Word Order Languages , 1996, COLING.

[19]  José B. Mariño,et al.  N-gram-based versus phrase-based statistical machine translation , 2005, IWSLT.

[20]  Franz Josef Och,et al.  A Systematic Comparison of Phrase-Based, Hierarchical and Syntax-Augmented Statistical MT , 2008, COLING.

[21]  José B. Mariño,et al.  N-gram-based Machine Translation , 2006, CL.

[22]  Marta R. Costa-jussà,et al.  Statistical Machine Reordering , 2006, EMNLP.

[23]  Hervé Bourlard,et al.  On the Use of Information Retrieval Measures for Speech Recognition Evaluation , 2004 .

[24]  Andreas Stolcke,et al.  SRILM - an extensible language modeling toolkit , 2002, INTERSPEECH.