Paper information - Morphological Processing for English-Tamil Statistical Machine Translation

Experiments reported in the literature suggest that, in statistical machine translation (SMT), preprocessing or postprocessing of morphologically rich languages improves translation quality. In this work, we focus on the English-Tamil language pair. We implement suffix-separation rules for both languages and evaluate the impact of this preprocessing on the translation quality of both the phrase-based and the hierarchical models, using BLEU scores and a small manual evaluation. The results confirm that our simple suffix-based morphological processing improves translation performance. A by-product of our efforts is a new parallel corpus of 190k sentence pairs gathered from the web.

Ondřej Bojar | Zdeněk Žabokrtský | Loganathan Ramasamy

[1] Pushpak Bhattacharyya, et al. Case markers and Morphology: Addressing the crux of the fluency problem in English-Hindi SMT, 2009, ACL.

[2] Philipp Koehn, et al. Moses: Open Source Toolkit for Statistical Machine Translation, 2007, ACL.

[3] Daniel Zeman, et al. English–Hindi Translation in 21 Days, 2008.

[4] Hermann Ney, et al. Statistical Machine Translation with Scarce Resources Using Morpho-syntactic Information, 2004, CL.

[5] Philipp Koehn, et al. 462 Machine Translation Systems for Europe, 2009, MTSUMMIT.

[6] Ulrich Germann. Building a Statistical Machine Translation System from Scratch: How Much Bang for the Buck Can We Expect?, 2001, DDMMT@ACL.

[7] Tanveer A. Faruquie, et al. An English-Hindi Statistical Machine Translation System, 2004, IJCNLP.

[8] Philipp Koehn, et al. Findings of the 2010 Joint Workshop on Statistical Machine Translation and Metrics for Machine Translation, 2010, WMT@ACL.

[9] Young-Suk Lee, et al. Morphological Analysis for Statistical Machine Translation, 2004, NAACL.

[10] Ondřej Bojar, et al. A Grain of Salt for the WMT Manual Evaluation, 2011, WMT@EMNLP.

[11] Pushpak Bhattacharyya, et al. Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation, 2008, IJCNLP.

[12] Matt Post, et al. Constructing Parallel Corpora for Six Indian Languages via Crowdsourcing, 2012, WMT@NAACL-HLT.

[13] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[14] Thomas Lehmann, et al. A grammar of modern Tamil, 1993.

[15] András Kornai, et al. Parallel corpora for medium density languages, 2007.

[16] David Chiang, et al. A Hierarchical Phrase-Based Model for Statistical Machine Translation, 2005, ACL.