Paper Information - Tackling Close Cousins: Experiences In Developing Statistical Machine Translation Systems For Marathi And Hindi

In this paper we present our experiences in building Statistical Machine Translation (SMT) systems for the Indian language pair Marathi and Hindi, which are close cousins. We briefly point out the similarities and differences between the two languages, with emphasis on the translation of Krudantas (Verb Groups), which rule-based systems do not handle well. Marathi, being a language with agglutinative suffixes, poses a challenge because the corpus does not cover all word forms. To remedy this, we explored Factored SMT, which incorporates linguistic analyses in a variety of ways. We evaluate our systems and show through error analysis that, even with small corpora, incorporating morphological analysis alone yields a substantial improvement of approximately 10-15% in translation quality over the baseline. We also evaluate our SMT systems indirectly by analysing and reporting the improvement in the translation quality of a Marathi-to-Hindi rule-based system (Sampark) when SMT translations of Krudantas are injected into it. We believe that our work will help researchers working with limited corpora on similar morphologically rich language pairs and related phenomena to develop quality MT systems.

Pushpak Bhattacharyya | Raj Dabre | Jyotesh Choudhari
