A Novel Approach for English to South Dravidian Language Statistical Machine Translation System

Developing a full-fledged bilingual machine translation (MT) system for any two natural languages with limited electronic resources and tools is a challenging and demanding task. This paper presents the development of a statistical machine translation (SMT) system from English to South Dravidian languages such as Malayalam and Kannada that incorporates syntactic and morphological information. SMT is a data-oriented statistical framework for translating text from one natural language to another based on knowledge extracted from a bilingual corpus. Although there have been efforts towards building an English to South Dravidian translation system, no efficient translation system exists so far. The first and most important step in SMT is creating a well-aligned parallel corpus for training the system. Experimental research shows that the existing methodology for bilingual parallel corpus creation is not efficient for English to South Dravidian SMT. To improve the performance of the translation system, we introduce a new approach to creating the parallel corpus. The main ideas, which we have implemented and found very effective for English to South Dravidian SMT, are: (i) reordering the English source sentence according to Dravidian syntax, (ii) applying root-suffix separation to both English and Dravidian words, and (iii) using morphological information, which substantially reduces the corpus size required for training the system. Because full-fledged parsing and morphological tools are unavailable for Malayalam and Kannada, sentence synthesis was done both manually and with an existing morphological analyzer created by Amrita University. Our experiments show that the systems perform well and achieve very competitive accuracy with small bilingual corpora. The proposed ideas can be applied directly to other South Dravidian languages such as Tamil and Telugu with minor changes.

Keywords: SMT; Dravidian languages; parsing; morphology; inflections
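The first two ideas, source reordering and root-suffix separation, can be illustrated with a minimal sketch. It assumes the English sentence has already been split into chunks labeled subject (`S`), verb (`V`), or object (`O`), and that moving the verb chunk to the end approximates the verb-final order of Dravidian sentences. The function names, the role labels, and the tiny suffix list are hypothetical and chosen only for illustration; the abstract does not specify the actual reordering rules or the morphological analyzer's output format.

```python
from typing import List, Tuple

# Hypothetical suffix inventory for illustration only; not taken from the paper.
ENGLISH_SUFFIXES = ("ing", "ed", "s")


def reorder_to_sov(chunks: List[Tuple[str, str]]) -> List[str]:
    """Move verb chunks after all other chunks (SVO -> SOV).

    Each chunk is (text, role) with role in {"S", "V", "O"}; the roles are
    assumed to come from an external parser.
    """
    non_verbs = [text for text, role in chunks if role != "V"]
    verbs = [text for text, role in chunks if role == "V"]
    return non_verbs + verbs


def split_suffix(word: str, min_root: int = 3) -> List[str]:
    """Naively split a word into root and '+suffix' by string matching.

    A real system would use a morphological analyzer; this rule will
    mis-split words such as 'this'.
    """
    for suffix in ENGLISH_SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_root:
            return [word[: -len(suffix)], "+" + suffix]
    return [word]


def preprocess(chunks: List[Tuple[str, str]]) -> str:
    """Reorder the chunks, then apply root-suffix separation to every word."""
    tokens: List[str] = []
    for chunk in reorder_to_sov(chunks):
        for word in chunk.split():
            tokens.extend(split_suffix(word))
    return " ".join(tokens)


sentence = [("Ram", "S"), ("played", "V"), ("the games", "O")]
print(preprocess(sentence))  # prints "Ram the game +s play +ed"
```

Preprocessing the English side this way brings its word order and token granularity closer to the target side before word alignment, which is the intuition behind needing a smaller training corpus.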
