The IIT Bombay Hindi-English Translation System at WMT 2014

In this paper, we describe our EnglishHindi and Hindi-English statistical systems submitted to the WMT14 shared task. The core components of our translation systems are phrase based (Hindi-English) and factored (English-Hindi) SMT systems. We show that the use of number, case and Tree Adjoining Grammar information as factors helps to improve English-Hindi translation, primarily by generating morphological inflections correctly. We show improvements to the translation systems using pre-procesing and post-processing components. To overcome the structural divergence between English and Hindi, we preorder the source side sentence to conform to the target language word order. Since parallel corpus is limited, many words are not translated. We translate out-of-vocabulary words and transliterate named entities in a post-processing stage. We also investigate ranking of translations from multiple systems to select the best translation.

[1]  Philipp Koehn,et al.  Factored Translation Models , 2007, EMNLP.

[2]  Rohit Gupta,et al.  Reordering rules for English-Hindi SMT , 2013, HyTra@ACL.

[3]  Dan Klein,et al.  Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network , 2003, NAACL.

[4]  Aravind K. Joshi,et al.  Tree-adjoining grammars and lexicalized grammars , 1992, Tree Automata and Languages.

[5]  Ananthakrishnan Ramanathan,et al.  Handling verb phrase morphology in highly inflected Indian languages for Machine Translation , 2011, IJCNLP.

[6]  Srinivas Bangalore,et al.  Supertagging: An Approach to Almost Parsing , 1999, CL.

[7]  Hermann Ney,et al.  Chunk-Level Reordering of Source Language Sentences with Automatically Learned Rules for Statistical Machine Translation , 2007, SSST@HLT-NAACL.

[8]  Dipti Misra Sharma,et al.  AnnCorra : Annotating Corpora Guidelines For POS And Chunk Annotation For Indian Languages , 2008 .

[9]  Christopher D. Manning,et al.  Stanford typed dependencies manual , 2010 .

[10]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[11]  Pushpak Bhattacharyya,et al.  Shata-Anuvadak: Tackling Multiway Translation of Indian Languages , 2014, LREC.

[12]  Dan Klein,et al.  Accurate Unlexicalized Parsing , 2003, ACL.

[13]  Salim Roukos,et al.  Bleu: a Method for Automatic Evaluation of Machine Translation , 2002, ACL.

[14]  Chirag Shah,et al.  Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics , 2009 .

[15]  Philipp Koehn,et al.  Enriching Morphologically Poor Languages for Statistical Machine Translation , 2008, ACL.

[16]  Pushpak Bhattacharyya,et al.  Simple Syntactic and Morphological Processing Can Help English-Hindi Statistical Machine Translation , 2008, IJCNLP.

[17]  Pushpak Bhattacharyya,et al.  Interlingua-based English–Hindi Machine Translation and Language Divergence , 2001, Machine Translation.

[18]  Andy Way,et al.  Supertagged Phrase-Based Statistical Machine Translation , 2007, ACL.

[19]  Alon Lavie,et al.  METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments , 2005, IEEvaluation@ACL.

[20]  Daniel Marcu,et al.  Statistical Phrase-Based Translation , 2003, NAACL.

[21]  A. Jain,et al.  ANGLABHARTI: a multilingual machine aided translation project on translation from English to Indian languages , 1995, 1995 IEEE International Conference on Systems, Man and Cybernetics. Intelligent Systems for the 21st Century.

[22]  Dmitriy Genzel,et al.  Automatically Learning Source-side Reordering Rules for Large Scale Machine Translation , 2010, COLING.

[23]  Alexis Nasr,et al.  MICA: A Probabilistic Dependency Parser Based on Tree Insertion Grammars (Application Note) , 2009, HLT-NAACL.

[24]  Philipp Koehn,et al.  Clause Restructuring for Statistical Machine Translation , 2005, ACL.