Towards an Indonesian-English SMT System: A Case Study of an Under-Studied and Under-Resourced Language

This paper describes work on preparing an Indonesian-English Statistical Machine Translation (SMT) system. The work includes the creation of an Indonesian morphological analyzer, MorphInd, and the compilation of an Indonesian-English parallel corpus, IDENTIC. We build an SMT system using the state-of-the-art phrase-based SMT toolkit MOSES. We present several scenarios in which the morphological tool is used to incorporate morphological information into the SMT system trained on the compiled parallel corpus.
