Experiment on a phrase-based statistical machine translation using PoS Tag information for Sundanese into Indonesian

This paper discusses Sundanese-to-Indonesian text translation as an instance of low-resource language pair translation. The size of the parallel corpus has a significant impact on statistical machine translation, yet to date no ready-to-use Sundanese-Indonesian parallel corpus exists. We therefore use PoS tags in the translation model, rather than the surface form alone, to obtain better translation results. The experiment was carried out to obtain early results for Sundanese-to-Indonesian text translation and to identify the problems that arise. The results show that the model using both surface form and PoS tags slightly outperformed the model using only the surface form. However, the experiment faced two problems: a large number of out-of-vocabulary (OOV) words, caused by the limited size of the parallel corpus, and improper phrase translations, caused by noise in the parallel corpus such as typos and inconsistent spelling of words in the Sundanese corpus.
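The two technical ideas in the abstract, pairing each surface form with its PoS tag and measuring how many test words are OOV, can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `to_factored` and `oov_rate`, the `|` separator, and the placeholder tokens and tag labels are all assumptions made for illustration.

```python
from typing import List


def to_factored(tokens: List[str], tags: List[str], sep: str = "|") -> str:
    """Join each surface form with its PoS tag, e.g. 'word|TAG'.

    The separator and the tag labels are illustrative assumptions.
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    return " ".join(f"{word}{sep}{tag}" for word, tag in zip(tokens, tags))


def oov_rate(train_sentences: List[List[str]],
             test_sentences: List[List[str]]) -> float:
    """Fraction of test tokens that never occur in the training sentences."""
    vocab = {word for sentence in train_sentences for word in sentence}
    test_tokens = [word for sentence in test_sentences for word in sentence]
    if not test_tokens:
        return 0.0
    unseen = sum(1 for word in test_tokens if word not in vocab)
    return unseen / len(test_tokens)


if __name__ == "__main__":
    # Placeholder tokens and tags, not real Sundanese data.
    print(to_factored(["w1", "w2", "w3"], ["TAG_A", "TAG_B", "TAG_A"]))
    print(oov_rate([["w1", "w2"]], [["w1", "w4"]]))
```

With a small training corpus the vocabulary set stays small, so the share of unseen test tokens grows, which is the OOV effect the abstract attributes to the limited parallel corpus.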
