Development of Indonesian-Japanese statistical machine translation using lemma translation and additional post-process

Although research on statistical machine translation has grown rapidly, little work has addressed Indonesian-Japanese statistical machine translation. Previous research on this language pair has revealed several problems in the translation process, such as low corpus coverage, unknown words, and sentence reordering. In this research, we propose two methods to address these problems: lemma translation with generated surface forms, and additional post-processing. Lemma translation uses the lemma and part-of-speech (POS) tag of each word in the translation process. The additional post-processing consists of rule-based katakana translation and unknown word substitution. Experimental data were collected from JLPT (Japanese Language Proficiency Test) Level 3, with a total of 1132 sentences. Experiments with these methods showed an improvement over the baseline system, with a 116% increase in BLEU score for Japanese-to-Indonesian translation and a 26% increase in BLEU score for Indonesian-to-Japanese translation.
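To make the idea of rule-based katakana post-processing concrete, the following is a minimal sketch in Python, not the rule set used in this work. It assumes that katakana tokens left untranslated in Japanese-to-Indonesian output are romanized by a character table; the table `KATAKANA`, the function names, and the handling of the long-vowel mark and small tsu are all hypothetical choices made for illustration.

```python
# Hypothetical, deliberately incomplete katakana table (illustration only;
# the rules actually used in this research are not given in the abstract).
KATAKANA = {
    "ア": "a", "イ": "i", "ウ": "u", "エ": "e", "オ": "o",
    "カ": "ka", "キ": "ki", "ク": "ku", "ケ": "ke", "コ": "ko",
    "サ": "sa", "シ": "shi", "ス": "su", "セ": "se", "ソ": "so",
    "タ": "ta", "チ": "chi", "ツ": "tsu", "テ": "te", "ト": "to",
    "ナ": "na", "ニ": "ni", "ヌ": "nu", "ネ": "ne", "ノ": "no",
    "ハ": "ha", "ヒ": "hi", "フ": "fu", "ヘ": "he", "ホ": "ho",
    "マ": "ma", "ミ": "mi", "ム": "mu", "メ": "me", "モ": "mo",
    "ラ": "ra", "リ": "ri", "ル": "ru", "レ": "re", "ロ": "ro",
    "ベ": "be", "ド": "do", "ン": "n",
}


def is_katakana(token):
    """Return True if every character lies in the Unicode katakana block."""
    return bool(token) and all("\u30a0" <= ch <= "\u30ff" for ch in token)


def romanize_katakana(token):
    """Romanize a katakana token; return it unchanged if a character is unmapped."""
    out = []
    double_next = False
    for ch in token:
        if ch == "ー":  # long-vowel mark: simply dropped in this sketch
            continue
        if ch == "ッ":  # small tsu: double the consonant of the next syllable
            double_next = True
            continue
        roman = KATAKANA.get(ch)
        if roman is None:
            return token  # outside the toy table: leave the token as-is
        if double_next:
            roman = roman[0] + roman
            double_next = False
        out.append(roman)
    return "".join(out)


def post_process(tokens):
    """Romanize katakana tokens in a translated sentence; keep other tokens."""
    return [romanize_katakana(t) if is_katakana(t) else t for t in tokens]


if __name__ == "__main__":
    print(post_process(["saya", "membeli", "カメラ"]))
```

A real system would need a complete table plus spelling adaptation toward Indonesian orthography; this sketch only shows where such rules would sit in a post-processing step that runs after decoding.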