Integration of Bilingual Lists for Domain-Specific Statistical Machine Translation for Sinhala-Tamil

The availability of high-quality parallel data is a major requirement for building a reasonably well-performing statistical machine translation (SMT) system. Developing a decent SMT system for a low-resource language pair such as Sinhala and Tamil, which lacks a large parallel corpus, is therefore challenging. Past research on other language pairs has shown that various methodologies for integrating terminology or bilingual lists can improve the quality of SMT systems, particularly for domain-specific SMT. In this paper, we explore whether this approach is effective for Sinhala-Tamil machine translation in the domain of official government documents. We evaluate the impact of three types of bilingual lists: a list of government organizations and official designations, a glossary related to government administration and operations, and a general bilingual dictionary. Each list is integrated using four different methodologies (three static and one dynamic). Of the four methodologies, one gave notable improvements over the baseline for all three types of list.
