RuLearn: an Open-source Toolkit for the Automatic Inference of Shallow-transfer Rules for Machine Translation

Abstract This paper presents ruLearn, an open-source toolkit for the automatic inference of rules for shallow-transfer machine translation from scarce parallel corpora and morphological dictionaries. ruLearn makes rule-based machine translation an appealing alternative for under-resourced language pairs: it avoids the need for human experts to handcraft transfer rules and, in contrast to statistical machine translation, requires only a small parallel corpus (a few hundred parallel sentences proved to be sufficient). The inference algorithm implemented by ruLearn was recently published by the same authors in Computer Speech & Language (volume 32). It produces rules whose translation quality is similar to that obtained with hand-crafted rules. ruLearn generates rules that are ready for use in the Apertium platform, although they can be easily adapted to other platforms. When the rules produced by ruLearn are used together with a hybridisation strategy for integrating linguistic resources from shallow-transfer rule-based machine translation into phrase-based statistical machine translation (published by the same authors in Journal of Artificial Intelligence Research, volume 55), they help to mitigate data sparseness. This paper also shows how to use ruLearn and describes its implementation.

[1] Víctor M. Sánchez-Cartagena, et al. An open-source toolkit for integrating shallow-transfer rules into phrase-based statistical machine translation, 2012, FREEOPMT.

[2] Víctor M. Sánchez-Cartagena, et al. A generalised alignment template formalism and its application to the inference of shallow-transfer machine translation rules from scarce bilingual corpora, 2015, Comput. Speech Lang.

[3] Philipp Koehn, et al. Statistical Machine Translation, 2010, EAMT.

[4] Nikola Ljubesic, et al. Collaborative Development of a Rule-Based Machine Translator between Croatian and Serbian, 2016, EAMT.

[5] Mikel L. Forcada, et al. Automatic induction of bilingual resources from aligned parallel corpora: application to shallow-transfer machine translation, 2007, Machine Translation.

[6] Víctor M. Sánchez-Cartagena, et al. Integrating Rules and Dictionaries from Shallow-Transfer Machine Translation into Phrase-Based Statistical Machine Translation, 2016, J. Artif. Intell. Res.

[7] István Varga, et al. Transfer rule generation for a Japanese-Hungarian machine translation system, 2009, MTSUMMIT.

[8] Philipp Koehn, et al. Europarl: A Parallel Corpus for Statistical Machine Translation, 2005, MTSUMMIT.

[9] Philipp Koehn, et al. Statistical Significance Tests for Machine Translation Evaluation, 2004, EMNLP.

[10] Salim Roukos, et al. Bleu: a Method for Automatic Evaluation of Machine Translation, 2002, ACL.

[11] Matthew J. Saltzman, et al. Computational Experience with a Software Framework for Parallel Integer Programming, 2009, INFORMS J. Comput.

[12] Francis M. Tyers, et al. Apertium: a free/open-source platform for rule-based machine translation, 2011, Machine Translation.

[13] Mikel L. Forcada, et al. Inferring Shallow-Transfer Machine Translation Rules from Small Parallel Corpora, 2014, J. Artif. Intell. Res.

[14] Marta R. Costa-jussà, et al. Description of the Chinese-to-Spanish Rule-Based Machine Translation System Developed Using a Hybrid Combination of Human Annotation and Statistical Techniques, 2016, ACM Trans. Asian Low Resour. Lang. Inf. Process.

[15] Min-Yen Kan, et al. Perspectives on crowdsourcing annotations for natural language processing, 2012, Language Resources and Evaluation.