Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation

Although parallel corpora (large collections of translated texts) have proven their value in training quality machine translation systems, they are hard to come by for the majority of languages. To compensate for this scarcity, a relatively small collection can be processed in more depth: the texts can be cleaned further and split and aligned more accurately. We apply this approach to an existing English/Zulu parallel corpus that has been used for statistical machine translation experiments. After these preprocessing steps, we rerun the same experiments for comparison. Our results suggest that the compatibility of bitexts, the choice of sentence splitters used on different parts of the text, and manual work may have a notable effect on both corpus size and automatic translation quality.
