Wiki-Translator: Multilingual Experiments for In-Domain Translations

The benefits of using comparable corpora to improve the translation quality of statistical machine translation systems have already been shown by various researchers. The usual approach starts with a baseline system, trained on out-of-domain parallel corpora, which is then adapted to the domain in which new translations are needed. The adaptation to a new domain, especially a narrow one, is based on data extracted from comparable corpora in the new domain or in one as close to it as possible. This article reports on a slightly different approach: building an SMT system entirely from comparable data for the domain of interest. Certainly, the approach is feasible only if the comparable corpora are large enough to yield SMT-useful data in quantities sufficient for reliable training; the more comparable data, the better the results. Wikipedia is definitely a very good candidate for such an experiment. We report on large-scale experiments showing statistically significant improvements over a baseline system built from highly similar (almost parallel) text fragments extracted from Wikipedia. The improvements are related to what we call the level of translational similarity between extracted pairs of sentences. The experiments were performed for three language pairs: Spanish-English, German-English and Romanian-English, based on sentence pairs extracted from the entire Wikipedia dumps as of December 2012. Our experiments and comparison with similar work show that indiscriminately adding more data to a training corpus is not necessarily a good thing in SMT.
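As a rough illustration of the core idea, not the authors' implementation, the minimal sketch below filters mined sentence pairs by a translational-similarity threshold before they are used for SMT training. The scoring function is a hypothetical stand-in based on bilingual-lexicon overlap; the paper's actual similarity measure, extraction pipeline, and threshold values are not reproduced here.

```python
# Minimal sketch (assumed, not the paper's method): keep only mined sentence
# pairs whose translational-similarity score exceeds a threshold.
from typing import Dict, Iterable, List, Set, Tuple


def lexical_overlap_score(src: str, tgt: str,
                          lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with at least one known translation in tgt.
    This is a hypothetical proxy for 'translational similarity'."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens
               if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def filter_pairs(pairs: Iterable[Tuple[str, str]],
                 lexicon: Dict[str, Set[str]],
                 threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Return only the (source, target) pairs scoring at or above threshold."""
    return [(s, t) for s, t in pairs
            if lexical_overlap_score(s, t, lexicon) >= threshold]


if __name__ == "__main__":
    # Toy bilingual lexicon and mined pairs, for illustration only.
    toy_lexicon = {"casa": {"house"}, "roja": {"red"}}
    mined = [("la casa roja", "the red house"),       # near-parallel pair
             ("la casa roja", "unrelated sentence")]  # comparable-corpus noise
    print(filter_pairs(mined, toy_lexicon, threshold=0.5))
```

Raising or lowering the threshold trades off training-data quantity against translational similarity, which is the dimension the reported experiments vary.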
