Large SMT data-sets extracted from Wikipedia

The article presents experiments on mining Wikipedia to extract sentence pairs useful for SMT in three language pairs. Each extracted sentence pair is assigned a cross-lingual lexical similarity score, and several evaluations were conducted to estimate the similarity thresholds that yield the most useful data for training SMT systems in the three language pairs. The experiments showed that, at similarity scores above 0.7, all sentence pairs in the three language pairs were fully parallel. However, including less-parallel sentence pairs (i.e., those with lower similarity scores) in the training sets brought significant improvements in translation quality, as measured by BLEU. The optimized SMT systems were evaluated on unseen test sets, also extracted from Wikipedia. Since one of the main goals of our work is to help Wikipedia contributors translate new articles (with as little post-editing as possible) from major languages into less-resourced languages and vice versa, we call this type of translation experiment “in-genre” translation. As with “in-domain” translation, our evaluations showed that using only “in-genre” training data to translate new texts of the same genre is better than mixing the training data with “out-of-genre” texts, even when those are fully parallel.
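The abstract does not specify how the similarity score is computed, but the filtering step it describes is easy to picture: each mined sentence pair carries a score, and a threshold decides whether the pair enters the training set. The minimal Python sketch below illustrates this; the `lexical_similarity` function is a hypothetical bilingual-dictionary word-overlap measure standing in for the paper's actual cross-lingual lexical similarity score, and the lexicon and sentence pairs are toy data, not from the paper.

```python
# Illustrative sketch of threshold-based filtering of mined sentence pairs.
# The scoring function is a hypothetical dictionary word-overlap measure,
# NOT the paper's actual cross-lingual lexical similarity formula.

def lexical_similarity(src_tokens, tgt_tokens, lexicon):
    """Share of source tokens that have at least one dictionary
    translation present in the target sentence (stand-in measure)."""
    if not src_tokens:
        return 0.0
    tgt = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt)
    return hits / len(src_tokens)


def filter_pairs(pairs, lexicon, threshold):
    """Keep only (src, tgt, score) triples whose score meets the threshold."""
    kept = []
    for src, tgt in pairs:
        score = lexical_similarity(src.split(), tgt.split(), lexicon)
        if score >= threshold:
            kept.append((src, tgt, score))
    return kept


if __name__ == "__main__":
    # Toy English-Romanian lexicon and mined pairs (illustrative only).
    lexicon = {
        "house": {"casa"},
        "is": {"este"},
        "big": {"mare"},
        "white": {"alba"},
    }
    mined = [
        ("the house is big", "casa este mare"),      # near-parallel (0.75)
        ("the house is white", "casa este mare"),    # partially parallel (0.5)
        ("the house is big", "am plecat la munte"),  # unrelated (0.0)
    ]
    for threshold in (0.5, 0.7):
        kept = filter_pairs(mined, lexicon, threshold)
        print(f"threshold {threshold}: kept {len(kept)} pair(s)")
```

In this toy setting, lowering the threshold from 0.7 to 0.5 admits the partially parallel pair, mirroring the paper's finding that noisier, less-parallel pairs can still improve BLEU when added to the training data.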
