OpusFilter: A Configurable Parallel Corpus Filtering Toolbox

This paper introduces OpusFilter, a flexible and modular toolbox for filtering parallel corpora. It implements a number of components based on heuristic filters, language identification libraries, character-based language models, and word alignment tools, and it can easily be extended with custom filters. Bitext segments can be ranked according to their quality or domain match using single features or a logistic regression model that can be trained without manually labeled training data. We demonstrate the effectiveness of OpusFilter on the example of a Finnish-English news translation task based on noisy web-crawled training data. Applying our tool leads to improved translation quality while significantly reducing the size of the training data, also clearly outperforming an alternative ranking given in the crawled data set. Furthermore, we show the ability of OpusFilter to perform data selection for domain adaptation.

[1]  Philipp Koehn,et al.  Moses: Open Source Toolkit for Statistical Machine Translation , 2007, ACL.

[2]  Philipp Koehn,et al.  Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions , 2019, WMT.

[3]  Huda Khayrallah,et al.  On the Impact of Various Types of Noise on Neural Machine Translation , 2018, NMT@ACL.

[4]  Jörg Tiedemann,et al.  Efficient Word Alignment with Markov Chain Monte Carlo , 2016, Prague Bull. Math. Linguistics.

[5]  Rico Sennrich,et al.  Neural Machine Translation of Rare Words with Subword Units , 2015, ACL.

[6]  Alexander M. Rush,et al.  OpenNMT: Open-Source Toolkit for Neural Machine Translation , 2017, ACL.

[7]  Mikel L. Forcada,et al.  ParaCrawl: Web-scale parallel corpora for the languages of the EU , 2019, MTSummit.

[8]  Jörg Tiedemann,et al.  Parallel Data, Tools and Interfaces in OPUS , 2012, LREC.

[9]  Huda Khayrallah,et al.  Findings of the WMT 2018 Shared Task on Parallel Corpus Filtering , 2018, WMT.

[10]  Teemu Hirsimäki,et al.  On Growing and Pruning Kneser–Ney Smoothed $ N$-Gram Models , 2007, IEEE Transactions on Audio, Speech, and Language Processing.

[11]  Philipp Koehn,et al.  Zipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora , 2017, EMNLP.

[12]  Jörg Tiedemann,et al.  The University of Helsinki Submission to the WMT19 Parallel Corpus Filtering Task , 2019, WMT.

[13]  Keith Stevens,et al.  Effective Parallel Corpus Mining using Bilingual Sentence Embeddings , 2018, WMT.

[14]  Víctor M. Sánchez-Cartagena,et al.  Prompsit’s submission to WMT 2018 Parallel Corpus Filtering shared task , 2018, WMT.