An Unsupervised System for Parallel Corpus Filtering

In this paper, we describe LMU Munich’s submission to the WMT 2018 Parallel Corpus Filtering shared task, which addresses the problem of cleaning noisy parallel corpora. Mining and cleaning parallel sentences is important for improving the quality of machine translation systems, especially for low-resource languages. We tackle this problem in a fully unsupervised fashion, relying on bilingual word embeddings created without any bilingual signal. After pre-filtering noisy data, we rank sentence pairs by their bilingual sentence-level similarity and then remove redundant data using monolingual similarity. Our unsupervised system performed well in the official evaluation of the shared task, scoring only a few BLEU points behind the best systems without requiring any parallel training data.
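The three stages named in the abstract (pre-filtering, ranking by bilingual similarity, and redundancy removal by monolingual similarity) can be illustrated with a minimal sketch. This is not the submitted system: the toy embedding table `EMB`, the averaging-based sentence vectors, the pre-filter rule, and the `dup_threshold` parameter are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

# Toy embeddings in a shared German-English space (hypothetical values).
# A real system would load bilingual word embeddings learned without
# any parallel data.
EMB = {
    "haus": (1.0, 0.0, 0.0), "house": (0.9, 0.1, 0.0),
    "katze": (0.0, 1.0, 0.0), "cat": (0.1, 0.9, 0.0),
    "rot": (0.0, 0.0, 1.0), "red": (0.0, 0.1, 0.9),
}


def sentence_vector(sentence):
    """Average the vectors of known words; return None if none are known."""
    vecs = [EMB[w] for w in sentence.lower().split() if w in EMB]
    if not vecs:
        return None
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_pairs(pairs, dup_threshold=0.99):
    """Return sentence pairs ranked best-first with near-duplicates removed."""
    # Stage 1 (assumed rule): drop pairs with an empty side or with
    # identical source and target text.
    kept = [(s, t) for s, t in pairs if s.strip() and t.strip() and s != t]

    # Stage 2: rank by bilingual sentence-level similarity.
    ranked = sorted(
        kept,
        key=lambda p: cosine(sentence_vector(p[0]), sentence_vector(p[1])),
        reverse=True,
    )

    # Stage 3: skip a pair whose target side is nearly identical
    # (monolingual similarity) to an already selected target.
    selected = []
    for src, tgt in ranked:
        tgt_vec = sentence_vector(tgt)
        if any(cosine(tgt_vec, sentence_vector(t)) >= dup_threshold
               for _, t in selected):
            continue
        selected.append((src, tgt))
    return selected
```

In practice one would keep only the top-ranked pairs up to a token budget; averaging word vectors is used here only because it is the simplest sentence-level similarity to write down.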
