Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

Machine translation is highly sensitive to the size and quality of the training data, which has led to an increasing interest in collecting and filtering large parallel corpora. In this paper, we propose a new method for this task based on multilingual sentence embeddings. In contrast to previous approaches, which rely on nearest neighbor retrieval with a hard threshold over cosine similarity, our proposed method accounts for the scale inconsistencies of this measure, considering the margin between a given sentence pair and its closest candidates instead. Our experiments show large improvements over existing methods. We outperform the best published results on the BUCC mining task and the UN reconstruction task by more than 10 F1 and 30 precision points, respectively. Filtering the English-German ParaCrawl corpus with our approach, we obtain 31.2 BLEU points on newstest2014, an improvement of more than one point over the best official filtered version.
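The margin idea described above can be illustrated with a small sketch. The abstract does not give the scoring formula, so this sketch makes its own assumptions: a ratio formulation (the cosine of a pair divided by the average cosine of each sentence to its k nearest candidates), the value of `k`, the threshold, and all function names (`cosine`, `mean_top_k`, `margin_scores`, `mine`), which are hypothetical. It is not the authors' implementation, and it uses brute-force search over toy-sized inputs rather than an approximate nearest neighbor index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_top_k(sims, k):
    """Average of the k largest similarities (fewer if the list is shorter)."""
    top = sorted(sims, reverse=True)[:k]
    return sum(top) / len(top)


def margin_scores(src, tgt, k=2):
    """Score every (source, target) pair relative to its neighborhood.

    Assumed ratio formulation: cos(x, y) divided by the mean of the two
    neighborhood averages. A raw cosine of 0.7 can therefore score high for a
    sentence whose other candidates are all far away, and low for a sentence
    whose candidates are all close. Assumes the averages are positive.
    """
    sim = [[cosine(x, y) for y in tgt] for x in src]
    src_avg = [mean_top_k(row, k) for row in sim]
    tgt_avg = [
        mean_top_k([sim[i][j] for i in range(len(src))], k)
        for j in range(len(tgt))
    ]
    return [
        [sim[i][j] / ((src_avg[i] + tgt_avg[j]) / 2) for j in range(len(tgt))]
        for i in range(len(src))
    ]


def mine(src, tgt, k=2, threshold=1.0):
    """For each source sentence, keep its best target if the margin score passes."""
    scores = margin_scores(src, tgt, k)
    pairs = []
    for i, row in enumerate(scores):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs
```

Calling `mine(src, tgt)` with two lists of embedding vectors returns `(source_index, target_index, score)` triples. Because the threshold is applied to the relative score rather than to the raw cosine, a single cutoff applies more evenly across sentences whose similarity scales differ, which is the inconsistency the abstract attributes to hard cosine thresholds.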
