Filtering and Mining Parallel Data in a Joint Multilingual Space

We learn a joint multilingual sentence embedding and use the distance between sentences in different languages to filter noisy parallel data and to mine for parallel data in large news collections. Filtering out 25% of the training data improves a competitive baseline on the WMT’14 English-to-German task by 0.3 BLEU. The same approach is used to mine additional bitexts for the WMT’14 system and to obtain competitive results on the BUCC shared task on identifying parallel sentences in comparable corpora. The approach is generic: it can be applied to many language pairs and is independent of the architecture of the machine translation system.
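The filtering and mining steps can be illustrated with a minimal sketch. It assumes sentence embeddings have already been computed by some multilingual encoder, and it uses cosine distance with a fixed threshold; the abstract does not specify the distance measure or the threshold, so both are assumptions made for illustration. The function names (`cosine_distance`, `filter_bitext`, `mine_bitext`) and the toy two-dimensional vectors are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def filter_bitext(pairs, src_vecs, tgt_vecs, threshold):
    """Keep sentence pairs whose embeddings lie within `threshold` of each other."""
    return [
        pair
        for pair, u, v in zip(pairs, src_vecs, tgt_vecs)
        if cosine_distance(u, v) <= threshold
    ]


def mine_bitext(src_vecs, tgt_vecs, threshold):
    """For each source vector, find the nearest target vector (brute force)
    and keep the index pair only if the distance is within `threshold`."""
    mined = []
    for i, u in enumerate(src_vecs):
        dists = [cosine_distance(u, v) for v in tgt_vecs]
        j = min(range(len(dists)), key=dists.__getitem__)
        if dists[j] <= threshold:
            mined.append((i, j))
    return mined


# Toy data: hand-made 2-d "embeddings", not output of a real encoder.
pairs = [("a house", "ein Haus"), ("a dog", "das Wetter"), ("good day", "guten Tag")]
src = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
tgt = [[0.9, 0.1], [1.0, 0.0], [1.0, 1.1]]
kept = filter_bitext(pairs, src, tgt, threshold=0.1)

# Mining: the third source vector has no close target and is dropped.
mine_src = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
mine_tgt = [[0.0, 1.0], [0.7, 0.7], [1.0, 0.05]]
mined = mine_bitext(mine_src, mine_tgt, threshold=0.1)
```

The brute-force search in `mine_bitext` is quadratic in the number of sentences, so a large news collection would need an approximate or indexed nearest-neighbor search instead.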
