Unsupervised Parallel Corpus Mining on Web Data

With a large amount of parallel data, neural machine translation systems can deliver human-level performance on sentence-level translation. However, labeling large amounts of parallel data by hand is costly. At the same time, a large volume of human-created parallel text already exists on the Internet; the main difficulty in using it is filtering the genuinely parallel sentences out of noisy website content. Existing parallel data mining methods all require labeled parallel data for training. In this paper, we present a pipeline that mines parallel corpora from the Internet in an unsupervised manner. On the widely used WMT'14 English-French and WMT'16 English-German benchmarks, a machine translation model trained on the data extracted by our pipeline achieves performance very close to the supervised results. On the WMT'16 English-Romanian and Romanian-English benchmarks, our system produces new state-of-the-art results of 39.81 and 38.95 BLEU, even when compared with supervised approaches.
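The abstract does not spell out the pipeline's components, so the sketch below is only an illustration of the kind of scoring commonly used in bitext mining, not necessarily this paper's method: margin-based (ratio) scoring over multilingual sentence embeddings. The function name `mine_pairs` and the choice of precomputed embedding matrices are assumptions made for the example.

```python
import numpy as np

def mine_pairs(src_emb, tgt_emb, k=4):
    """Margin-based (ratio) scoring of candidate sentence pairs.

    src_emb, tgt_emb: 2-D arrays of sentence embeddings (one row per
    sentence), e.g. from any multilingual sentence encoder.
    Returns, for each source sentence, the index of its best-scoring
    target sentence and the margin score of that pair.
    """
    # Normalize rows so dot products are cosine similarities.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sim = src @ tgt.T  # cosine similarity matrix

    k = min(k, src.shape[0], tgt.shape[0])
    # Average similarity to the k nearest neighbors in each direction.
    fwd = np.sort(sim, axis=1)[:, -k:].mean(axis=1)   # per source sentence
    bwd = np.sort(sim, axis=0)[-k:, :].mean(axis=0)   # per target sentence

    # Ratio margin: raw cosine divided by the mean of the two
    # neighborhood averages; penalizes "hub" sentences that are
    # similar to everything.
    margin = sim / ((fwd[:, None] + bwd[None, :]) / 2)

    best_tgt = margin.argmax(axis=1)
    best_score = margin.max(axis=1)
    return best_tgt, best_score
```

In a full mining pipeline, pairs would typically be kept only when their margin score exceeds a tuned threshold, and the surviving sentence pairs would then serve as training data for the translation model.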
