Denoising Neural Machine Translation Training with Trusted Data and Online Data Selection

Measuring the domain relevance of data and identifying or selecting well-fit in-domain data for machine translation (MT) is a well-studied problem; denoising is not. Denoising concerns a different dimension of data quality and aims to reduce the negative impact of data noise on MT training, in particular neural MT (NMT) training. This paper generalizes methods for measuring and selecting data for domain MT and applies them to denoising NMT training. The proposed approach uses trusted data and a denoising curriculum realized by online data selection. Intrinsic and extrinsic evaluations show that the approach is significantly effective when NMT is trained on data with severe noise.
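The two ingredients named above, a trusted-data noise score and a curriculum realized by online data selection, can be illustrated with a minimal sketch. The sketch below is an assumption-laden illustration, not the paper's implementation: it assumes each sentence pair is scored by the difference of its log-probability under a model adapted to trusted data versus under the noisy baseline model, and that the curriculum keeps a shrinking top fraction of each batch as training progresses. All names (`noise_score`, `select_batch`, the schedule endpoints) are hypothetical.

```python
# Hedged sketch of trusted-data denoising with online data selection.
# Assumption: higher score = the trusted-adapted model prefers the example
# more than the noisy model does, i.e. the example is likely cleaner.
def noise_score(logp_trusted, logp_noisy):
    return logp_trusted - logp_noisy

def select_batch(batch, step, total_steps, start_frac=1.0, end_frac=0.5):
    """Denoising curriculum: linearly tighten the kept fraction of each
    batch from start_frac to end_frac over training, ranking by score."""
    progress = min(step / total_steps, 1.0)
    frac = start_frac + (end_frac - start_frac) * progress
    k = max(1, int(round(frac * len(batch))))
    ranked = sorted(
        batch,
        key=lambda ex: noise_score(ex["logp_trusted"], ex["logp_noisy"]),
        reverse=True,
    )
    return ranked[:k]

# Toy usage: four sentence pairs with precomputed log-probabilities.
batch = [
    {"id": 0, "logp_trusted": -2.0, "logp_noisy": -5.0},  # clean-looking
    {"id": 1, "logp_trusted": -6.0, "logp_noisy": -2.0},  # noisy-looking
    {"id": 2, "logp_trusted": -3.0, "logp_noisy": -3.5},
    {"id": 3, "logp_trusted": -4.0, "logp_noisy": -2.5},
]
early = select_batch(batch, step=0, total_steps=100)    # keeps the full batch
late = select_batch(batch, step=100, total_steps=100)   # keeps the cleaner half
```

The design choice of selecting per batch rather than filtering the corpus once up front is what makes the selection "online": scores can be refreshed as the models evolve, so the curriculum adapts during training.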
