Mask-Predict: Parallel Decoding of Conditional Masked Language Models

Most machine translation systems generate text autoregressively, from left to right. We instead use a masked language modeling objective to train a model that predicts any subset of the target words, conditioned on both the input text and a partially masked target translation. This approach enables efficient iterative decoding: we first predict all of the target words non-autoregressively, then repeatedly mask out and regenerate the subset of words the model is least confident about. Applying this strategy for a constant number of iterations improves state-of-the-art performance for non-autoregressive and parallel decoding translation models by over 4 BLEU on average, and comes within about 1 BLEU point of a typical left-to-right transformer model while decoding significantly faster.
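To make the decoding loop concrete, the sketch below illustrates the iterative mask-and-regenerate procedure described in the abstract. It is a minimal, self-contained Python illustration, not the authors' implementation: the predictor function is a hypothetical stand-in (mocked here with random scores) for a trained conditional masked language model that returns a distribution over the target vocabulary at every position, and the linearly decaying re-masking schedule and helper names are assumptions made for clarity.

```python
import numpy as np

MASK = -1  # sentinel id standing in for the [MASK] token in this sketch


def predict_masked_tokens(src_tokens, tgt_tokens, vocab_size=32, rng=None):
    """Hypothetical stand-in for a trained conditional masked LM.

    Given source tokens and a partially masked target, return a probability
    distribution over the target vocabulary for every target position.
    Here the scores are random so the sketch runs end to end.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    logits = rng.standard_normal((len(tgt_tokens), vocab_size))
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)


def mask_predict(src_tokens, tgt_len, iterations=10, vocab_size=32):
    """Parallel decoding by iteratively re-masking low-confidence tokens."""
    tokens = np.full(tgt_len, MASK, dtype=int)   # start with a fully masked target
    confidences = np.zeros(tgt_len)

    for t in range(iterations):
        if t > 0:
            # Assumed schedule: the number of re-masked tokens decays linearly.
            n_mask = int(tgt_len * (1 - t / iterations))
            remask = np.argsort(confidences)[:n_mask]  # least confident positions
            tokens[remask] = MASK

        # Predict all currently masked positions in parallel.
        probs = predict_masked_tokens(src_tokens, tokens, vocab_size)
        masked = tokens == MASK
        tokens[masked] = probs[masked].argmax(axis=-1)
        confidences[masked] = probs[masked].max(axis=-1)

    return tokens


# Toy usage: decode a 6-token target from a 3-token source in 4 iterations.
print(mask_predict(src_tokens=[5, 9, 2], tgt_len=6, iterations=4))
```

The key property the sketch tries to convey is that every iteration fills in all masked positions simultaneously, so the total decoding cost is a constant number of parallel passes rather than one pass per output token as in left-to-right generation.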
