Masked Translation Model

We introduce the masked translation model (MTM), which combines encoding and decoding of sequences within a single model component. The MTM builds on the idea of masked language modeling and supports both autoregressive and non-autoregressive decoding strategies simply by changing the order of masking. In experiments on the WMT 2016 Romanian→English task, the MTM shows strong constant-time translation performance, outperforming related approaches of comparable complexity. We also extensively compare the various decoding strategies the MTM supports, as well as several length modeling techniques and training settings.
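To make the "changing the order of masking" idea concrete, here is a minimal decoding sketch. It assumes a hypothetical interface model(src_ids, tgt_ids) that returns per-position log-probabilities over the target vocabulary given the source and a partially masked target, and a MASK_ID token; neither is taken from the paper. The non-autoregressive branch follows the mask-predict-style iterative refinement used in constant-time decoding, while the autoregressive branch simply unmasks positions left to right.

```python
# Hedged sketch of MTM-style decoding; interface and constants are assumptions,
# not the authors' implementation.
import torch

MASK_ID = 0  # hypothetical id of the mask token


def decode(model, src_ids, tgt_len, iterations=4, autoregressive=False):
    """Fill an all-masked target of length tgt_len by repeatedly unmasking positions.

    autoregressive=True  -> unmask one position per step, left to right.
    autoregressive=False -> unmask the most confident positions over a fixed
                            number of refinement iterations (constant-time decoding).
    """
    tgt = torch.full((tgt_len,), MASK_ID, dtype=torch.long)

    if autoregressive:
        for pos in range(tgt_len):
            log_probs = model(src_ids, tgt)            # [tgt_len, vocab]
            tgt[pos] = log_probs[pos].argmax()
        return tgt

    # Non-autoregressive: unmask roughly tgt_len / iterations positions per pass,
    # always choosing the currently most confident masked slots.
    per_iter = max(1, tgt_len // iterations)
    for _ in range(iterations):
        masked = (tgt == MASK_ID).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        log_probs = model(src_ids, tgt)                # [tgt_len, vocab]
        conf, pred = log_probs[masked].max(dim=-1)     # best token per masked slot
        keep = conf.topk(min(per_iter, masked.numel())).indices
        tgt[masked[keep]] = pred[keep]
    return tgt
```

Both branches share the same forward pass over a jointly encoded source-target sequence; only the schedule deciding which positions get unmasked changes, which is what lets a single model serve both decoding regimes.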
