Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation

State-of-the-art neural machine translation models generate outputs autoregressively, where every step conditions on the previously generated tokens. This sequential nature causes inherent decoding latency. Non-autoregressive translation techniques, on the other hand, parallelize generation across positions and speed up inference at the expense of translation quality. Much recent effort has been devoted to non-autoregressive methods, aiming for a better balance between speed and quality. In this work, we re-examine the trade-off and argue that transformer-based autoregressive models can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a one-layer autoregressive decoder yields state-of-the-art accuracy with latency comparable to that of strong non-autoregressive models. Our findings suggest that the latency disadvantage of autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and we provide a new speed-quality baseline for future research toward fast, accurate translation.
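
As a concrete illustration of the layer-allocation idea, the sketch below builds a deep-encoder, shallow-decoder Transformer with PyTorch's nn.Transformer and compares its size to a standard 6-6 model. The 12-1 split, the 512-dimensional model, and the helper functions are illustrative assumptions for this sketch, not the paper's exact fairseq configuration or training setup.

```python
# Minimal sketch of the layer-allocation idea using PyTorch's nn.Transformer.
# The 12-1 split below is illustrative only; the paper's experiments use
# their own Transformer implementation and training details.
import torch
import torch.nn as nn


def build_model(num_encoder_layers, num_decoder_layers, d_model=512, nhead=8):
    """Build a Transformer with the given encoder/decoder depth split."""
    return nn.Transformer(
        d_model=d_model,
        nhead=nhead,
        num_encoder_layers=num_encoder_layers,
        num_decoder_layers=num_decoder_layers,
        dim_feedforward=2048,
    )


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


# Standard 6-6 allocation vs. deep-encoder, shallow-decoder 12-1 allocation.
baseline = build_model(6, 6)
deep_shallow = build_model(12, 1)
print(f"6-6 parameters:  {count_parameters(baseline):,}")
print(f"12-1 parameters: {count_parameters(deep_shallow):,}")

# At inference time the encoder layers run once per source sentence, while the
# decoder layers run once per generated token; shifting capacity from the
# decoder to the encoder therefore cuts per-token decoding cost while keeping
# the total parameter count roughly comparable.
src = torch.rand(20, 1, 512)   # (source length, batch, d_model)
tgt = torch.rand(1, 1, 512)    # decoder input generated so far
out = deep_shallow(src, tgt)   # shape: (1, 1, 512)
```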
