Deep Encoder, Shallow Decoder: Reevaluating the Speed-Quality Tradeoff in Machine Translation

State-of-the-art neural machine translation models generate outputs autoregressively, where every step conditions on the previously generated tokens. This sequential nature causes inherent decoding latency. Non-autoregressive translation techniques, on the other hand, parallelize generation across positions and speed up inference at the expense of translation quality. Much recent effort has been devoted to non-autoregressive methods, aiming for a better balance between speed and quality. In this work, we re-examine the trade-off and argue that transformer-based autoregressive models can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a one-layer autoregressive decoder yields state-of-the-art accuracy with latency comparable to that of strong non-autoregressive models. Our findings suggest that the latency disadvantage of autoregressive translation has been overestimated due to a suboptimal choice of layer allocation, and we provide a new speed-quality baseline for future research toward fast, accurate translation.
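
As a concrete illustration of the layer-allocation idea, the sketch below builds a deep-encoder, shallow-decoder Transformer with PyTorch's nn.Transformer and compares its size to a standard 6-6 model. The 12-1 split, the 512-dimensional model, and the helper functions are illustrative assumptions for this sketch, not the paper's exact fairseq configuration or training setup.

```python
# Minimal sketch of the layer-allocation idea using PyTorch's nn.Transformer.
# The 12-1 split below is illustrative only; the paper's experiments use
# their own Transformer implementation and training details.
import torch
import torch.nn as nn


def build_model(num_encoder_layers, num_decoder_layers, d_model=512, nhead=8):
    """Build a Transformer with the given encoder/decoder depth split."""
    return nn.Transformer(
        d_model=d_model,
        nhead=nhead,
        num_encoder_layers=num_encoder_layers,
        num_decoder_layers=num_decoder_layers,
        dim_feedforward=2048,
    )


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


# Standard 6-6 allocation vs. deep-encoder, shallow-decoder 12-1 allocation.
baseline = build_model(6, 6)
deep_shallow = build_model(12, 1)
print(f"6-6 parameters:  {count_parameters(baseline):,}")
print(f"12-1 parameters: {count_parameters(deep_shallow):,}")

# At inference time the encoder layers run once per source sentence, while the
# decoder layers run once per generated token; shifting capacity from the
# decoder to the encoder therefore cuts per-token decoding cost while keeping
# the total parameter count roughly comparable.
src = torch.rand(20, 1, 512)   # (source length, batch, d_model)
tgt = torch.rand(1, 1, 512)    # decoder input generated so far
out = deep_shallow(src, tgt)   # shape: (1, 1, 512)
```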
