The Volctrans GLAT System: Non-autoregressive Translation Meets WMT21

This paper describes Volctrans' submission to the WMT21 news translation shared task for German->English translation. We build a parallel (i.e., non-autoregressive) translation system using the Glancing Transformer, which enables fast and accurate parallel decoding, in contrast to the currently prevailing autoregressive models. To the best of our knowledge, this is the first parallel translation system to be scaled to a practical scenario such as the WMT competition. More importantly, our parallel translation system achieves the best BLEU score (35.0) on the German->English translation task, outperforming all strong autoregressive counterparts.
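To make the contrast with autoregressive decoding concrete, the sketch below is a minimal, hypothetical PyTorch illustration (not the Volctrans GLAT implementation, and without glancing training): a toy parallel decoder emits every target position in a single forward pass, while a greedy autoregressive baseline needs one forward pass per emitted token, so its latency grows with the target length. All module and function names here are illustrative assumptions.

```python
# Sketch only: toy model, not the actual GLAT architecture or training scheme.
import torch
import torch.nn as nn

VOCAB, DIM, MAX_LEN = 1000, 64, 16

class ToyParallelDecoder(nn.Module):
    """Predicts every target position in a single forward pass (non-autoregressive)."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True), num_layers=2)
        self.out = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):
        # One pass over the whole sequence; every output position is produced at once.
        return self.out(self.encoder(self.embed(tokens)))  # (batch, length, vocab)

src = torch.randint(0, VOCAB, (2, MAX_LEN))  # dummy source batch
model = ToyParallelDecoder()

# Non-autoregressive decoding: a single forward pass yields the full hypothesis.
parallel_hyp = model(src).argmax(-1)  # (2, MAX_LEN)

# Autoregressive greedy baseline (toy stand-in): re-runs the model on the source
# plus the current prefix at every step, purely to illustrate the per-token latency;
# a real AR system would use a causal decoder with cached states.
def autoregressive_greedy(model, src, bos=1):
    hyp = torch.full((src.size(0), 1), bos, dtype=torch.long)
    for _ in range(MAX_LEN):
        logits = model(torch.cat([src, hyp], dim=1))[:, -1]      # predict next token only
        hyp = torch.cat([hyp, logits.argmax(-1, keepdim=True)], dim=1)
    return hyp

sequential_hyp = autoregressive_greedy(model, src)
print(parallel_hyp.shape, sequential_hyp.shape)
```

The point of the sketch is the decoding pattern: the parallel system calls the model once per sentence, whereas the autoregressive baseline calls it once per emitted token, which is the latency gap the Glancing Transformer is designed to close without sacrificing accuracy.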
