Very Deep Transformers for Neural Machine Translation

We explore the application of very deep Transformer models for Neural Machine Translation (NMT). Using a simple yet effective initialization technique that stabilizes training, we show that it is feasible to build standard Transformer-based models with up to 60 encoder layers and 12 decoder layers. These deep models outperform their baseline 6-layer counterparts by as much as 2.5 BLEU, and achieve new state-of-the-art benchmark results on WMT14 English-French (43.8 BLEU and 46.4 BLEU with back-translation) and WMT14 English-German (30.1 BLEU). The code and trained models will be publicly available at: this https URL.
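To make the reported configuration concrete, below is a minimal sketch of instantiating a standard Transformer with the depths stated in the abstract (60 encoder layers, 12 decoder layers) using PyTorch's built-in nn.Transformer. The model width, head count, and feed-forward size are assumptions chosen for illustration, and the stabilizing initialization technique the abstract refers to is not reproduced here; this is not the paper's released implementation.

```python
# Illustrative sketch only: a very deep Transformer with the depths from the
# abstract. Hyperparameters other than depth are assumed, not from the paper,
# and the paper's stabilizing initialization is not applied.
import torch
import torch.nn as nn

model = nn.Transformer(
    d_model=512,            # assumed model width
    nhead=8,                # assumed number of attention heads
    num_encoder_layers=60,  # encoder depth reported in the abstract
    num_decoder_layers=12,  # decoder depth reported in the abstract
    dim_feedforward=2048,   # assumed feed-forward size
    dropout=0.1,
    batch_first=True,
)

# Toy forward pass on random continuous inputs; token embeddings and the
# output projection are omitted for brevity.
src = torch.randn(2, 10, 512)  # (batch, source length, d_model)
tgt = torch.randn(2, 7, 512)   # (batch, target length, d_model)
out = model(src, tgt)
print(out.shape)               # torch.Size([2, 7, 512])
```

Without a depth-aware initialization or normalization scheme, training a stack this deep with the standard recipe is typically unstable, which is the problem the paper's initialization technique is designed to address.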
