The NiuTrans System for WNGT 2020 Efficiency Task

This paper describes the submissions of the NiuTrans Team to the WNGT 2020 Efficiency Shared Task. We focus on the efficient implementation of deep Transformer models (Wang et al., 2019; Li et al., 2019) using NiuTensor, a flexible toolkit for NLP tasks. We explore the combination of a deep encoder and a shallow decoder in Transformer models via model compression and knowledge distillation. Neural machine translation decoding also benefits from FP16 inference, attention caching, dynamic batching, and batch pruning. Our systems achieve promising results in both translation quality and efficiency: for example, our fastest system translates more than 40,000 tokens per second on an NVIDIA RTX 2080 Ti while maintaining 42.9 BLEU on newstest2018.
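
To make the decoding-side optimizations concrete, here is a minimal Python sketch of batch pruning during greedy decoding: sentences that have emitted the end-of-sentence token are dropped from the active batch, so later steps only compute over unfinished sentences. The `step` function, token ids, and tensor shapes are illustrative assumptions for this sketch, not NiuTensor APIs.

```python
import numpy as np

EOS = 2          # assumed end-of-sentence id (illustrative)
BOS = 1          # assumed beginning-of-sentence id (illustrative)
MAX_LEN = 16     # assumed decoding length limit


def step(encoder_states, prev_tokens):
    """Toy stand-in for one decoder step: returns one next-token id per
    active sentence. A real system would run the shallow Transformer
    decoder here, reusing cached self-attention keys/values (attention
    caching) instead of recomputing them at every step."""
    rng = np.random.default_rng(len(prev_tokens))
    # Randomly emit EOS so that sentences finish at different times.
    finished = rng.random(len(prev_tokens)) < 0.3
    return np.where(finished, EOS, rng.integers(3, 100, size=len(prev_tokens)))


def greedy_decode_with_pruning(encoder_states):
    batch_size = len(encoder_states)
    outputs = [[] for _ in range(batch_size)]
    active = np.arange(batch_size)       # indices of unfinished sentences
    prev = np.full(batch_size, BOS)

    for _ in range(MAX_LEN):
        if active.size == 0:
            break
        next_tokens = step(encoder_states[active], prev[active])
        for idx, tok in zip(active, next_tokens):
            outputs[idx].append(int(tok))
        prev[active] = next_tokens
        # Batch pruning: keep only sentences that did not just emit EOS,
        # so the next step runs on a smaller batch.
        active = active[next_tokens != EOS]
    return outputs


if __name__ == "__main__":
    # FP16 encoder states for a batch of 8 source sentences (toy shapes).
    fake_encoder_states = np.zeros((8, 10, 512), dtype=np.float16)
    for i, tokens in enumerate(greedy_decode_with_pruning(fake_encoder_states)):
        print(f"sentence {i}: {tokens}")
```

In practice this pruning is combined with dynamic batching (grouping sentences of similar length) so that the batch stays dense throughout decoding.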

[1] Bo Wang et al. OpenNMT System Description for WNMT 2018: 800 words/sec on a single-core CPU. NMT@ACL, 2018.

[2] Myle Ott et al. fairseq: A Fast, Extensible Toolkit for Sequence Modeling. NAACL, 2019.

[3] Jingbo Zhu et al. Learning Deep Transformer Models for Machine Translation. ACL, 2019.

[4] Jimmy Ba et al. Adam: A Method for Stochastic Optimization. ICLR, 2014.

[5] Jingbo Zhu et al. Neural Machine Translation with Joint Representation. AAAI, 2020.

[6] Geoffrey E. Hinton et al. Distilling the Knowledge in a Neural Network. arXiv, 2015.

[7] Kushal Datta et al. Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model. arXiv, 2019.

[8] Ashish Vaswani et al. Self-Attention with Relative Position Representations. NAACL, 2018.

[9] Geoffrey E. Hinton et al. Layer Normalization. arXiv, 2016.

[10] Lukasz Kaiser et al. Attention Is All You Need. NIPS, 2017.

[11] Quoc V. Le et al. The Evolved Transformer. ICML, 2019.

[12] Jingbo Zhu et al. Multi-layer Representation Fusion for Neural Machine Translation. COLING, 2018.

[13] Rico Sennrich et al. Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention. EMNLP, 2019.

[14] Matt Post. A Call for Clarity in Reporting BLEU Scores. WMT, 2018.

[15] Jingbo Zhu et al. Towards Fully 8-bit Integer Inference for the Transformer Model. IJCAI, 2020.

[16] Lukasz Kaiser et al. Universal Transformers. ICLR, 2018.

[17] Marcin Junczys-Dowmunt et al. From Research to Production and Back: Ludicrously Fast Neural Machine Translation. EMNLP, 2019.

[18] Wei Yi et al. Kernel Fusion: An Effective Method for Better Power Efficiency on Multithreaded GPU. IEEE/ACM Int'l Conference on Green Computing and Communications & Int'l Conference on Cyber, Physical and Social Computing, 2010.

[19] Andrew McCallum et al. Energy and Policy Considerations for Deep Learning in NLP. ACL, 2019.

[20] Philipp Koehn et al. Moses: Open Source Toolkit for Statistical Machine Translation. ACL, 2007.

[21] Rico Sennrich et al. Neural Machine Translation of Rare Words with Subword Units. ACL, 2015.

[22] Yann Dauphin et al. Pay Less Attention with Lightweight and Dynamic Convolutions. ICLR, 2019.

[23] Jingbo Zhu et al. The NiuTrans Machine Translation Systems for WMT19. WMT, 2019.

[24] Markus Freitag et al. Beam Search Strategies for Neural Machine Translation. NMT@ACL, 2017.

[25] Xing Shi et al. Speeding Up Neural Machine Translation Decoding by Shrinking Run-time Vocabulary. ACL, 2017.