The NiuTrans System for the WMT 2021 Efficiency Task

This paper describes the NiuTrans system for the WMT21 translation efficiency task. Following our last year's work, we explore various techniques that improve efficiency while maintaining translation quality. We investigate combinations of lightweight Transformer architectures and knowledge distillation strategies. We also improve translation efficiency with graph optimization, low-precision inference, dynamic batching, and parallel pre/post-processing. Putting these together, our system translates 247,000 words per second on an NVIDIA A100, 3× faster than our last year's submission. It is the fastest system and has the lowest memory consumption on the GPU-throughput track. The code, models, and pipeline will be available at NiuTrans.NMT.
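
To make one of these techniques concrete, the following is a minimal Python sketch of token-budget dynamic batching: length-sorted sentences are grouped so that each padded batch stays under a fixed token budget, which reduces padding waste and keeps batch sizes uniform. The function name dynamic_batches and the max_tokens parameter are illustrative assumptions for this sketch, not part of the NiuTrans.NMT implementation.

```python
# Minimal sketch of token-based dynamic batching (an illustration, not the
# NiuTrans.NMT implementation): sentences are sorted by length and grouped
# so that each batch, counted with padding, stays under a fixed token budget.

from typing import Iterator, List


def dynamic_batches(sentences: List[List[str]],
                    max_tokens: int = 4096) -> Iterator[List[int]]:
    """Yield batches of sentence indices whose padded size fits max_tokens."""
    # Sort by length so sentences within a batch need little padding.
    order = sorted(range(len(sentences)), key=lambda i: len(sentences[i]))
    batch: List[int] = []
    batch_max_len = 0
    for idx in order:
        length = len(sentences[idx])
        # Batch cost is counted with padding: batch size * longest sentence.
        if batch and (len(batch) + 1) * max(batch_max_len, length) > max_tokens:
            yield batch
            batch, batch_max_len = [], 0
        batch.append(idx)
        batch_max_len = max(batch_max_len, length)
    if batch:
        yield batch


# Usage: batches are decoded independently, and outputs are restored to the
# original sentence order afterwards using the returned indices.
if __name__ == "__main__":
    data = [f"sent {i}".split() * (i % 7 + 1) for i in range(100)]
    for b in dynamic_batches(data, max_tokens=256):
        print(len(b), max(len(data[i]) for i in b))
```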
