Efficient Machine Translation with Model Pruning and Quantization

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware, under both throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a Simpler Simple Recurrent Unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU tracks, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers on tensor cores. Some of our submissions optimize for model size via 4-bit log quantization and by omitting the lexical shortlist. We extended pruning to more parts of the network, emphasizing component- and block-level pruning, which actually improves speed, unlike coefficient-wise pruning.
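To make the 4-bit log quantization concrete, the sketch below is a minimal NumPy illustration and not the submission's implementation: it assumes a simple max-based scale per tensor (rather than a scale fitted to minimize quantization error) and maps each weight to a sign plus a 3-bit power-of-two exponent.

```python
import numpy as np

def log_quantize_4bit(w, eps=1e-12):
    """Illustrative 4-bit log quantization sketch: each weight becomes
    sign * scale * 2^(-k), where k fits in 3 bits (0..7)."""
    sign = np.sign(w)
    absw = np.abs(w)
    scale = absw.max()                     # assumption: simple max-based scale
    # nearest power-of-two exponent in the log domain, clipped to 3 bits
    k = np.clip(np.round(-np.log2(absw / scale + eps)), 0, 7)
    return sign * scale * (2.0 ** -k)      # dequantized weights

# usage sketch on a random weight matrix
w = np.random.randn(4, 4).astype(np.float32)
w_dequant = log_quantize_4bit(w)
```

Because each dequantized value is a shared per-tensor scale times a power of two, the weights can be stored as 4-bit codes plus one float scale, giving roughly an 8x size reduction relative to FP32.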
