Efficient 8-Bit Quantization of Transformer Neural Machine Language Translation Model

In this work, we quantize a trained Transformer machine translation model, leveraging the INT8/VNNI instructions in the latest Intel® Xeon® Cascade Lake processors to improve inference performance while keeping the accuracy drop below 0.5%. To the best of our knowledge, this is the first industry attempt to quantize the Transformer model. This work is significant because it demonstrates the complexities of quantizing a language translation model. We present novel quantization techniques implemented directly in TensorFlow that opportunistically replace 32-bit floating point (FP32) computations with 8-bit integer (INT8) computations and transform the FP32 computational graph accordingly. We also present a bin-packing parallel batching technique to maximize CPU utilization. Overall, our optimizations with INT8/VNNI deliver a 1.5X performance improvement over the best FP32 baseline. Furthermore, this work reveals the opportunities and challenges of accelerating quantized deep learning inference and establishes best practices for running inference efficiently on Intel CPUs.
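To illustrate the core idea behind replacing FP32 computations with INT8, the sketch below (a minimal NumPy sketch, not the authors' implementation; all names are illustrative) shows symmetric INT8 quantization: each tensor is scaled so its largest magnitude maps onto the INT8 range, the matrix product is accumulated in INT32 (as the VNNI instructions do in hardware), and the result is rescaled back to FP32.

    import numpy as np

    def quantize_int8(x):
        """Symmetrically quantize an FP32 tensor to INT8.

        Returns the INT8 tensor and the FP32 scale needed to dequantize.
        """
        # Map the largest magnitude to 127; guard against all-zero tensors.
        scale = max(np.abs(x).max(), 1e-12) / 127.0
        q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
        return q, scale

    def int8_matmul(a_fp32, w_fp32):
        """Replace an FP32 matmul with an INT8 one, dequantizing the result."""
        a_q, a_scale = quantize_int8(a_fp32)
        w_q, w_scale = quantize_int8(w_fp32)
        # Accumulate in INT32 (as VNNI does) to avoid overflow, then rescale.
        acc = a_q.astype(np.int32) @ w_q.astype(np.int32)
        return acc.astype(np.float32) * (a_scale * w_scale)

In a graph transformation like the one the paper describes, an optimizer pass would substitute such INT8 kernels for eligible FP32 matrix multiplications while leaving accuracy-sensitive operations in FP32.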

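The bin-packing parallel batching technique can be sketched in a similar hedged way. The following hypothetical first-fit-decreasing packer (an assumption about the general approach, not the authors' code) groups variable-length sentences into batches under a fixed padded-token budget, so that padding waste is minimized and the resulting batches can be dispatched to independent inference streams across CPU cores.

    def pack_batches(sentences, max_tokens_per_batch=4096):
        """First-fit-decreasing bin packing for variable-length sentences.

        Each sentence is a list of tokens; a batch's padded cost is
        (number of sentences) * (length of its longest sentence).
        """
        batches = []  # each entry: {"items": [...], "max_len": int}
        for sent in sorted(sentences, key=len, reverse=True):
            placed = False
            for batch in batches:
                max_len = max(batch["max_len"], len(sent))
                # Place the sentence in the first batch whose padded size
                # stays within the token budget.
                if (len(batch["items"]) + 1) * max_len <= max_tokens_per_batch:
                    batch["items"].append(sent)
                    batch["max_len"] = max_len
                    placed = True
                    break
            if not placed:
                batches.append({"items": [sent], "max_len": len(sent)})
        return [b["items"] for b in batches]

Sorting sentences by decreasing length before packing keeps each batch's maximum length stable as items are added, which is what bounds the padding overhead and keeps every core busy with similarly sized work.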