Neural Machine Translation with 4-Bit Precision and Beyond

Neural Machine Translation (NMT) is resource intensive. We design a quantization procedure to compress NMT models for deployment on devices with limited hardware capability. Because most neural network parameters are near zero, we employ logarithmic quantization in lieu of fixed-point quantization. We find that bias terms are less amenable to log quantization, but since they comprise only a tiny fraction of the model, we leave them uncompressed. We also propose an error-feedback mechanism for retraining, which treats the quantization error as a stale gradient so that it is preserved rather than lost. We show empirically that NMT models based on the Transformer or RNN architecture can be compressed to 4-bit precision without noticeable quality degradation. Models can be compressed down to binary precision, albeit with lower quality. The RNN architecture appears more robust to quantization than the Transformer.
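As a rough sketch of the two ideas above, the following Python/NumPy snippet illustrates one way logarithmic quantization and error-feedback retraining could be implemented. It is not taken from the paper: the function names, the use of the per-tensor absolute maximum as the scale, and the learning rate are illustrative assumptions.

```python
import numpy as np

def log_quantize(weights, bits=4):
    """Snap each weight to sign * scale * 2^(-k), storing only the sign and a
    small integer exponent k. Assumption: the scale is the absolute maximum of
    the tensor; bias vectors would simply bypass this function."""
    sign = np.sign(weights)
    scale = max(float(np.abs(weights).max()), 1e-12)
    num_levels = 2 ** (bits - 1)          # one bit is reserved for the sign
    ratio = np.clip(np.abs(weights) / scale, 1e-38, 1.0)
    k = np.clip(np.round(-np.log2(ratio)), 0, num_levels - 1)
    return sign * scale * 2.0 ** (-k)

def retrain_step(weights, residual, gradient, lr=1e-4, bits=4):
    """Error-feedback retraining step: the quantization residual from the
    previous step is added back before the update, so the error acts like a
    stale gradient instead of being discarded."""
    full_precision = weights + residual - lr * gradient
    quantized = log_quantize(full_precision, bits=bits)
    new_residual = full_precision - quantized
    return quantized, new_residual
```

Under this scheme a 4-bit weight stores one sign bit and a 3-bit exponent index, while the uncompressed bias terms and the per-tensor scale add only negligible overhead.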
