BiBERT: Accurate Fully Binarized BERT

The large pre-trained BERT has achieved remarkable performance on Natural Language Processing (NLP) tasks but is also computation- and memory-expensive. As one of the most powerful compression approaches, binarization drastically reduces computation and memory consumption by using 1-bit parameters and bitwise operations. Unfortunately, full binarization of BERT (i.e., 1-bit weights, embeddings, and activations) usually suffers a significant performance drop, and little prior work addresses this problem. In this paper, with theoretical justification and empirical analysis, we identify that the severe performance drop is mainly attributable to information degradation in the forward propagation and optimization direction mismatch in the backward propagation, and we propose BiBERT, an accurate fully binarized BERT, to eliminate these performance bottlenecks. Specifically, BiBERT introduces an efficient Bi-Attention structure that statistically maximizes representation information and a Direction-Matching Distillation (DMD) scheme to optimize the fully binarized BERT accurately. Extensive experiments show that BiBERT outperforms both the straightforward baseline and existing state-of-the-art quantized BERTs with ultra-low-bit activations by convincing margins on NLP benchmarks. As the first fully binarized BERT, our method yields impressive 56.3× and 31.2× savings in FLOPs and model size, demonstrating the vast advantages and potential of fully binarized BERT models in real-world resource-constrained scenarios.
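
To make the mechanics concrete, the sketch below shows what 1-bit weight and activation quantization with a straight-through gradient estimator typically looks like in PyTorch. It is a minimal, generic illustration of the binarization the abstract refers to, not BiBERT's actual Bi-Attention or DMD code; the class names, the clipped-identity backward, and the per-row weight scaling are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BinaryQuantizer(torch.autograd.Function):
    """Sign binarization with a straight-through estimator (STE) backward."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)  # values in {-1, 0, +1}; 0 only at exact zeros

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # STE: pass the gradient through only where |x| <= 1 (clipped identity).
        return grad_output * (x.abs() <= 1).to(grad_output.dtype)


class BinaryLinear(nn.Module):
    """Linear layer with 1-bit weights and activations (illustrative only)."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)

    def forward(self, x):
        # Binarize activations and weights; a per-output-channel scale
        # (mean |w|) keeps the output magnitude roughly matched.
        bx = BinaryQuantizer.apply(x)
        bw = BinaryQuantizer.apply(self.weight)
        scale = self.weight.abs().mean(dim=1, keepdim=True)
        return nn.functional.linear(bx, bw * scale)


if __name__ == "__main__":
    layer = BinaryLinear(768, 768)            # BERT-base hidden size
    out = layer(torch.randn(2, 128, 768))     # (batch, seq_len, hidden)
    print(out.shape)                          # torch.Size([2, 128, 768])
```

In a real deployment, the sign/scale products map to XNOR and popcount operations, which is where the FLOPs and model-size savings reported above come from.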
