GOBO: Quantizing Attention-Based NLP Models for Low Latency and Energy Efficient Inference

Attention-based models have demonstrated remarkable success in various natural language understanding tasks. However, efficient execution remains a challenge for these models, which are memory-bound due to their massive number of parameters. We present GOBO, a model quantization technique that compresses the vast majority (typically 99.9%) of the 32-bit floating-point parameters of state-of-the-art BERT models and their variants to 3 bits while maintaining their accuracy. Unlike other quantization methods, GOBO requires neither fine-tuning nor retraining to compensate for the quantization error. We present two practical hardware applications of GOBO. In the first, GOBO reduces memory storage and traffic and, as a result, inference latency and energy consumption. This GOBO memory compression mechanism is plug-in compatible with many architectures; we demonstrate it with the TPU, Eyeriss, and an architecture using Tensor Core-like units. Second, we present a co-designed hardware architecture that also reduces computation. Uniquely, the GOBO architecture maintains most of the weights in 3-bit form even during computation, a property that: (i) makes the processing elements area efficient, allowing us to pack more compute power per unit area, (ii) replaces most multiply-accumulations with additions, and (iii) reduces the off-chip traffic by amplifying on-chip memory capacity.
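The abstract does not spell out the quantization procedure, but the headline numbers suggest its shape: nearly all weights are mapped to 3-bit indices into a small shared dictionary, while the rare outliers (roughly 0.1%) stay in FP32, which is what lets accuracy survive without fine-tuning. The following is a minimal NumPy sketch of such outlier-aware dictionary quantization; the 3-sigma outlier rule, the k-means-style dictionary fitting, and all function names are illustrative assumptions, not the paper's exact method.

```python
import numpy as np

def gobo_quantize(weights, bits=3, outlier_sigma=3.0, iters=10):
    """Outlier-aware dictionary quantization in the spirit of GOBO (a sketch).

    The bulk of the (roughly Gaussian) weights is mapped to 2**bits shared
    centroids; weights far from the mean are kept as FP32 outliers.
    """
    w = weights.ravel().astype(np.float32)
    mean, std = w.mean(), w.std()

    # Assumed outlier rule: anything beyond outlier_sigma standard deviations.
    is_outlier = np.abs(w - mean) > outlier_sigma * std
    bulk = w[~is_outlier]

    # Fit a 2**bits-entry dictionary to the bulk with plain 1-D k-means.
    centroids = np.quantile(bulk, np.linspace(0.0, 1.0, 2 ** bits))
    for _ in range(iters):
        assign = np.abs(bulk[:, None] - centroids[None, :]).argmin(axis=1)
        for k in range(centroids.size):
            members = bulk[assign == k]
            if members.size:
                centroids[k] = members.mean()

    indices = np.abs(bulk[:, None] - centroids[None, :]).argmin(axis=1)
    # Storage: 3-bit indices + a tiny dictionary + FP32 outliers and positions.
    return (centroids.astype(np.float32), indices.astype(np.uint8),
            is_outlier, w[is_outlier])

def gobo_dequantize(centroids, indices, is_outlier, outliers, shape):
    """Rebuild an approximate FP32 tensor from the encoded form."""
    w = np.empty(is_outlier.size, dtype=np.float32)
    w[~is_outlier] = centroids[indices]
    w[is_outlier] = outliers
    return w.reshape(shape)
```

With bits=3 the per-weight cost drops from 32 bits to 3 bits plus a negligible dictionary and the rare FP32 outliers, which is where the memory storage and traffic savings of the first application come from.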
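The claim in (ii) that most multiply-accumulations become additions follows directly from the dictionary encoding: since almost every weight is one of only 2^3 = 8 shared values, a dot product can scatter-add activations into one accumulator per dictionary entry and defer all bulk multiplications to eight at the very end, with exact FP32 MACs reserved for the few outliers. A hedged sketch using the (assumed) encoding above:

```python
import numpy as np

def gobo_dot(acts, centroids, indices, is_outlier, outliers):
    """Dot product of an activation vector with a GOBO-encoded weight vector.

    sum_i a_i * w_i == sum_k c_k * (sum of the a_i whose weight maps to c_k)
                       + exact FP32 terms for the outliers.
    """
    acts = np.asarray(acts, dtype=np.float32)
    sums = np.zeros(centroids.size, dtype=np.float32)

    # Additions only: scatter-add each bulk activation into the accumulator
    # selected by its weight's 3-bit index (np.add.at handles repeated indices).
    np.add.at(sums, indices, acts[~is_outlier])

    # Just 2**bits multiplies for the entire bulk, then the rare outlier MACs.
    return float(sums @ centroids) + float(acts[is_outlier] @ outliers)
```

In hardware terms, a datapath built this way is mostly adders steered by 3-bit weight codes, with a few shared multipliers applied once per accumulation, which is the property the abstract credits for the area-efficient processing elements.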
