EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Transformer-based language models such as BERT provide significant accuracy improvements across a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization of multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), and high-density embedded non-volatile memories (eNVMs) in which the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system yields up to 7×, 2.5×, and 53× lower energy than conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
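
The control flow described above can be illustrated with a minimal Python sketch. The DVFS operating points, the entropy threshold, and the entropy-to-exit-layer mapping below are purely illustrative assumptions, not values or logic taken from the EdgeBERT design; the sketch only shows the idea that a confident (low-entropy) early prediction lets the controller pick a lower, energy-saving voltage-frequency point while still meeting the sentence's latency target.

```python
import math

# Hypothetical (voltage_V, freq_MHz, energy_per_layer_mJ, latency_per_layer_ms)
# operating points for the accelerator; values are illustrative only.
DVFS_POINTS = [
    (0.55, 200, 0.8, 4.0),
    (0.70, 500, 1.6, 1.6),
    (0.90, 1000, 3.5, 0.8),
]

def entropy(probs):
    """Shannon entropy of a classifier's softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_exit_layer(first_layer_probs, num_layers=12, entropy_threshold=0.4):
    """Toy proxy for early-exit predication: a confident (low-entropy)
    prediction at the first exit classifier suggests an early exit."""
    h = entropy(first_layer_probs)
    # Map entropy below the threshold to an estimated exit layer in [1, num_layers].
    frac = min(h / entropy_threshold, 1.0)
    return max(1, math.ceil(frac * num_layers))

def choose_dvfs(predicted_layers, target_latency_ms):
    """Pick the lowest-energy DVFS point whose projected sentence latency
    still meets the prescribed target."""
    feasible = [
        (v, f, e * predicted_layers, lat * predicted_layers)
        for (v, f, e, lat) in DVFS_POINTS
        if lat * predicted_layers <= target_latency_ms
    ]
    if not feasible:
        # Fall back to the fastest setting if the target cannot be met.
        v, f, e, lat = DVFS_POINTS[-1]
        return (v, f, e * predicted_layers, lat * predicted_layers)
    return min(feasible, key=lambda p: p[2])  # minimize projected energy

# Usage example: a confident first-layer prediction leads to a predicted
# early exit, which in turn allows a low-voltage, low-energy setting.
layers = predict_exit_layer([0.95, 0.05])
print(choose_dvfs(layers, target_latency_ms=20.0))
```
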
