EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference

Transformer-based language models such as BERT provide significant accuracy improvements across a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy on resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization of multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), and high-density embedded non-volatile memories (eNVMs) in which the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system yields up to 7×, 2.5×, and 53× lower energy than conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
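
The control flow described above can be illustrated with a minimal Python sketch. The DVFS operating points, the entropy threshold, and the entropy-to-exit-layer mapping below are purely illustrative assumptions, not values or logic taken from the EdgeBERT design; the sketch only shows the idea that a confident (low-entropy) early prediction lets the controller pick a lower, energy-saving voltage-frequency point while still meeting the sentence's latency target.

```python
import math

# Hypothetical (voltage_V, freq_MHz, energy_per_layer_mJ, latency_per_layer_ms)
# operating points for the accelerator; values are illustrative only.
DVFS_POINTS = [
    (0.55, 200, 0.8, 4.0),
    (0.70, 500, 1.6, 1.6),
    (0.90, 1000, 3.5, 0.8),
]

def entropy(probs):
    """Shannon entropy of a classifier's softmax output."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def predict_exit_layer(first_layer_probs, num_layers=12, entropy_threshold=0.4):
    """Toy proxy for early-exit predication: a confident (low-entropy)
    prediction at the first exit classifier suggests an early exit."""
    h = entropy(first_layer_probs)
    # Map entropy below the threshold to an estimated exit layer in [1, num_layers].
    frac = min(h / entropy_threshold, 1.0)
    return max(1, math.ceil(frac * num_layers))

def choose_dvfs(predicted_layers, target_latency_ms):
    """Pick the lowest-energy DVFS point whose projected sentence latency
    still meets the prescribed target."""
    feasible = [
        (v, f, e * predicted_layers, lat * predicted_layers)
        for (v, f, e, lat) in DVFS_POINTS
        if lat * predicted_layers <= target_latency_ms
    ]
    if not feasible:
        # Fall back to the fastest setting if the target cannot be met.
        v, f, e, lat = DVFS_POINTS[-1]
        return (v, f, e * predicted_layers, lat * predicted_layers)
    return min(feasible, key=lambda p: p[2])  # minimize projected energy

# Usage example: a confident first-layer prediction leads to a predicted
# early exit, which in turn allows a low-voltage, low-energy setting.
layers = predict_exit_layer([0.95, 0.05])
print(choose_dvfs(layers, target_latency_ms=20.0))
```
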
