RaPiD: AI Accelerator for Ultra-low Precision Training and Inference

The growing prevalence and computational demands of Artificial Intelligence (AI) workloads have led to the widespread use of hardware accelerators for their execution. Scaling the performance of AI accelerators across generations is pivotal to their success in commercial deployments. The intrinsic error resilience of AI workloads presents a unique opportunity for performance and energy improvement through precision scaling. Motivated by recent algorithmic advances in precision scaling for inference and training, we designed RaPiD, a 4-core AI accelerator chip supporting a spectrum of precisions, namely 16- and 8-bit floating point and 4- and 2-bit fixed point. The 36 mm² RaPiD chip, fabricated in 7nm EUV technology, delivers a peak 3.5 TFLOPS/W in HFP8 mode and 16.5 TOPS/W in INT4 mode at nominal voltage. Using a performance model calibrated to within 1% of measurement results, we evaluated DNN inference using a 4-bit fixed-point representation on a single-chip (4-core) RaPiD system and DNN training using an 8-bit floating-point representation on a 768-TFLOPS AI system comprising four 32-core RaPiD chips. Our results show that INT4 inference at batch size 1 achieves 3–13.5 TOPS/W (average 7 TOPS/W) and FP8 training with a mini-batch of 512 sustains 102–588 TFLOPS (average 203 TFLOPS) across a wide range of applications.
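As a rough illustration of the ultra-low-precision representations the abstract refers to, the sketch below shows plain symmetric INT4 quantization of a tensor in NumPy. This is a generic textbook scheme under assumed parameters (the function names and the fixed clipping value are illustrative), not RaPiD's actual quantization pipeline, which builds on techniques such as PACT-style clipping and hybrid FP8 formats described in the paper.

    import numpy as np

    def quantize_int4(x, clip):
        """Symmetric uniform quantization to signed 4-bit integer codes.

        Values are clipped to [-clip, clip] and mapped onto the levels
        {-7, ..., 7}; `scale` converts the integer codes back to reals.
        """
        scale = clip / 7.0  # signed 4-bit symmetric range uses -7 ... 7
        q = np.clip(np.round(x / scale), -7, 7).astype(np.int8)
        return q, scale

    def dequantize(q, scale):
        return q.astype(np.float32) * scale

    # Example: quantize a small weight tensor and check the reconstruction error.
    w = np.random.randn(4, 4).astype(np.float32)
    q, s = quantize_int4(w, clip=2.0)   # clip=2.0 is an arbitrary illustrative choice
    w_hat = dequantize(q, s)
    print("max abs error:", np.abs(w - w_hat).max())

In practice the clipping range would be learned or calibrated per layer rather than fixed, which is precisely where methods like PACT come in.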

Joel Silberman | Swagath Venkataramani | Matthew M. Ziegler | Mingu Kang | Moriyoshi Ohara | Vijayalakshmi Srinivasan | Mauricio J. Serrano | Monodeep Kar | Ashish Ranjan | Sanchari Sen | Shubham Jain | Sunil Shukla | Kailash Gopalakrishnan | Jungwook Choi | Jinwook Oh | Jinwook Jung | Chia-Yu Chen | Kazuaki Ishizaki | Leland Chang | Allison Allain | Nianzheng Cao | Wei Wang | Brian W. Curran | Vidhi Zalani | Bruce M. Fleischer | Alberto Mannari | Marcel Schaal | Ching Zhou | Kyu-Hyoun Kim | Ankur Agrawal | Zhibin Ren | Kerstin Schelm | Michael Guillorn | Howard Haynie | Eri Ogawa | James Bonanno | Robert Casatuta | Scot Rider | Naigang Wang | Xiao Sun | Jintao Zhang | Hoang Tran | Yulong Li | Hiroshi Inoue | Matthew Cohen | Siyu Koswatta | Sae Kyu Lee | Martin Lutz | Silvia Mueller | Michael Scheuermann | Jie Yang | Xin Zhang | Vinay Shah | Pong-Fei Lu
