An Energy-Efficient Sparse Deep-Neural-Network Learning Accelerator With Fine-Grained Mixed Precision of FP8–FP16

Recently, several hardware accelerators have been reported for deep-neural-network (DNN) acceleration; however, they focus only on inference rather than on DNN learning, which is a crucial ingredient for user adaptation at the edge device as well as for transfer learning with domain-specific data. DNN learning requires much heavier floating-point (FP) computation and memory access than DNN inference, so dedicated DNN learning hardware is essential. In this letter, we present an energy-efficient DNN learning accelerator core supporting CNN and FC learning as well as inference, with the following three key features: 1) fine-grained mixed precision (FGMP) of FP8-FP16; 2) compressed sparse DNN learning/inference; and 3) an input load balancer. As a result, energy efficiency is improved $1.76\times$ compared to sparse FP16 operation without any degradation of learning accuracy. The energy efficiency is $4.9\times$ higher than that of the NVIDIA V100 GPU, and the normalized peak performance is $3.47\times$ higher than that of a previous DNN learning processor.
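The abstract does not spell out how FGMP decides which values stay in FP8 and which fall back to FP16, so the NumPy sketch below is only a plausible illustration of the general idea, not the LNPU implementation. It assumes a toy 1-5-2 FP8 format (1 sign, 5 exponent, 2 mantissa bits) and a hypothetical relative-error threshold tol: each tensor element whose FP8 rounding error stays within tol (and whose exponent fits the FP8 range) is tagged FP8, the rest FP16. The format, the selection rule, and all parameter names here are assumptions.

import numpy as np

# Toy FP8 format assumed for illustration: 1 sign bit, 5 exponent bits,
# 2 mantissa bits. The accelerator's actual FP8 layout may differ.
FP8_MANTISSA_BITS = 2
FP8_EXP_MIN, FP8_EXP_MAX = -14, 15   # approximate frexp-exponent range

def quantize_fp8(x):
    """Round x to the nearest value representable in the toy FP8 format."""
    mant, exp = np.frexp(x)                   # x = mant * 2**exp, 0.5 <= |mant| < 1
    scale = 2.0 ** (FP8_MANTISSA_BITS + 1)    # implicit leading bit adds one bit
    mant_q = np.round(mant * scale) / scale   # drop precision below FP8's mantissa
    return np.ldexp(mant_q, exp)

def fgmp_split(x, tol=0.0):
    """Return a mask: True where FP8 suffices, False where FP16 is kept.
    An element stays in FP8 if its exponent fits the FP8 range and its FP8
    rounding error is within the relative tolerance tol (hypothetical rule)."""
    x8 = quantize_fp8(x)
    _, exp = np.frexp(x)
    in_range = (exp >= FP8_EXP_MIN) & (exp <= FP8_EXP_MAX)
    return in_range & (np.abs(x - x8) <= tol * np.abs(x))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
mask = fgmp_split(x, tol=0.05)
print(f"elements kept in FP8: {mask.mean():.0%}")

Note that under any such rule the FP8 fraction is data dependent, so hardware exploiting a fine-grained split must tolerate per-element precision decisions at run time.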
