SPRING: A Sparsity-Aware Reduced-Precision Monolithic 3D CNN Accelerator Architecture for Training and Inference

CNNs outperform traditional machine learning algorithms across a wide range of applications. However, their computational complexity makes it necessary to design efficient hardware accelerators. Most CNN accelerators focus on exploring dataflow styles that exploit computational parallelism, but the potential performance gains from sparsity have not been adequately addressed. The computation and memory footprint of CNNs can be significantly reduced if sparsity is exploited during network evaluation. To take advantage of sparsity, some accelerator designs explore sparsity encoding and evaluation on CNN accelerators. However, sparsity encoding is typically applied only to activations or weights, and only during inference. It has been shown that activations and weights also exhibit high sparsity levels during training; hence, sparsity-aware computation should be considered in training as well. To further improve performance and energy efficiency, some accelerators evaluate CNNs with limited precision. However, this is restricted to inference, since reduced precision sacrifices network accuracy when used naively in training. In addition, CNN evaluation is usually memory-intensive, especially during training. In this paper, we propose SPRING, a SParsity-aware Reduced-precision Monolithic 3D CNN accelerator for trainING and inference. SPRING supports both CNN training and inference. It uses a binary mask scheme to encode sparsity in activations and weights, and employs the stochastic rounding algorithm to train CNNs with reduced precision without accuracy loss. To alleviate the memory bottleneck in CNN evaluation, especially during training, SPRING uses an efficient monolithic 3D non-volatile memory (NVM) interface to increase memory bandwidth. Compared to the Nvidia GTX 1080 Ti GPU, SPRING achieves 15.6X, 4.2X, and 66.0X improvements in performance, power reduction, and energy efficiency, respectively, for CNN training, and 15.5X, 4.5X, and 69.1X improvements for inference.
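As background for the two techniques named above, the sketch below illustrates their basic mechanics in NumPy. It is a minimal, hypothetical example with names of our own choosing (encode_binary_mask, decode_binary_mask, stochastic_round), not SPRING's hardware datapath: the first part compresses a sparse tensor into a one-bit-per-element mask plus its nonzero values, and the second rounds values to a fixed-point grid stochastically so that the rounding error is zero in expectation, the property stochastic rounding relies on to preserve accuracy in low-precision training.

```python
import numpy as np

# Sketch 1: binary-mask sparsity encoding.
# One bit per element records which entries are nonzero; only the nonzero
# values themselves are stored, so storage and compute scale with density.
def encode_binary_mask(tensor):
    mask = tensor != 0
    return mask, tensor[mask]

def decode_binary_mask(mask, values):
    dense = np.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values
    return dense

# Sketch 2: stochastic rounding to a fixed-point grid with `frac_bits`
# fractional bits. Each value is rounded up with probability equal to its
# distance from the lower grid point, so the expected rounding error is zero.
def stochastic_round(x, frac_bits, rng=np.random.default_rng()):
    scale = 2.0 ** frac_bits
    scaled = np.asarray(x, dtype=np.float64) * scale
    floor = np.floor(scaled)
    round_up = rng.random(scaled.shape) < (scaled - floor)
    return (floor + round_up) / scale

# Usage: encode sparse activations, then quantize a small weight update.
acts = np.array([0.0, 1.3, 0.0, 0.0, 2.7, 0.0], dtype=np.float32)
mask, vals = encode_binary_mask(acts)
assert np.allclose(decode_binary_mask(mask, vals), acts)
print(stochastic_round([0.0039, -0.0071, 0.0002], frac_bits=8))
```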
